Stable a posteriori LES of 2D turbulence using convolutional neural networks: Backscattering analysis and generalization to higher Re via transfer learning
Yifei Guan, Ashesh Chattopadhyay, Adam Subel, Pedram Hassanzadeh
Department of Mechanical Engineering, Rice University, Houston, TX 77005, United States
Department of Earth, Environmental and Planetary Sciences, Rice University, Houston, TX 77005, United States
Abstract
There is a growing interest in developing data-driven subgrid-scale (SGS) models for large-eddy simulation (LES) using machine learning (ML). In a priori (offline) tests, some recent studies have found ML-based data-driven SGS models that are trained on high-fidelity data (e.g., from direct numerical simulation, DNS) to outperform baseline physics-based models and accurately capture the inter-scale transfers, both forward (diffusion) and backscatter. While promising, instabilities in a posteriori (online) tests and inabilities to generalize to a different flow (e.g., with a higher Reynolds number, Re) remain as major obstacles in broadening the applications of such data-driven SGS models. For example, many of the same aforementioned studies have found instabilities that required often ad-hoc remedies to stabilize the LES at the expense of reducing accuracy. Here, using 2D decaying turbulence as the testbed, we show that deep fully convolutional neural networks (CNNs) can accurately predict the SGS forcing terms and the inter-scale transfers in a priori tests, and if trained with enough samples, lead to stable and accurate a posteriori LES-CNN. Further analysis attributes these instabilities to the disproportionately lower accuracy of the CNNs in capturing backscattering when the training set is small. We also show that transfer learning, which involves re-training the CNN with a small amount of data (e.g., 1%) from the new flow, enables accurate and stable a posteriori
LES-CNN for flows with 16× higher Re (as well as higher grid resolution if needed). These results show the promise of CNNs with transfer learning to provide stable, accurate, and generalizable LES for practical use.

1 Introduction

Accurate simulations of turbulent flows are of critical importance for predicting and understanding various engineering and natural systems. However, the direct numerical simulation (DNS) of the Navier-Stokes equations remains computationally prohibitive for many real-world applications because DNS requires resolving (i.e., directly solving for) all the relevant spatial and temporal scales. These scales might span several orders of magnitude, e.g., from the domain length down to the Kolmogorov scale [62, 81]. Large-eddy simulation (LES) offers a balance between accuracy and computational cost, since in LES, only the part of the inertial range containing the large-scale structures is resolved on a coarse-resolution grid and the effects of the subgrid-scale (SGS) eddies are parameterized, in terms of the resolved flow, using a SGS model [81, 87]. As a result, the quality of the solutions from LES highly depends on the quality of the SGS model. Consequently, formulating accurate SGS models for LES has been an active area of research for the past few decades in different disciplines [e.g., 18, 58, 83, 87, 90, 91]. Below, we briefly describe some of the key physics-based SGS models and their major shortcomings, which have motivated the recent interest in using machine learning (ML) to find data-driven SGS models. Then, we discuss some of the advances in data-driven SGS modeling as well as the main challenges, some of which we aim to address in this paper.

In his pioneering work on developing one of the first global climate models, Smagorinsky proposed a physics-based SGS model for LES in 1963 [95].
In this model (SMAG, hereafter), effects of the SGS eddies are parameterized as a function of the resolved flow using a scale-selective dissipative model that consists of a positive eddy viscosity ν_e and second-order diffusion. Since then, the SMAG model and its variants have been widely used in different disciplines, for example, to simulate weather and climate variability, combustion, multiphase flows, wind farms, and magnetohydrodynamics [e.g., 1, 26, 27, 40, 77, 80, 86, 97, 102]. Such purely diffusive SGS models lead to numerically stable LES; however, they might not correctly capture the inter-scale physical processes such as energy (and enstrophy) transfers. These models often include second-order dissipation, but higher orders can also contribute to the forward transfer, i.e., transfer from the resolved scales to the subgrid scales [58]. Furthermore, while in the mean the transfer is forward and the role of SGS processes is indeed dissipative, it is known that locally there can be transfer from the subgrid scales to the resolved scales. This process is referred to as backscattering, and it is missing from purely diffusive SGS models [81].

Backscattering has been found to play a significant role in various fluid flows, and extensive work has been done in different disciplines to account for it in physics-based SGS models [e.g., 9, 19, 38, 39, 47, 53, 66, 92, 103, 121]. For example, Piomelli et al. [78, 79] showed that the lack of energy backscattering in LES could lead to inaccurate prediction of the perturbation growth in transitional wall-bounded flows. Backscattering has also been found to be important in geophysical turbulence, which has implications for modeling atmospheric and oceanic circulations and weather/climate predictions [6, 32, 33, 35, 64, 85, 93]. To improve the SMAG model and account for backscattering, Germano et al.
[30] developed a dynamic approach to compute the eddy viscosity, which could lead to ν_e < 0 and thus, in principle, capture backscattering. However, a negative ν_e can cause numerical instabilities in this dynamic SMAG model (DSMAG, hereafter), and in practice, positive clipping is often used to enforce ν_e ≥ 0 (see Section 3.4). Another line of work has aimed to estimate the parameters of physics-based SGS models, e.g., ν_e, from high-fidelity data (e.g., DNS or observations) [22, 57, 89, 91, 96, 112]. Alternatively, a growing number of recent papers have aimed to learn a data-driven SGS model from high-fidelity data, often in a non-parametric fashion, i.e., without any prior assumption about the model's structural/functional form [e.g., 28, 29, 36, 50, 70, 74, 82, 88, 101, 107, 108]. In the studies from the latter category that focused on LES, a variety of canonical fluid systems and different approaches (e.g., local vs. non-local) have been investigated. In the local approach, which often employs multilayer perceptron artificial neural networks (ANNs), the SGS term (stress tensor or its divergence) at a grid point is estimated in terms of the resolved flow at or around the same grid point. For example, Maulik et al. [56] and Xie et al. [113, 114] have, respectively, studied 2D decaying homogeneous isotropic turbulence (2D-DHIT) and 3D incompressible and compressible turbulence using this approach (also, see [116]). In the non-local approach, which often employs variants of convolutional neural networks (CNNs), the SGS term over the entire domain is estimated in terms of the resolved flow in the entire domain to account for potential spatial correlations (e.g., due to coherent structures) and non-homogeneities in the system. For example, Zanna and Bolton [7, 120], Beck and colleagues [4, 44], Pawar et al. [75], and Subel et al. [99] have used this approach for ocean circulation, 3D-DHIT, 2D-DHIT, and forced 1D Burgers' turbulence, respectively.

In a priori (offline) tests, in which the accuracy of the SGS model in estimating the SGS term as a function of the resolved flow is evaluated, some of these studies have found the data-driven SGS models to accurately account for inter-scale transfers (including backscattering) and outperform physics-based models such as SMAG and DSMAG [7, 56, 75, 120, 122].
However, most of the same studies have also found that in a posteriori (online) tests, in which the data-driven SGS model is coupled with a coarse-resolution numerical solver, the LES model is unstable, leading to numerical blow-up or physically unrealistic flows [4, 5, 44, 56, 98, 115, 120, 122]. While the reason(s) for these instabilities remain unclear, a number of remedies have been proposed, e.g., post-processing of the trained SGS model to remove backscattering or to attenuate the SGS feedback into the numerical solver, or combining the data-driven model with an eddy viscosity model [4, 56, 120, 122] (also, see the excellent review by Beck and Kurz [5]). However, such remedies include ad-hoc components and often substantially take away the advantages gained from the non-parametric, data-driven approach.

Instabilities in a posteriori tests remain a major challenge to broadening the applications of ML-based data-driven SGS models for LES. Another major challenge is the generalization capability of the data-driven SGS models beyond the flow from which the training data are obtained, e.g., extrapolation to turbulent flows with higher Reynolds numbers (Re). The ability to generalize is important for at least two reasons: i) high-fidelity data from usually expensive simulations (e.g., DNS) are needed to train data-driven SGS models, and given the sharp increase in the computational cost of DNS with Re, the ability to effectively extrapolate to higher Re makes data-driven SGS models much more useful in practice; and ii) some level of generalization capability in the data-driven SGS models is essential for the LES models to be robust and trustworthy. However, it is known that such extrapolations are challenging for neural networks in general [43]. In LES modeling, a priori tests in 3D-DHIT have shown that the performance of data-driven SGS models degrades when applied to Re higher than the one for which the model is trained.
In a posteriori tests with multi-scale Lorenz 96 systems [16] and forced 1D Burgers' turbulence [99], we found inaccurate generalization to more chaotic systems or flows with higher Re, particularly in terms of short-term prediction and reproducing the long-term statistics of rare events. However, in both studies, we also found that transfer learning, which involves re-training (part of) the already trained neural network using a small amount of data from the new system [118], enables accurate generalization, e.g., to 10× higher Re [16, 99]. While promising, the effectiveness of transfer learning in enabling generalization in more complex turbulent flows needs to be investigated.

Building on these earlier studies, here we use a deep fully CNN architecture to build a non-local data-driven SGS model for a 2D-DHIT system using DNS data, and aim to

(a) Examine the accuracy of this SGS model in a priori (offline) tests, with regard to both predicting the SGS terms and capturing inter-scale transfers,

(b) Evaluate the accuracy and stability of LES with this SGS model (LES-CNN) in a posteriori (online) tests, both in terms of short-term predictions and long-term statistics,

(c) Assess the effectiveness of transfer learning in enabling accurate and stable generalization of LES-CNN to higher Re (up to 16×). We also show generalization to higher grid resolutions by adding an encoder-decoder architecture to the CNN.

For (a) and (b), we also present results from the SMAG and DSMAG models as well as a local ANN-based data-driven SGS model.

The remainder of this paper is structured as follows. Governing equations of the 2D-DHIT system, the filtered equations, and the DNS and LES numerical solvers are presented in Section 2, followed by descriptions of the data-driven SGS models (training data and the CNN and ANN architectures) and the physics-based SGS models (SMAG and DSMAG) in Section 3.
Results of the a priori and a posteriori tests as well as generalization to higher Re and/or resolutions via transfer learning are presented in Section 4. Conclusions and future work are discussed in Section 5.

2 Governing equations, DNS, and LES

The dimensionless governing equations of 2D-DHIT in the vorticity (ω) and streamfunction (ψ) formulation in a doubly periodic x-y domain are

∂ω/∂t + N(ω, ψ) = (1/Re) ∇²ω,   (1a)
∇²ψ = −ω,   (1b)

where the nonlinear term N(ω, ψ) represents advection:

N(ω, ψ) = (∂ψ/∂y)(∂ω/∂x) − (∂ψ/∂x)(∂ω/∂y).   (2)

2D turbulence is a fitting prototype for many large-scale geophysical and environmental flows (where rotation and/or stratification dominate) and has been widely used as a testbed for novel techniques, including ML-based SGS modeling [e.g., 10, 100, 103-105]. In DNS, as discussed in detail in Section 2.2, Eqs. (1a)-(1b) are numerically solved at high spatio-temporal resolutions.

To find the equations for LES, we apply filtering (denoted by an overbar and defined later) to Eqs. (1a)-(1b) to obtain

∂ω̄/∂t + N(ω̄, ψ̄) = (1/Re) ∇²ω̄ + Π,   (3a)
∇²ψ̄ = −ω̄,   (3b)

where Π = N(ω̄, ψ̄) − \overline{N(ω, ψ)} is the SGS forcing term. Note that in deriving these equations, we assume that the filter commutes with the spatial (and temporal) derivative operators, which is the case for commonly used filters such as box, sharp spectral, and Gaussian filters [81, 86]; the latter is used in this work (see Section 3.1). As discussed in Section 2.2, the numerical solution of Eqs. (3a)-(3b) requires spatio-temporal resolutions lower than those of the DNS. However, the SGS forcing term, Π, includes the effects of the small-scale eddies that have been truncated due to filtering/coarse-graining and are not resolved in LES. As a result, Π has to be estimated solely based on the resolved variables (ω̄, ψ̄) to close Eqs. (3a)-(3b), a problem that is at the heart of turbulence modeling [81].

In most physics-based models, such as those using eddy viscosity, Π is modeled as a purely diffusive process (SMAG and DSMAG are described in Section 3.4).
In data-driven approaches, such as the one pursued here and discussed in Sections 3.2 and 3.3, the aim is to learn the relationship between (ω̄, ψ̄) and Π in DNS data using methods such as deep neural networks, without any prior assumptions about the functional form of this relationship.

2.2 DNS and LES numerical solvers

For DNS, we solve Eqs. (1a)-(1b) in a doubly periodic square domain with L × L = [0, 2π] × [0, 2π]. A Fourier-Fourier pseudo-spectral solver is used along with second-order Adams-Bashforth and Crank-Nicolson time-integration schemes for the advection and viscous terms, respectively. The computational grid has uniform spacing ∆_DNS = 2π/N_DNS, where N_DNS is the number of grid points in each direction. We use N_DNS = 2048 for Re = 8000, 16000, 32000, and 64000, and N_DNS = 3072 for Re = 128000. The time-stepping size ∆t_DNS = 1 × 10⁻⁴ (∆t_DNS = 5 × 10⁻⁵) is used for N_DNS = 2048 (N_DNS = 3072). Following Refs. [55, 56], the initial condition of each DNS run is a random vorticity field but with the same prescribed energy spectrum (see Appendix A for details). For each of the Re mentioned above, we conducted 15 independent DNS runs from random initial conditions.

The numerical solver is implemented in Python using CUDA GPU computing. We use a number of GPU blocks equal to the resolution in each direction, such that one GPU thread in each block is assigned to the computation on one computational grid point. The fast Fourier transform (FFT) and inverse fast Fourier transform (iFFT) operations are performed using the cuFFT library. Double-precision floating-point arithmetic is used for all numerical solvers.

Figure 1 shows an example of the vorticity field for Re = 32000 at the initial condition (t = 0), and at t = 50τ and t = 200τ, where τ = 1/|ω|_max = 0.02 = 200∆t_DNS (|ω|_max is computed at t = 0). After around 50τ, the turbulent kinetic energy (TKE) spectrum Ê(k) exhibits self-similarity. Note that the TKE spectrum is calculated using an angle average and therefore k = (k_x² + k_y²)^(1/2).

For LES, we solve Eqs. (3a)-(3b) using the same numerical solver used for DNS, except that the spatial resolution is lower by a factor of 8 in each direction (i.e., N_LES = N_DNS/8 and ∆_LES = 8∆_DNS) and the time-stepping size is 10 times larger, ∆t_LES = 10∆t_DNS. As a result, the LES solver requires 640 times fewer degrees of freedom, which substantially reduces the computational cost. However, the LES solver needs a SGS model for Π. Here, we use two data-driven models that employ CNN and ANN as well as two common physics-based models (SMAG and DSMAG). In the next section, we first describe the filtered DNS (FDNS) data, which are used for training the data-driven SGS models, and then describe the CNN, ANN, SMAG, and DSMAG models.
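The pseudo-spectral machinery underlying both solvers is compact; below is a minimal numpy sketch (an illustration only, not the paper's CUDA/cuFFT implementation; a 2π × 2π doubly periodic domain and integer wavenumbers are assumed) of evaluating the advection term N(ω, ψ) of Eq. (2) and inverting the Poisson equation (1b):

```python
import numpy as np

def wavenumbers(n):
    """Integer wavenumbers (kx, ky) for an n x n grid on a 2*pi x 2*pi domain."""
    k = np.fft.fftfreq(n, d=1.0 / n)  # [0, 1, ..., n/2-1, -n/2, ..., -1]
    return np.meshgrid(k, k, indexing="ij")

def poisson_solve(omega):
    """Invert del^2 psi = -omega (Eq. 1b) spectrally; psi is zero-mean."""
    kx, ky = wavenumbers(omega.shape[0])
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0  # avoid 0/0; the mean mode is zeroed below
    psi_hat = np.fft.fft2(omega) / k2
    psi_hat[0, 0] = 0.0
    return np.real(np.fft.ifft2(psi_hat))

def advection(omega, psi):
    """N(omega, psi) = psi_y * omega_x - psi_x * omega_y (Eq. 2)."""
    kx, ky = wavenumbers(omega.shape[0])
    w_hat, p_hat = np.fft.fft2(omega), np.fft.fft2(psi)
    ddir = lambda f_hat, kk: np.real(np.fft.ifft2(1j * kk * f_hat))
    return ddir(p_hat, ky) * ddir(w_hat, kx) - ddir(p_hat, kx) * ddir(w_hat, ky)
```

For ψ = sin x sin y (so that ω = 2 sin x sin y), N(ω, ψ) vanishes identically, which provides a quick sanity check; a production solver would also de-alias the nonlinear product (e.g., with the 2/3 rule) and add the time integrators described above.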
Figure 1: An example of the vorticity field of DNS for Re = 32000 at (a) t = 0, (b) t = 50τ, and (c) t = 200τ. The initial turbulent kinetic energy (TKE) spectrum is prescribed while the vorticity field has random phase (see Appendix A). Data collection for training of data-driven SGS models (using CNN or ANN) starts from t = 50τ and ends at t = 200τ. As shown in (d), in this period, the TKE spectra exhibit self-similarity following the k⁻³ scaling of the Kraichnan-Batchelor-Leith (KBL) theory [3, 41, 46].

3 Data-driven and physics-based SGS models for LES
3.1 Filtered DNS (FDNS) data

To compute the filtered DNS variables on the LES grid, which as mentioned above is 8× coarser than the DNS grid in each direction, we i) apply the Gaussian filter transfer function to the DNS data, and ii) coarse-grain the filtered results to the LES grid [81, 120]. Below, the subscript "DNS" denotes the high-resolution DNS grid and "LES" denotes the coarse-resolution LES grid.

Using vorticity as an example, we first transform the DNS vorticity field ω(r_DNS) into the spectral space ω̂(k_DNS), where r = (x, y) and k = (k_x, k_y). Then, we apply the Gaussian filter in the spectral space:

˜ω̂(k_DNS) = G(k_DNS) ⊙ ω̂(k_DNS),   (4)

where the operator ⊙ means element-wise multiplication of matrices and ˜(·) denotes the filtered variable at the DNS resolution. The transfer function of the Gaussian filter is [81]:

G(k_DNS) = exp(−|k_DNS|² ∆_F² / 24),   (5)

where ∆_F is the filter size, which is taken to be ∆_F = 2∆_LES to yield sufficient resolution [81, 122]. After the filtering operation, coarse-graining is performed to transform the filtered solution from the DNS to the LES grid [81, 120]:

ω̄̂(k_LES) = ˜ω̂(|k_x| < k_c, |k_y| < k_c),   (6)

where k_c = π/∆_LES is the cut-off wavenumber in spectral space, and we use an overbar to denote the filtered and then coarse-grained variables (hereafter, we use the term "filtered" for both "filtered" and then "coarse-grained" when there is no ambiguity). ψ̄̂(k_LES) and Π̂(k_LES) are similarly computed following Eqs. (4)-(6).

Note that in addition to the Gaussian filter, box and sharp Fourier filters are also commonly used for LES. However, the Gaussian filter is compact in both physical and spectral spaces [81]. Because our numerical solver is in the Fourier spectral space and our CNN and ANN operate in the physical space, we focus on the Gaussian filter for LES. Furthermore, Zhou et al. [122] found that the Gaussian filter outperforms the other two filters in terms of correlation coefficients of Π in their work on data-driven SGS modeling of 3D turbulence.

Figure 2 shows examples of the Π term and the effects of filtering on the vorticity field in physical space and on the TKE spectrum Ê(k). The fine structures in the DNS vorticity ω are lost in the filtered vorticity ω̃ and manifest themselves in the SGS vorticity ω′ = ω − ω̃ and the SGS forcing term Π (panels (c)-(d)). The Ê(k) spectrum further shows the effects of the Gaussian filter on the energy at smaller scales (panel (e)): the Gaussian filter leads to the deviation of the FDNS spectrum from the DNS spectrum, especially at the scales near k_c.

Figure 2: Examples showing the effects of filtering. (a) DNS vorticity ω, (b) filtered vorticity ω̃, (c) SGS vorticity ω′ = ω − ω̃, (d) SGS forcing term Π, and (e) TKE spectrum for Re = 32000 at the end of one of the DNS runs (t = 200τ). Panel (e) also shows the transfer function of the Gaussian filter and the cutoff wavenumber, k_c. The FDNS spectrum deviates from the DNS spectrum near k_c because of the filtering.

Our goal is to non-parametrically learn Π as a function of the FDNS variables ω̄ and ψ̄ using a deep fully CNN as well as an ANN used in a previous study [56].

3.2 Convolutional neural network (CNN)

For non-local data-driven SGS modeling, we propose to use a deep fully CNN. The CNN architecture was originally developed for computer vision and image processing, and its key feature is that rather than having pre-defined filters, CNNs learn the filters used for pattern recognition for a given data set [31, 42, 45]. CNNs have often been found superior to ANNs when the data contain spatial patterns and structures significant to the functional relationship to be learned [21, 76]. Therefore, it is not surprising that CNNs have been found to perform well, usually superior to non-convolutional ML methods, in applications involving turbulent flows, given the abundance of coherent structures and spatial correlations in turbulence [e.g., 4, 7, 12, 15, 60, 75]. Specifically for SGS modeling, a recent a priori analysis has shown that a CNN outperforms a local ANN in terms of prediction accuracy of the SGS stress term in the same 2D-DHIT system studied here [75].

Building on previous work and to account for non-local effects (e.g., coherent structures and spatial correlations), we use a CNN with inputs/outputs that are global (i.e., from the entire domain). Thus, the input features are

(ψ̄/σ_ψ, ω̄/σ_ω) ∈ R^(2 × N_LES × N_LES),   (7)

and the output targets are

(Π/σ_Π) ∈ R^(N_LES × N_LES),   (8)

where σ is the standard deviation of the corresponding variable calculated over all training samples. We aim to use a CNN to learn M, an optimal map between the inputs and outputs,

M : {ψ̄/σ_ψ, ω̄/σ_ω} ∈ R^(2 × N_LES × N_LES) → {Π/σ_Π} ∈ R^(1 × N_LES × N_LES),   (9)
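The FDNS pipeline of Eqs. (4)-(6) amounts to a spectral multiply followed by a spectral truncation; a minimal numpy sketch follows (an illustration only; a 2π-periodic domain is assumed, and the Gaussian transfer function exp(−|k|²∆_F²/24) is the standard form adopted here):

```python
import numpy as np

def gaussian_filter_coarsen(field_dns, n_les, delta_f=None):
    """Filter a doubly periodic DNS field with the spectral Gaussian transfer
    function (Eq. 5) and coarse-grain to the LES grid by spectral truncation
    (Eq. 6). Domain size 2*pi assumed; delta_f defaults to 2 * Delta_LES."""
    n_dns = field_dns.shape[0]
    if delta_f is None:
        delta_f = 2.0 * (2.0 * np.pi / n_les)  # Delta_F = 2 * Delta_LES
    k = np.fft.fftfreq(n_dns, d=1.0 / n_dns)  # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    f_hat = np.exp(-(kx**2 + ky**2) * delta_f**2 / 24.0) * np.fft.fft2(field_dns)
    kc = n_les // 2  # cutoff: keep modes |kx|, |ky| < kc
    keep = np.r_[0:kc, n_dns - kc:n_dns]  # positive then negative modes
    f_hat_les = f_hat[np.ix_(keep, keep)] * (n_les / n_dns) ** 2  # renormalize
    return np.real(np.fft.ifft2(f_hat_les))
```

A single Fourier mode cos(3x) is attenuated by exactly exp(−9∆_F²/24) and survives the cutoff whenever n_les ≥ 8, which makes the function easy to verify.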
Figure 3: Schematic of the CNN. Inputs and outputs are samples of the normalized (ψ̄, ω̄) and Π, respectively. The convolutional layers (Conv layers) have the same dimensions (256 × 256 × 64), with kernels of size 5 × 5. The activation function of each layer is ReLU (rectified linear unit) except for the last one, which is a linear map.

The map M is learned by minimizing the mean-squared error (MSE)

MSE = (1/n_tr) Σ_{i=1}^{n_tr} ‖Π_i^CNN − Π_i^FDNS‖²,   (10)

where n_tr is the number of training samples and ‖·‖ is the L₂ norm.

Figure 3 schematically shows the CNN architecture that is used here. We use the mini-batch stochastic gradient descent method with the Adam optimizer to minimize the loss function, Eq. (10). Note that the CNN has no pooling or upsampling layers (i.e., it is a fully CNN), so the hidden layers have the same size as the input and output layers. We have found that using a fully CNN (i.e., without any up/down-sampling) is a key to training an accurate SGS model, consistent with earlier findings that pooling layers may artificially change spatial correlations of the data [15].

Hyper-parameters such as the number of hidden layers have been determined via an extensive search. We find that to capture the complex pattern of Π, a deep CNN with 10 hidden layers is needed. For example, the 10-layer CNN outperforms shallower 8-layer and 5-layer CNNs in terms of training loss for the same n_tr. Overall, the CNN with 10 layers has 927041 trainable parameters.

The training, validation, and testing sets are generated using 2D snapshots of filtered data collected from 15 independent DNS runs with random initial conditions, sampled every 10∆t_DNS, in the time interval [50τ, 200τ]. We use the data from 8 runs for the training set, 2 runs for the validation set, and 5 runs for the testing set. The effects of the size of the training dataset on the accuracy of the SGS model are further discussed in Section 4.1.

As is the common practice in ML applications, we run the CNN (and the ANN) with single-precision floating-point operations during both training and testing to accelerate the process and reduce the data transfer/storage.
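The stated parameter count is consistent with a fully convolutional stack of 5 × 5 kernels: 2 input channels (normalized ψ̄ and ω̄), 10 hidden layers of 64 filters, and a linear output layer to 1 channel. The per-layer widths are inferred from the text rather than stated as a list, so treat them as an assumption; the arithmetic, however, reproduces the quoted total exactly:

```python
def conv2d_params(c_in, c_out, k=5):
    """Trainable parameters of one conv layer: k*k kernel weights per
    (input channel, output channel) pair, plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

# 2 -> 64 input conv, nine 64 -> 64 hidden-to-hidden convs, 64 -> 1 output conv
total = conv2d_params(2, 64) + 9 * conv2d_params(64, 64) + conv2d_params(64, 1)
print(total)  # 927041, matching the trainable-parameter count quoted above
```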
We have also explored training/testing a CNN with double-precision floating-point arithmetic, but found no distinguishable enhancement in the a posteriori tests. Finally, we point out that the codes for the CNN and the CNN with transfer learning (discussed later) are made publicly available on GitHub (see the Acknowledgement for details).

3.3 Multilayer perceptron artificial neural network (ANN)

A few recent studies have proposed building local data-driven SGS models using ANNs trained to learn the mapping between a local stencil of input variables and the local SGS term Π [56, 114, 116, 122]. For example, Maulik et al. [56] employed such an approach for the same 2D-DHIT system and proposed to train an ANN with inputs consisting of 9 grid-stencil values of ω̄ and ψ̄ plus the local values of |S̄| and |∇ω̄|, and the output consisting of the local value of the SGS term Π:

M : {ω̄_{i,j}, ω̄_{i,j+1}, ..., ω̄_{i−1,j−1}, ψ̄_{i,j}, ψ̄_{i,j+1}, ..., ψ̄_{i−1,j−1}, |S̄|_{i,j}, |∇ω̄|_{i,j}} ∈ R²⁰ → {Π_{i,j}} ∈ R¹,   (11)

where (i, j) here denotes a local grid point, |S̄| is the characteristic filtered rate of strain [81], and |∇ω̄| = [(∂ω̄/∂x)² + (∂ω̄/∂y)²]^(1/2).

We have closely followed Ref. [56] in building a local data-driven SGS model. For the ANN, we use their publicly available code. The ANN is fully connected with 2 hidden layers, each containing 50 neurons. The network has 3651 trainable parameters. We explored architectures with more layers and neurons per layer, but found no improvement in the accuracy. Due to the use of local inputs, in this approach the number of training samples is equal to the number of snapshots multiplied by N²_LES. In common practice, only a few (less than 10) snapshots of data are used as the training dataset [56, 114, 115, 122]. Here, following Ref. [56], we use 8 randomly selected snapshots (from the training set mentioned in Section 3.2), resulting in 524288 samples in the training set.
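The 20 inputs of Eq. (11) can be assembled per grid point from the periodic 3 × 3 stencils of ω̄ and ψ̄ plus the two local scalars; a small sketch follows (the exact stencil ordering used in Ref. [56] is an assumption here):

```python
import numpy as np

def ann_features(omega_bar, psi_bar, s_mag, grad_w_mag, i, j):
    """Assemble the 20 local ANN inputs of Eq. (11) at grid point (i, j):
    the 3x3 stencils of omega_bar and psi_bar (with periodic wrap-around),
    plus |S|_{i,j} and |grad omega|_{i,j}."""
    n = omega_bar.shape[0]
    rows = [(i + di) % n for di in (-1, 0, 1)]  # periodic neighbors in x
    cols = [(j + dj) % n for dj in (-1, 0, 1)]  # periodic neighbors in y
    sw = omega_bar[np.ix_(rows, cols)].ravel()  # 9 vorticity values
    sp = psi_bar[np.ix_(rows, cols)].ravel()    # 9 streamfunction values
    return np.concatenate([sw, sp, [s_mag[i, j]], [grad_w_mag[i, j]]])
```

With 2 hidden layers of 50 neurons each, this 20-input, 1-output network indeed has (20·50 + 50) + (50·50 + 50) + (50·1 + 1) = 3651 trainable parameters, matching the count in the text.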
We have also investigated the effects of increasing the number of samples to 20 snapshots, but again, no substantial improvement in training loss is found. Note that following Ref. [56], no pre-processing, e.g., normalization, is performed on the input or output data (we find normalizing the input/output samples to have no effect on the performance of the ANN).

Note that it is not the purpose of this paper to compare the ANN- and CNN-based approaches side by side (even if such a comparison is possible given the differences in architecture, network size, input/output, and size of the training set). Therefore, beyond the explorations mentioned above, we have not performed an exhaustive search on the ANN and the local SGS modeling approach. Our explorations all suggest that the comprehensively investigated network/approach presented in Ref. [56] is already optimal.

3.4 Physics-based SGS models (SMAG and DSMAG)

In the SMAG [95] model, which is a commonly used baseline SGS model for LES, the SGS stress term in the momentum equation is modeled as [81, 86]:

τ_SMAG = −(C_s ∆)² ⟨S̄ S̄⟩^(1/2) S̄,   (12)

where the angle brackets ⟨·⟩ denote domain averaging and S̄ is the filtered rate-of-strain tensor [81]. The SGS term Π in Eq. (3a) is therefore

Π_SMAG = (C_s ∆)² ⟨S̄ S̄⟩^(1/2) ∇²ω̄ = ν_e ∇²ω̄,   (13)

where C_s is the Smagorinsky coefficient, ν_e is the eddy viscosity, and

⟨S̄ S̄⟩^(1/2) = ⟨4 (∂²ψ̄/∂x∂y)² + (∂²ψ̄/∂x² − ∂²ψ̄/∂y²)²⟩^(1/2).   (14)

C_s is a constant in the SMAG model. The DSMAG model [30] uses a dynamic procedure to estimate ν_e based on the local flow structure. This procedure can lead to ν_e < 0, which can result in numerical instabilities; consequently, "positive clipping" is often applied to enforce ν_e ≥ 0. We use C_s = 1 for SMAG following Maulik et al. [56] and implement DSMAG (with positive clipping) following Pawar et al. [75], who studied the same 2D-DHIT system. Note that these SMAG and DSMAG models both have ν_e ≥ 0.

Table 1: Correlation coefficients c (Eq. (15)) between the predicted and true SGS term Π for Re = 32000 in a priori tests. The subscripts indicate c computed only over elements of Π^FDNS and Π^M corresponding to T > 0 and T < 0, respectively.

           CNN    ANN    DSMAG
c           …      …      …
c_{T>0}     …      …      …
c_{T<0}     …      …      …

4.1 A priori analysis
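As a reference for the comparisons in this section, the SMAG closure of Eqs. (12)-(14) reduces to a single domain-averaged eddy viscosity multiplying the Laplacian of the filtered vorticity. A numpy sketch follows (an illustration only; spectral derivatives on a 2π-periodic domain, with C_s and the filter scale ∆ passed in explicitly):

```python
import numpy as np

def smag_pi(omega_bar, psi_bar, cs, delta):
    """Pi_SMAG = nu_e * laplacian(omega_bar), Eq. (13), with
    nu_e = (cs*delta)^2 * <S S>^(1/2) and the domain-averaged strain
    magnitude of Eq. (14). Returns (Pi_SMAG, nu_e)."""
    n = omega_bar.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)  # integer wavenumbers, 2*pi domain
    kx, ky = np.meshgrid(k, k, indexing="ij")
    p_hat = np.fft.fft2(psi_bar)
    dd = lambda f_hat, a, b: np.real(np.fft.ifft2((1j * a) * (1j * b) * f_hat))
    psi_xy = dd(p_hat, kx, ky)
    psi_xx = dd(p_hat, kx, kx)
    psi_yy = dd(p_hat, ky, ky)
    s_mag = np.sqrt(np.mean(4.0 * psi_xy**2 + (psi_xx - psi_yy) ** 2))
    nu_e = (cs * delta) ** 2 * s_mag  # eddy viscosity, >= 0 by construction
    lap_w = np.real(np.fft.ifft2(-(kx**2 + ky**2) * np.fft.fft2(omega_bar)))
    return nu_e * lap_w, nu_e
```

Because ν_e is non-negative by construction, this baseline is purely diffusive; it cannot produce the local backscattering examined next.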
We first examine the accuracy of the CNN-based SGS model in predicting the Π term and inter-scale transfers for never-seen-before samples of (ψ̄, ω̄) from the testing set. The results in Section 4.1.1 are reported for n_tr = 50000. We use a commonly used metric, the correlation coefficient c between the modeled (Π^M) and true (Π^FDNS) SGS terms, defined as [4, 75, 121]:

c = ⟨(Π^M − ⟨Π^M⟩)(Π^FDNS − ⟨Π^FDNS⟩)⟩ / [⟨(Π^M − ⟨Π^M⟩)²⟩^(1/2) ⟨(Π^FDNS − ⟨Π^FDNS⟩)²⟩^(1/2)].   (15)

The correlation coefficients (averaged over 100 random testing samples) for the CNN as well as DSMAG and ANN are reported in Table 1. These a priori tests show that the data-driven SGS models substantially outperform DSMAG, and that this CNN-based model (with c above 0.9) has statistically significantly higher accuracy than this ANN-based model. Note that, similarly, previous findings based on correlation coefficients of the SGS stress term found CNNs to outperform ANNs in a priori tests [75].

Next, we examine the inter-scale transfer in a priori tests. The transfer is often quantified using the SGS stress [81]. Since here we are working with the SGS forcing term Π, which is the curl of the divergence of the SGS stress, we instead follow previous work and define the SGS transfer T as [38, 56, 103]:

T = sgn(∇²ω̄) ⊙ Π,   (16)

where sgn(·) is the sign function. At each grid point (i, j), T_{i,j} > 0 indicates forward transfer (diffusion) while T_{i,j} < 0 indicates backscattering.
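Both diagnostics are straightforward to compute; below is a numpy sketch of Eq. (15) and Eq. (16) (spectral Laplacian on a 2π-periodic domain assumed):

```python
import numpy as np

def corr_coef(pi_model, pi_fdns):
    """Correlation coefficient c between modeled and true SGS terms, Eq. (15)."""
    a = pi_model - pi_model.mean()
    b = pi_fdns - pi_fdns.mean()
    return (a * b).mean() / np.sqrt((a * a).mean() * (b * b).mean())

def sgs_transfer(omega_bar, pi):
    """Inter-scale transfer T = sgn(laplacian(omega_bar)) * Pi, Eq. (16);
    T > 0 marks forward transfer and T < 0 marks backscatter."""
    n = omega_bar.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)  # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    lap = np.real(np.fft.ifft2(-(kx**2 + ky**2) * np.fft.fft2(omega_bar)))
    return np.sign(lap) * pi
```

By construction c = 1 when the model reproduces Π^FDNS exactly, and the restricted coefficients c_{T>0} and c_{T<0} follow by applying corr_coef only to the grid points where the true T is positive or negative, respectively.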
Figure 4:
Example of inter-scale transfer T, Eq. (16), in a priori analysis at Re = 32000. (a) Filtered vorticity ω̄; (b) true T from FDNS; (c)-(e) T from CNN, ANN, and DSMAG. The ANN and CNN capture both forward transfer and backscatter while DSMAG only captures the forward transfer (diffusion). The upper row shows the entire domain while the second row shows the portion in the black square.

Note that accurately capturing backscattering is particularly important in 2D turbulence, given its inverse energy cascade, which is a physical property [81, 104]. For a sample filtered vorticity ω̄, Fig. 4 shows the true inter-scale transfer T^FDNS and T from CNN, ANN, and DSMAG. Because DSMAG is purely diffusive, it only captures the forward transfer. The ANN and CNN both capture the diffusion as well as backscattering. Table 1 further shows c computed separately over grid points corresponding to only T > 0 and only T < 0, with the CNN again attaining the highest values.

Overall, these a priori tests show that the CNN-based data-driven SGS model can accurately predict the out-of-sample SGS forcing terms and inter-scale transfers. However, as discussed in the Introduction, previous studies have found that accuracy in a priori tests does not necessarily translate to accuracy/stability in a posteriori analysis [4, 56, 120, 122]. Before discussing the a posteriori tests in Section 4.2, we further examine how the accuracy of the CNN depends on the size of the training set, which, as it turns out, impacts the stability of LES-CNN.

4.1.2 Effects of the number of training samples, n_tr

Table 2 shows how the SGS term's correlation coefficient c varies in a priori tests as the number of samples used to train the CNN (n_tr) is increased. The value of c increases with n_tr, reaching 0.90 with n_tr = 10000 and 0.93 with n_tr = 50000. While c = 0.90 (for n_tr = 10000) might seem high enough and the CNN-based data-driven SGS model might seem accurate enough, a set of a posteriori tests with this LES-CNN model are found to lead to noisy, unphysical flows for some initial conditions. In fact, a posteriori tests with LES-CNN trained with lower n_tr (500 or 1000) lead to numerically unstable simulations that blow up. Only simulations with n_tr ≥ 30000 are found to be stable and accurate (Table 2). The instabilities and inaccuracies in a posteriori tests might be due to inaccurate out-of-sample predictions as a result of insufficient training data. These findings are consistent with what we found for the a posteriori LES-ANN of forced 1D Burgers' turbulence [99], whose instability was traced to inaccurate Π terms around some of the shockwaves. In that study, we showed that artificially enriching the training dataset using a data augmentation strategy [25, 68, 117] led to a stable and accurate LES-ANN.

Table 2: Correlation coefficients c (Eq. (15)) between the CNN-predicted and true SGS term Π for Re = 32000 in a priori tests as a function of the number of training samples n_tr. The values show the average over (the same) 100 randomly chosen testing samples and the standard deviation. The last row indicates the fate of a posteriori LES-CNN integrations from 5 random initial conditions: unstable refers to numerical blow-up, unphysical refers to simulations leading to noisy/unrealistic flows, and stable refers to numerically stable and accurate simulations.

n_tr        500        1000       10000        30000     50000
c           …          …          0.90         0.92      0.93
c_{T>0}     …          …          …            …         …
c_{T<0}     …          …          0.89         0.91      …
LES-CNN     unstable   unstable   unphysical   stable    stable

Table 2 further reports c_{T>0} and c_{T<0} as a function of n_tr. This analysis shows that, consistently, c_{T<0} is lower than c_{T>0}, especially at small n_tr, but the difference declines from 0.15 to 0.04 with increasing n_tr. The implication of these results is that the SGS model with a CNN trained using a small n_tr is less capable of accurately predicting backscattering than forward transfer, which, based on previous findings, could lead to instabilities. As discussed in the Introduction, capturing backscattering is highly desired; however, it is known from physics-based SGS modeling efforts that it can lead to instabilities if handled incorrectly [49, 59]. Moreover, in recent data-driven SGS modeling efforts, as discussed later, removing backscattering has been used as a way of stabilizing a posteriori LES [56, 122]. Table 2 shows that, at least for our CNN, the backscattering can be accurately captured and the a posteriori
LES can be stable without any further post-processing ifthe training set is large enough.In short, these results suggest that neural networks that may “seem” well-trained and accuratein a priori (offline) tests, may not be sufficient for stable/accurate LES in a posteriori (online) tests.We say “seem” because there is no established a priori metric and threshold to know if a data-driven SGS model is well-trained and accurate enough to lead to stable and accurate a posteriori
LES. In this study, the threshold is empirically between c = 0 .
90 and c = 0 .
92, or if c T < is a bettermetric, between 0 .
89 and 0 .
91. To be clear, these are just empirical thresholds in this testcase, andsuch thresholds might be case-dependent. Whether a general connection between a data-drivenSGS models’ accuracy in a priori tests and the a posteriori
LES stability could be established ornot should be thoroughly investigated in future work. Furthermore, we emphasize that we do notclaim that all instabilities in other a posteriori
LES runs using data-driven SGS models (reportedin other studies) are due to similar inaccuracies that could be reduced by enriching the trainingset.
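The conditional correlation coefficients used in these a priori tests can be sketched in a few lines of NumPy. This is only an illustration of the bookkeeping (the exact definitions of c and T follow Eqs. (15)-(16) of the paper); `pi_true`, `pi_pred`, and `T` are assumed to be 2D fields sampled on the LES grid, and all names here are illustrative rather than from the paper's code:

```python
import numpy as np

def corr(a, b):
    """Pearson correlation coefficient between two fields (Eq. (15)-style)."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def conditional_corrs(pi_true, pi_pred, T):
    """c over all grid points, c_{T>} over forward-transfer points (T > 0),
    and c_{T<} over backscatter points (T < 0)."""
    fwd, back = T > 0, T < 0
    return (corr(pi_true, pi_pred),
            corr(pi_true[fwd], pi_pred[fwd]),
            corr(pi_true[back], pi_pred[back]))
```

Masking with `T > 0` and `T < 0` before computing the correlation is what separates the model's skill on diffusion from its skill on backscatter, which is the quantity that turns out to matter for a posteriori stability.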
A posteriori analysis
In the a posteriori (online) tests, the CNN-based data-driven SGS model and the LES numerical solver of Eqs. (3a)-(3b) are coupled (LES-CNN): at a given time step, the resolved variables (ψ̄, ω̄) from the numerical solver are normalized (dividing by their σ) and fed into the already trained CNN, which predicts Π^CNN. This Π^CNN is then de-normalized (multiplying by σ_Π) and fed back into the numerical solver to compute the resolved flow in the next time step, and the cycle continues. The CNN used for the a posteriori tests is trained with n_tr = 50000 and leads to stable LES-CNN in all tests conducted here. Similarly, we use the ANN-based data-driven SGS model and the physics-based SGS models SMAG and DSMAG to conduct LES-ANN, LES-SMAG, and LES-DSMAG integrations.

Figure 5: Evolution of kinetic energy E(t) normalized by E_0 = E(0) in a posteriori tests from 5 random initial conditions at Re = 8000 and Re = 32000. Note that for each Re, the ANN- and CNN-based data-driven SGS models have been trained on data from that Re. Curves show the mean from the 5 integrations. The LES integrations start at t = 50τ. All stable LES models overpredict the decay rate, but LES-CNN is closest to the FDNS while LES-DSMAG, and even more so the post-processed LES-ANN with backscattering removed, are too dissipative. LES-ANN without post-processing is unstable and blows up.

Figure 5 shows examples of the evolution of the kinetic energy E(t) = −⟨ψ̄ω̄⟩/2 for Re = 32000 as well as for Re = 8000. While the LES-CNN and LES-DSMAG are stable, LES-ANN is unstable, leading to rapid increases in E and blow-up. In their pioneering work, Maulik et al. [56] also found this LES-ANN unstable and proposed a post-processing step:

    Π^ANN_{i,j} = 0,  ∀ T_{i,j} < 0,   (17)

which effectively, like the positive clipping used for DSMAG, eliminates backscattering based on T from Eq. (16). A similar procedure was used by Zhou et al. [122] to stabilize their LES-ANN for 3D-DHIT. While this post-processed LES-ANN is stable, it is excessively dissipative (even more than DSMAG) and substantially overpredicts the energy decay rate. LES-CNN, which is stable without any post-processing and accounts for both diffusion and backscattering, has the closest agreement with FDNS in terms of the decay rate. It should be pointed out that it is possible that increasing the number of training samples for the ANN also leads to a more accurate and perhaps a stable LES-ANN; however, as mentioned before, the focus of this work is on LES-CNN and a comprehensive investigation of LES-ANN is beyond the scope of this paper. We present the results with LES-DSMAG as a baseline and present the results with the recently published LES-ANN to give the readers a better view of the state-of-the-art in this field.

Figure 6: Short-term prediction accuracy of LES models in a posteriori tests at Re = 32000. Predictions start at t = 50τ in the 5 testing sets. For each model, curves show the evolution of the relative L2-norm error, error_{L2}(t) = ‖ω̄_LES − ω̄_FDNS‖_2 / ‖ω̄_FDNS‖_2, averaged over the 5 integrations. The LES-CNN has the highest accuracy and outperforms LES-DSMAG. The large error in the post-processed LES-ANN and in LES-SMAG is due to excessive dissipation (see Fig. 7).

To examine the accuracy of LES-CNN in short-term forecasting, Fig. 6 presents the relative L2-norm error in the prediction of ω̄ averaged over 5 random initial conditions in the testing set from t = 50τ to 200τ. The results show that LES-CNN has the highest accuracy, outperforming the next best model, LES-DSMAG. The post-processed LES-ANN and LES-SMAG have substantially higher errors, which, as the next analysis shows, is due to their excessive dissipation. To further evaluate the short-term accuracy of these LES models, Fig. 7 shows an example of ω̄(x, y) at t = 100τ, 150τ, and 200τ predicted from an initial condition at t = 50τ in the testing set. Evidently, LES-CNN is capable of predicting both small- and large-scale structures well and outperforms LES-DSMAG, which, while capturing most of the large-scale structures well, misses many of the small-scale structures. The post-processed LES-ANN and LES-SMAG are too diffusive and miss most small-scale structures, substantially underpredicting the magnitude of ω̄, especially at later times.

Figure 7: Examples of the vorticity fields at t = 100τ, 150τ, and 200τ from one of the testing sets at Re = 32000. ω̄ from FDNS is shown in the first row (used as the "truth" for the LES). Rows 2-5 show ω̄ predicted from t = 50τ using 4 a posteriori LES models. The LES-CNN captures the patterns and magnitudes of both large- and small-scale structures well, except at the latest time at t = 200τ. While LES-DSMAG predicts most of the large-scale structures and some of the small-scale structures well, particularly at the earlier times, its overall accuracy is lower than that of LES-CNN (also see Fig. 6). The post-processed LES-ANN has a reasonably good performance at t = 100τ, but at later times, this model and the LES-SMAG model are too diffusive such that the magnitude of the vorticity field is underpredicted and small-scale structures are missing.

The above analysis shows that the superior accuracy of the CNN-based SGS model in a priori tests translates to high accuracy in short-term forecasts with LES-CNN in a posteriori tests. Next, we examine the accuracy of these a posteriori LES models in reproducing the statistics of the turbulent flow, which is an important test for the applicability of these models [63]. Figure 8 shows the TKE spectrum and probability density function (PDF) of vorticity at t = 200τ from the 5 simulations in the testing sets. Among the LES models, LES-CNN has the best performance: its TKE spectrum matches that of the FDNS across wavenumbers and its PDF matches that of the FDNS, even at the end of the tails. The next best-performing model is LES-DSMAG, whose TKE spectrum overall agrees with FDNS, although this model is more diffusive than LES-CNN. The excessive diffusion is more noticeable in the PDF of the vorticity field: while the PDF of LES-DSMAG matches the bulk of the FDNS' PDF, there are large deviations at the tails. The stronger dissipation of the post-processed LES-ANN and of LES-SMAG leads to TKE spectra that quickly curl down as k increases and PDFs that substantially deviate from the FDNS' PDF at the tails (for LES-SMAG, even in the bulk). Just to further demonstrate the importance of capturing backscattering in the outstanding performance of LES-CNN in matching the FDNS' spectrum and PDF, Fig. 8 also presents results from a post-processed LES-CNN with Eq. (17) (i.e., backscattering removed), showing that the model becomes excessively diffusive (with performance comparable to that of LES-DSMAG).

Figure 8: The TKE spectrum Ê(k) and probability density function (PDF) of vorticity at t = 200τ from a posteriori tests at Re = 32000. Results are from independent runs in the 5 testing sets. For Ê(k), the spectrum from each run is calculated and then averaged. For the PDF, data from all 5 runs are combined and the PDF is calculated using a kernel estimator [109]. For both the TKE spectrum and PDF, the LES-CNN has the best performance, followed by LES-DSMAG. Results from post-processed LES-CNN with Eq. (17) are shown just to demonstrate the importance of capturing backscattering for the excellent performance of LES-CNN in reproducing the TKE spectrum of FDNS and the tails of the FDNS' PDF. The post-processed LES-ANN with Eq. (17) and LES-SMAG are too diffusive, which shows in both the TKE spectrum and the PDF.

The a posteriori results show the advantages of the CNN-based data-driven SGS model, which provides a stable LES model while capturing backscattering, and yields superior performance for both forecasting short-term spatio-temporal evolution and reproducing long-term statistics of the turbulent flow.
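The LES-CNN coupling described above can be sketched as a simple loop. This is a minimal illustration, not the paper's code: `cnn` stands for the trained network, `step_les` for the solver of Eqs. (3a)-(3b), `sigma_*` for the training-set standard deviations, and the optional `transfer` argument (the inter-scale transfer T of Eq. (16), whose exact form is not reproduced here) enables Eq. (17)-style clipping for comparison:

```python
import numpy as np

def les_cnn_step(psi, omega, cnn, step_les,
                 sigma_psi, sigma_omega, sigma_pi, transfer=None):
    """One coupled LES-CNN time step (illustrative sketch).

    The resolved fields are normalized by their standard deviations, the
    trained CNN predicts the SGS term, the prediction is de-normalized by
    sigma_pi, and the LES solver advances the resolved flow one step.
    """
    inputs = np.stack([psi / sigma_psi, omega / sigma_omega])
    pi = cnn(inputs) * sigma_pi           # de-normalized SGS forcing Pi
    if transfer is not None:              # optional Eq. (17)-style clipping:
        T = transfer(psi, omega, pi)      # inter-scale transfer (Eq. (16))
        pi = np.where(T < 0.0, 0.0, pi)   # zero Pi wherever T < 0 (no backscatter)
    return step_les(psi, omega, pi)
```

Calling this function in a time loop with `transfer=None` corresponds to the unmodified LES-CNN; passing a transfer function reproduces the backscatter-removing post-processing used to stabilize LES-ANN.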
Transfer learning to higher Re

So far, we have tested the data-driven SGS model and the LES-CNN on flows with the same Re as the flow from which data was collected for the training of the CNN (Re = 8000 or Re = 32000). As discussed in the Introduction, the capability to generalize beyond the training flow, in particular to extrapolate to turbulent flows with higher Re in a posteriori tests, is essential for robust, trustworthy, and practically useful LES models. Neural networks are known to have difficulty with extrapolation, and in our recent work with multi-scale Lorenz 96 equations and forced 1D Burgers' turbulence, we found that data-driven SGS models do not generalize well to more chaotic systems or flows with 10× higher Re, leading to inaccurate predictions in a posteriori (online) tests [16, 99].

Figure 9: Transfer learning to higher Re. The TKE spectrum Ê(k) at t = 200τ from a posteriori tests at three different Re. Results are from independent runs in the 5 testing sets. For Ê(k), the spectrum from each run is calculated and then averaged. The superscript indicates the Re on which the CNN is trained with n_tr = 50000 samples. TL (transfer learned) means that the CNN has been re-trained with n_tr^TL = 500 samples (1% of n_tr) from the Re on which the LES-CNN is tested (indicated in the title of each panel). In each panel, the blue lines show that the LES-CNN trained and tested on the same Re is accurate and its TKE spectrum agrees with that of the FDNS. However, the red lines in the two panels on the right show that the LES-CNN trained on Re = 8000 does not perform well at 4× or 8× higher Re, with the TKE spectra of the simulated flow substantially deviating from that of the FDNS at high k near k_c. The red dashed lines show that the LES-TL-CNN pre-trained on Re = 8000 and transfer learned with a small amount of data from the higher Re performs well at 4× or 8× higher Re.

Similarly, Fig. 9 shows that for the 2D-DHIT system studied here, a data-driven SGS model trained on data from Re = 8000 leads to a posteriori LES-CNN that is accurate only at Re = 8000 but not at Re = 32000 or 64000. At these higher Re, the TKE spectra deviate substantially from the spectrum of the FDNS.

In both Chattopadhyay et al. [16] and Subel et al. [99], we showed that transfer learning enables accurate generalization/extrapolation of data-driven SGS models to more chaotic systems and turbulent flows with a 10× higher Re, although the effectiveness of this approach beyond 1D and for more complex turbulent flows remained to be investigated. Transfer learning involves taking a neural network that has already been trained for a given data distribution (e.g., flow with a given Re) using a large amount of data and re-training only some of its layers (usually the deeper layers) using a small amount of data from the new data distribution (e.g., flow with a higher Re) [31, 118]. For example, Fig. 10 shows the schematic of the transfer-learned CNN used here. While similar to the original CNN (Fig. 3), there is one major difference: for transfer learning, the first 8 Conv layers use the weights already computed during training with n_tr samples from the lower Re. These weights are fixed and remain the same during the re-training. The last two Conv layers are initialized with weights computed during training with n_tr samples from the lower Re, but these two layers will be trained and their weights will be updated using n_tr^TL = n_tr/100 samples from the higher Re. The key idea of TL is that in deep neural networks, the first layers learn high-level features, while the low-level features that are specific to a particular data distribution are learned only in the deeper layers [31, 118]. Therefore, for generalization, only the deeper layers need to be re-trained, which can be done using a small amount of data from the new distribution.

Figure 10: Schematic of the CNN with transfer learning for extrapolation to higher Re. Everything is the same as the original CNN shown in Fig. 3 with one exception: here, the first 8 Conv layers (gray) use the weights already computed during training with n_tr samples from the lower Re and are fixed (not to be trained). Only the last two Conv layers (blue) are going to be trained using n_tr^TL = n_tr/100 samples from the higher Re, after these layers are initialized not randomly but using the weights computed for the lower Re.

To examine the effectiveness of transfer learning in the 2D-DHIT testcase, we take the CNN that is already trained with n_tr samples from Re = 8000 and re-train it with n_tr^TL = n_tr/100 samples from Re = 32000 or Re = 64000. Figure 9 shows that a posteriori LES with these transfer-learned CNNs (LES-TL-CNN^{Re=8000}) accurately extrapolates to 4× and 8× higher Re. In both cases, the accuracy of the transfer-learned LES-TL-CNN is as good as that of the LES-CNN trained with n_tr samples from Re = 32000 or Re = 64000. Before showing the results for accurate extrapolation to even higher Re (16×) in the next section, we point out that the number of layers to be re-trained and the number of samples used for re-training (n_tr^TL) depend on the problem and require some trial and error for the best performance. Here, fixing the first 6 layers and re-training the deeper 4 layers (with the same n_tr^TL) leads to similar LES-TL-CNN performance. The goal of transfer learning is to minimize n_tr^TL while achieving the accuracy obtained with n_tr, with the number of re-trained layers being a hyper-parameter to be tuned to achieve this goal. Substantial exploration in forced 1D Burgers' turbulence showed that the a posteriori performance of LES with transfer-learned data-driven SGS models mainly depends on n_tr^TL as long as more than one layer is re-trained [99].

Transfer learning to higher Re and higher LES numerical resolution

One often-cited disadvantage of using CNNs (compared to local ANNs) for data-driven SGS modeling is dependence on the specific LES resolution for which the CNN has been trained, limiting the use of the LES-CNN on a different grid resolution (and the use of transfer learning to extrapolate to even higher Re, for which a higher LES resolution might be needed). Here, we show that this issue can be easily addressed by adding pooling (encoder) and upsampling (decoder) layers to the transfer-learning architecture.

For example, to use a CNN-based data-driven SGS model trained on data from Re = 8000 and resolution 256 × 256 to conduct a posteriori LES-TL-CNN integrations at Re = 64000 or Re = 128000 with resolution N_LES = 512, we can use the encoder-decoder architecture shown in Fig. 11. Here, the number of convolutional layers is the same as before, plus an additional layer before the encoder. The encoder, with a pooling layer of stride two, transforms the output of this first layer from the input size (512 × 512) down to 256 × 256, matching the resolution for which the next 8 Conv layers were already trained with n_tr samples from the lower Re (at resolution 256 × 256). These weights are kept fixed and these layers are not going to be trained. A decoder transforms the output of the last of these layers from the size 256 × 256 to the size of the first of the last two layers, which is 512 × 512; these two layers are initialized with the weights computed during training with n_tr samples from the lower Re (and lower resolution). Only these two layers and the very first layer will be trained, and their weights are updated using n_tr^TL = n_tr/100 samples from the higher Re and higher resolution. Here we use a factor of two increase in the resolution in each direction just as an example, and this approach can be used for other resolution changes too.

Figure 11: Schematic of the CNN with transfer learning and encoder-decoder architecture for extrapolation to higher Re and higher LES grid resolution. There are a few differences from the CNN shown in Fig. 10. Here, the input and output samples are at the higher resolution of 512 × 512 (inputs and outputs of the CNNs in Figs. 3 and 10 are at the resolution of 256 × 256). The 8 Conv layers that are already trained with n_tr samples from the lower Re and FDNS at the resolution of 256 × 256 are embedded within an encoder-decoder architecture. These 8 layers (gray) are fixed (not to be trained). The last two layers (in blue) are initialized not randomly but using the weights computed for the lower Re and lower resolution. A first layer (in blue) is added between the input and the encoder and is initialized randomly. Only these three layers are going to be trained using n_tr^TL = n_tr/100 samples from the higher Re and higher resolution (512 × 512).

Figure 12 shows, for Re = 64000 and Re = 128000, the TKE spectrum for LES-TL-CNN in comparison to that of FDNS. In this LES-TL-CNN, the numerical resolution is N_LES × N_LES = 512 × 512 and its CNN has been trained with n_tr = 50000 samples from Re = 8000 at the resolution of 256 × 256 and transfer-learned with n_tr^TL = n_tr/100 samples from Re = 64000 or Re = 128000 at the resolution of 512 × 512. The results show that transfer learning enables extrapolation to over an order-of-magnitude increase in Re (16×) and, with the encoder-decoder architecture, enables transfer between different LES resolutions. The implications of these findings, in particular for practical purposes, are discussed in the next section.

Figure 12: Transfer learning to higher Re and higher LES numerical resolution. The TKE spectrum Ê(k) at t = 200τ from a posteriori tests at two different Re. Results are from independent runs in the 5 testing sets. For Ê(k), the spectrum from each run is calculated and then averaged. The superscript indicates that the CNN has been trained with n_tr = 50000 samples from Re = 8000 at the resolution of 256 × 256 and re-trained with n_tr^TL = 500 samples (1% of n_tr) from the Re on which the LES-TL-CNN is tested (indicated in the title of each panel) at the resolution of 512 × 512. For Re = 64000, N_DNS = 2048, and for Re = 128000, N_DNS = 3072. The FDNS is at the resolution of 512 × 512. The blue lines show that the LES-TL-CNN pre-trained on Re = 8000 and transfer learned with a small amount of data from the higher Re and resolution performs well at 8× or 16× higher Re. Note that for LES-TL-CNN at both Re, here we use N_LES = 512.

Discussion and summary

Using 2D decaying turbulence as the testbed, we have examined the performance of a CNN-based, non-local, data-driven SGS model in a priori and a posteriori analyses, with training and testing done on data from flows with the same Re. We have also investigated the effectiveness of transfer learning in enabling a posteriori LES-CNNs that are trained on data from flows with low Re (and low grid resolution) to work for flows with higher Re (and higher grid resolution). In all cases, training is done on filtered DNS data, and the performance is tested in comparison with out-of-sample filtered DNS data.

As discussed in Section 4.1, a priori tests at Re = 32000 show that the trained data-driven SGS model can accurately predict the SGS forcing terms from never-seen-before snapshots of the resolved flow with correlation coefficients c (Eq. (15)) around 0.93, substantially outperforming a baseline physics-based SGS model, DSMAG. The data-driven SGS model is also found to accurately capture both forward transfer and backscattering between the resolved and unresolved scales.

To examine the connection between a priori and a posteriori performance, we have evaluated the accuracy of a priori tests (in terms of c) and the stability of a posteriori LES-CNN as the number of training samples is varied from n_tr = 500 to 50000 (Table 2). This analysis shows that while the SGS model trained with n_tr = 10000 seems accurate (with c = 0.90), the resulting LES-CNN (with this n_tr) is not stable. Increasing n_tr to 30000 and 50000 further improves c to 0.92 and 0.93, respectively, and leads to accurate and stable a posteriori LES-CNN, without any need for post-processing or additional eddy viscosity. More analysis, in which c is calculated separately for grid points experiencing only forward transfer or only backscattering, shows that at low n_tr the CNN captures backscattering with much lower accuracy compared to forward transfer, but that the difference decreases as n_tr is increased. This analysis suggests that the instabilities of a posteriori LES-CNN trained with small training sets might be due to the inability of the SGS model to correctly represent backscattering. Why learning backscattering requires more data remains to be studied in future work. This might be because backscattering is fundamentally harder to learn from data, or because backscattering is less frequent than forward transfer, or both. While we do not claim that all instabilities in a posteriori (online) tests are due to this issue and could be overcome by increasing n_tr, we believe that these findings can help future studies in understanding the reason(s) behind these instabilities and formulating rigorous solutions (see below for further discussions).

As discussed in Section 4.2, a posteriori tests at Re = 32000 with the CNN trained with n_tr = 50000 show that LES-CNN is stable and accurate. The LES-CNN outperforms LES-DSMAG and LES with the other tested SGS models in terms of both short-term forecasts and reproducing the TKE spectrum and PDF of vorticity (even at the tails). The main shortcoming of the other models is that they are too diffusive, primarily because they do not capture backscattering, due to their formulation or the post-processing steps used to make them stable.
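The shell-averaged TKE spectrum Ê(k) used in these comparisons can be computed from a vorticity snapshot along the following lines. This is a minimal NumPy sketch assuming a 2π-periodic square domain and the 2D relation |û(k)|² = |ω̂(k)|²/|k|²; the function name and binning convention are illustrative, not from the paper's code:

```python
import numpy as np

def tke_spectrum(omega):
    """Angle-averaged kinetic-energy spectrum from a 2D vorticity field on a
    doubly periodic 2*pi box, using |u_hat(k)|^2 = |omega_hat(k)|^2 / |k|^2."""
    n = omega.shape[0]
    w_hat = np.fft.fft2(omega) / n**2           # normalized Fourier coefficients
    w_hat[0, 0] = 0.0                           # drop the mean mode
    k = np.fft.fftfreq(n, d=1.0 / n)            # integer wavenumbers
    k2 = k[:, None]**2 + k[None, :]**2
    k2[0, 0] = 1.0                              # avoid division by zero at k = 0
    e_density = 0.5 * np.abs(w_hat)**2 / k2     # kinetic energy per Fourier mode
    shells = np.rint(np.sqrt(k2)).astype(int)   # bin modes into integer shells
    e_k = np.bincount(shells.ravel(), weights=e_density.ravel(), minlength=n)
    return e_k[: n // 2]                        # shell-summed spectrum E(k)
```

Averaging the output of this function over independent runs gives curves directly comparable to the Ê(k) panels of Figs. 8, 9, and 12.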
The CNN-based SGS model learns both forward transfer and backscattering non-parametrically from data, and as mentioned above, once the latter is accurately captured with enough training samples, this SGS model leads to an accurate and stable a posteriori LES-CNN.

The analysis presented in Section 4.3 shows that a data-driven SGS model trained at Re = 8000 does not lead to accurate a posteriori LES-CNN solutions (in terms of TKE spectra) at higher Re, e.g., at Re = 32000 or Re = 64000. However, we show that transfer learning largely solves this problem and enables the LES-CNN trained for a flow at low Re to provide accurate and stable solutions for flows with higher Re, while requiring only a small amount of data from the flow at higher Re. The data-driven SGS model can even be coupled with LES solvers that use higher grid resolutions by adding an encoder-decoder architecture to the transfer-learned CNN (Section 4.4). For example, we show that a CNN trained with n_tr = 50000 samples from Re = 8000 (at filtered resolution 256 × 256) leads to stable and accurate a posteriori LES-CNN for flows with Re = 128000 and N_LES = 512 once 2 out of the 10 convolution layers of the CNN are re-trained with only n_tr^TL = n_tr/100 = 500 samples from Re = 128000. To the best of our knowledge, this is the first application of transfer learning to building generalizable data-driven SGS models beyond 1D turbulence (the 1D results were presented in our recent work [99]).

In summary, in a canonical 2D turbulent flow, we present promising results showing that CNNs and transfer learning can be used together to build non-local data-driven SGS models that lead to accurate, stable, and generalizable LES models. The generalization capability provided by transfer learning is key to making such data-driven SGS models practically useful. This is because training a base CNN model with a large training set of high-fidelity data from low Re and then requiring only a small amount of high-fidelity data from the higher Re for re-training is highly desirable for turbulence modeling, given the sharp increase in the computational cost of high-fidelity simulations such as DNS at higher Re. It should also be highlighted that because transfer learning only requires a small amount of data and re-training of only a few layers, its training process is fast and has a low computational cost; thus, it can be conducted on the fly, for example when dealing with non-stationary systems. Moreover, the ability to also transfer between different LES resolutions further broadens the applicability of non-local SGS models. While not examined here, it is also possible that transfer learning provides generalization beyond Re and grid resolution, for example between canonical fluid systems and fluid flows with more complex geometries. Such applications should be explored in future work.

Beyond the obvious need to study the performance of the CNN-based SGS models and transfer learning in more complex turbulent flows (e.g., 3D, wall turbulence, stratified), there are a number of avenues to pursue in order to further expand and improve the methodology.
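The transfer-learning recipe recapped above (keep the early layers frozen, update only the deeper ones, and wrap the pre-trained core in a pooling encoder and an upsampling decoder when the LES grid changes) can be sketched framework-free. This is an illustrative NumPy sketch under stated assumptions: the layer names, the plain-SGD update, and the identity-style layer callables are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

def sgd_retrain_step(weights, grads, lr=1e-3, trainable=("conv9", "conv10")):
    """Transfer-learning update: only the deeper (trainable) layers change;
    all other layers keep the weights learned at the lower Re."""
    return {name: (w - lr * grads[name]) if name in trainable else w
            for name, w in weights.items()}

def encode(x):
    """Stride-2 average pooling, e.g., (512, 512) -> (256, 256)."""
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2])

def decode(x):
    """Nearest-neighbor upsampling, e.g., (256, 256) -> (512, 512)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def tl_cnn_forward(x_hi, first_layer, frozen_core, last_layers):
    """Fig. 11-style forward pass: a new first layer at the high resolution,
    the frozen pre-trained core at the low resolution, and a re-trained head
    back at the high resolution."""
    return last_layers(decode(frozen_core(encode(first_layer(x_hi)))))
```

In a real setting `frozen_core` would be the 8 pre-trained Conv layers and `first_layer`/`last_layers` the three randomly or lower-Re initialized layers that are re-trained with the small n_tr^TL dataset; the point of the sketch is only the control flow of freezing and resolution adaptation.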
The number of training samples might potentially be reduced, without loss of accuracy or stability, using data augmentation, e.g., through pre-processing the training data by exploiting the symmetries of the flow [68, 99], and/or using physics-informed ML [37]. Examples of the latter include adding components (such as capsules [15] and transformers [14]) that better preserve spatial correlations in the CNN, or imposing physical constraints in the loss function [e.g., 37, 111]. Establishing a connection between accuracy in a priori tests and stability in a posteriori tests would also be substantially helpful. Note that in this work (and in most other SGS modeling studies), an "offline training" strategy is used: the SGS model is first trained using snapshots of the resolved flow as inputs and snapshots of the SGS term as outputs, and then this trained data-driven model is coupled with the coarse-resolution LES solver. At least some of the issues related to stability could potentially be resolved, and even the scaling with the size of the training set could be improved, by using an "online training" strategy, which involves training the data-driven model to find the best SGS term that evolves the solution of the LES closest to that of the DNS. Sirignano et al. [94] have recently presented an exciting and promising framework for such an approach. Exploring data-driven SGS models that account for non-Markovian effects arising from coarse-graining, as suggested by the Mori-Zwanzig formalism [17, 48, 71, 110], is another direction to pursue in future work. Finally, interpreting the CNNs that provide accurate SGS models, such as the one trained here, can lead to insight into the SGS physics and possibly even better data-driven and/or physics-based models. While interpreting neural networks is currently challenging, using them along with data-driven equation-discovery methods might provide a stepping stone, as done, for example, for ocean mesoscale eddies in pioneering work by Zanna and Bolton [120].

Acknowledgments

We thank Romit Maulik, Tapio Schneider, and Laure Zanna for insightful discussions about data-driven SGS modeling, and Karan Jakhar, Rambod Mojgani, and Ebrahim Nabizadeh for helpful comments on the manuscript. This work was supported by an award from the ONR Young Investigator Program (N00014-20-1-2722) and a grant from the NSF CSSI program (OAC-2005123) to P.H. Computational resources were provided by NSF XSEDE (allocation ATM170020) to use the Bridges GPU and Comet GPU clusters, and by the Rice University Center for Research Computing. The codes for the CNN and the CNN with transfer learning are publicly available at https://github.com/envfluids/2D-DDP.

A Initial condition for DNS
Following previous studies, we choose the initial conditions of DNS to have the same energy spec-trum but randomly different vorticity fields [55, 56, 67]. The initial energy spectrum is given by[67] ˆ E ( k ) = Ak e − ( k/k p ) , (18)where the amplitude is A = 4 k − p π , (19)and k = | k | = q k x + k y . The maximum value of the energy spectrum occurs at √ k p , where k p = 10 is used here following Ref. [55]. The given energy spectrum in turn determines themagnitude of the Fourier coefficients of vorticity: | ˆ ω ( k ) | = s kπ ˆ E ( k ) . (20)23hen the vorticity distribution in Fourier space isˆ ω ( k ) = | ˆ ω ( k ) | e i η ( k ) , (21)where η ( k ) = η ( k ) + η ( k ). η ( k ) and η ( k ) are independent random numbers from a uniformdistribution in [0 , π ] at each ( k x , k y ) when both k x , k y ≥ k x − k y plane).The values at the other quadrants are as follows: η ( k ) = − η ( k ) + η ( k ) for k x < , k y ≥ η ( k ) = − η ( k ) − η ( k ) for k x < , k y < η ( k ) = + η ( k ) − η ( k ) for k x ≥ , k y < t = 0.Figures 1(a) and 1(d) show an example of the initial ω ( x, y ) and the corresponding ˆ E ( k ),respectively. The initial vorticity is dominated by relatively large-scale structures, but small-scalestructures emerge as the flow evolves (Figs. 1(b), (c), and (d)). From t ≈ τ , the ˆ E ( k ) spectrumexhibits self-similarity and follows the Kraichnan-Batchelor-Leith (KBL) theory [3, 41, 46]. Between t = 50 τ and 200 τ , the flow decays due to the viscous dissipation, the small-scale structures fadeaway, and the large, coherent vortices merge and grow as a result of the inverse energy cascade.Following previous studies, we focus on this phase of the decaying 2D turbulence and discard thefirst 50 τ as spin-up [4, 55]. 24 eferences [1] A. Arakawa and V. R. Lamb. Computational design of the basic dynamical processes of the UCLAgeneral circulation model. General Circulation Models of the Atmosphere , 17(Supplement C):173–265,1977.[2] Y. Bar-Sinai, S. 
Hoyer, J. Hickey, and M. P. Brenner. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019.
[3] G. K. Batchelor. Computation of the energy spectrum in homogeneous two-dimensional turbulence. The Physics of Fluids, 12(12):II–233, 1969.
[4] A. Beck, D. Flad, and C.-D. Munz. Deep neural networks for data-driven LES closure models. Journal of Computational Physics, 398:108910, 2019.
[5] A. Beck and M. Kurz. A perspective on machine learning methods in turbulence modelling. arXiv preprint arXiv:2010.12226, 2020.
[6] J. Berner, F. Doblas-Reyes, T. Palmer, G. Shutts, and A. Weisheimer. Impact of a quasi-stochastic cellular automaton backscatter scheme on the systematic error and seasonal prediction skill of a global climate model. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 366(1875):2559–2577, 2008.
[7] T. Bolton and L. Zanna. Applications of deep learning to ocean data inference and subgrid parameterization. Journal of Advances in Modeling Earth Systems, 11(1):376–399, 2019.
[8] S. L. Brunton, B. R. Noack, and P. Koumoutsakos. Machine learning for fluid mechanics. Annual Review of Fluid Mechanics, 52:477–508, 2020.
[9] D. Carati, S. Ghosal, and P. Moin. On the representation of backscatter in dynamic localization models. Physics of Fluids, 7(3):606–616, 1995.
[10] G. J. Chandler and R. R. Kerswell. Invariant recurrent solutions embedded in a turbulent two-dimensional Kolmogorov flow.
Journal of Fluid Mechanics, 722:554–595, 2013.
[11] J. R. Chasnov. Simulation of the Kolmogorov inertial subrange using an improved subgrid model. Physics of Fluids A: Fluid Dynamics, 3(1):188–200, 1991.
[12] A. Chattopadhyay, P. Hassanzadeh, and S. Pasha. Predicting clustered weather patterns: A test case for applications of convolutional neural networks to spatio-temporal climate data. Scientific Reports, 10(1):1–13, 2020.
[13] A. Chattopadhyay, P. Hassanzadeh, and D. Subramanian. Data-driven predictions of a multiscale Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural network, and long short-term memory network. Nonlinear Processes in Geophysics, 27(3):373–389, 2020.
[14] A. Chattopadhyay, M. Mustafa, P. Hassanzadeh, and K. Kashinath. Deep spatial transformers for autoregressive data-driven forecasting of geophysical turbulence. In Proceedings of the 10th International Conference on Climate Informatics, pages 106–112, 2020.
[15] A. Chattopadhyay, E. Nabizadeh, and P. Hassanzadeh. Analog forecasting of extreme-causing weather patterns using deep learning. Journal of Advances in Modeling Earth Systems, 12(2):e2019MS001958, 2020.
[16] A. Chattopadhyay, A. Subel, and P. Hassanzadeh. Data-driven super-parameterization using deep learning: Experimentation with multi-scale Lorenz 96 systems and transfer-learning. Journal of Advances in Modeling Earth Systems, 12(11):e2020MS002084, 2020.
[17] A. J. Chorin, O. H. Hald, and R. Kupferman. Optimal prediction and the Mori–Zwanzig representation of irreversible processes.
Proceedings of the National Academy of Sciences, 97(7):2968–2973, 2000.
[18] A. Dipankar, B. Stevens, R. Heinze, C. Moseley, G. Zängl, M. Giorgetta, and S. Brdar. Large eddy simulation using the general circulation model ICON. Journal of Advances in Modeling Earth Systems, 7(3):963–986, 2015.
[19] J. A. Domaradzki, R. W. Metcalfe, R. S. Rogallo, and J. J. Riley. Analysis of subgrid-scale eddy viscosity with use of results from direct numerical simulations. Physical Review Letters, 58(6):547, 1987.
[20] J. A. Domaradzki and E. M. Saiki. A subgrid-scale model based on the estimation of unresolved scales of turbulence. Physics of Fluids, 9(7):2148–2164, 1997.
[21] S. B. Driss, M. Soua, R. Kachouri, and M. Akil. A comparison study between MLP and convolutional neural network models for character recognition. In Real-Time Image and Video Processing 2017, volume 10223. International Society for Optics and Photonics, 2017.
[22] O. R. Dunbar, A. Garbuno-Inigo, T. Schneider, and A. M. Stuart. Calibration and uncertainty quantification of convective parameters in an idealized GCM. arXiv preprint arXiv:2012.13262, 2020.
[23] K. Duraisamy. Machine Learning-augmented Reynolds-averaged and large eddy simulation models of turbulence. arXiv preprint arXiv:2009.10675, 2020.
[24] K. Duraisamy, G. Iaccarino, and H. Xiao. Turbulence modeling in the age of data. Annual Review of Fluid Mechanics, 51:357–377, 2019.
[25] S. Formentin, M. Mazzoleni, M. Scandella, and F. Previdi. Nonlinear system identification via data augmentation.
Systems & Control Letters, 128:56–63, 2019.
[26] R. O. Fox. Large-eddy-simulation tools for multiphase flows. Annual Review of Fluid Mechanics, 44:47–76, 2012.
[27] B. Fox-Kemper and D. Menemenlis. Can large eddy simulation techniques improve mesoscale rich ocean models? Ocean Modeling in an Eddying Regime, 177:319–337, 2008.
[28] H. Frezat, G. Balarac, J. L. Sommer, R. Fablet, and R. Lguensat. Physical invariance in neural networks for subgrid-scale scalar flux modeling. arXiv preprint arXiv:2010.04663, 2020.
[29] M. Gamahara and Y. Hattori. Searching for turbulence models by artificial neural network. Physical Review Fluids, 2(5):054604, 2017.
[30] M. Germano, U. Piomelli, P. Moin, and W. H. Cabot. A dynamic subgrid-scale eddy viscosity model. Physics of Fluids A: Fluid Dynamics, 3(7):1760–1765, 1991.
[31] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[32] I. Grooms, Y. Lee, and A. J. Majda. Numerical schemes for stochastic backscatter in the inverse cascade of quasigeostrophic turbulence. Multiscale Modeling & Simulation, 13(3):1001–1021, 2015.
[33] H. T. Hewitt, M. Roberts, P. Mathiot, A. Biastoch, E. Blockley, E. P. Chassignet, B. Fox-Kemper, P. Hyder, D. P. Marshall, E. Popova, et al. Resolving and parameterising the ocean mesoscale in earth system models. Current Climate Change Reports, pages 1–16, 2020.
[34] M. F. Jansen and I. M. Held. Parameterizing subgrid-scale eddy effects using energetically consistent backscatter.
Ocean Modelling, 80:36–48, 2014.
[35] M. F. Jansen, I. M. Held, A. Adcroft, and R. Hallberg. Energy budget-based backscatter in an eddy permitting primitive equation model. Ocean Modelling, 94:15–26, 2015.
[36] M. L. Kaandorp and R. P. Dwight. Data-driven modelling of the Reynolds stress tensor using random forests with invariance. Computers & Fluids, page 104497, 2020.
[37] K. Kashinath, M. Mustafa, A. Albert, J.-L. Wu, C. Jiang, S. Esmaeilzadeh, K. Azizzadenesheli, R. Wang, A. Chattopadhyay, A. Singh, A. Manepalli, D. Chirila, R. Yu, R. Walters, B. White, H. Xiao, H. A. Tchelepi, P. Marcus, A. Anandkumar, P. Hassanzadeh, and Prabhat. Physics-informed machine learning: case studies for weather and climate modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194):20200093, 2021.
[38] R. M. Kerr, J. A. Domaradzki, and G. Barbier. Small-scale properties of nonlinear interactions and subgrid-scale energy transfer in isotropic turbulence. Physics of Fluids, 8(1):197–208, 1996.
[39] S. Khani and M. L. Waite. Backscatter in stratified turbulence. European Journal of Mechanics-B/Fluids, 60:1–12, 2016.
[40] B. Knaepen and P. Moin. Large-eddy simulation of conductive flows at low magnetic Reynolds number. Physics of Fluids, 16(5):1255–1261, 2004.
[41] R. H. Kraichnan. Inertial ranges in two-dimensional turbulence. The Physics of Fluids, 10(7):1417–1423, 1967.
[42] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages 1097–1105, 2012.
[43] D. Krueger, E. Caballero, J.-H. Jacobsen, A. Zhang, J. Binas, R. L. Priol, and A. Courville. Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688, 2020.
[44] M. Kurz and A. Beck. A machine learning framework for LES closure terms. arXiv preprint arXiv:2010.03030, 2020.
[45] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
[46] C. Leith. Atmospheric predictability and two-dimensional turbulence. Journal of the Atmospheric Sciences, 28(2):145–161, 1971.
[47] D. Leslie and G. Quarini. The application of turbulence theory to the formulation of subgrid modelling procedures. Journal of Fluid Mechanics, 91(1):65–91, 1979.
[48] Z. Li, X. Bian, X. Li, and G. E. Karniadakis. Incorporation of memory effects in coarse-grained modeling via the Mori-Zwanzig formalism. The Journal of Chemical Physics, 143(24):243128, 2015.
[49] D. K. Lilly. A proposed modification of the Germano subgrid-scale closure method. Physics of Fluids A: Fluid Dynamics, 4(3):633–635, 1992.
[50] J. Ling, A. Kurzawski, and J. Templeton. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.
[51] Y. Liu, L. Lu, L. Fang, and F. Gao. Modification of Spalart–Allmaras model with consideration of turbulence energy backscatter using velocity helicity.
Physics Letters A, 375(24):2377–2381, 2011.
[52] P. P. Mana and L. Zanna. Toward a stochastic parameterization of ocean mesoscale eddies. Ocean Modelling, 79:1–20, 2014.
[53] P. J. Mason and D. J. Thomson. Stochastic backscatter in large-eddy simulations of boundary layers. Journal of Fluid Mechanics, 242:51–78, 1992.
[54] R. Maulik, K. Fukami, N. Ramachandra, K. Fukagata, and K. Taira. Probabilistic neural networks for fluid flow surrogate modeling and data recovery. Physical Review Fluids, 5(10):104401, 2020.
[55] R. Maulik and O. San. A stable and scale-aware dynamic modeling framework for subgrid-scale parameterizations of two-dimensional turbulence. Computers & Fluids, 158:11–38, 2017.
[56] R. Maulik, O. San, A. Rasheed, and P. Vedula. Subgrid modelling for two-dimensional turbulence using neural networks. Journal of Fluid Mechanics, 858:122–144, 2019.
[57] R. Maulik, H. Sharma, S. Patel, B. Lusch, and E. Jennings. A turbulent eddy-viscosity surrogate modeling framework for Reynolds-Averaged Navier-Stokes simulations. Computers & Fluids, page 104777, 2020.
[58] C. Meneveau and J. Katz. Scale-invariance and turbulence models for large-eddy simulation. Annual Review of Fluid Mechanics, 32(1):1–32, 2000.
[59] C. Meneveau and T. S. Lund. The dynamic Smagorinsky model and scale-dependent coefficients in the viscous range of turbulence.
Physics of Fluids, 9(12):3932–3934, 1997.
[60] A. Mohan, D. Daniel, M. Chertkov, and D. Livescu. Compressed convolutional LSTM: An efficient deep learning framework to model high fidelity 3D turbulence. arXiv preprint arXiv:1903.00033, 2019.
[61] A. T. Mohan, D. Tretiak, M. Chertkov, and D. Livescu. Spatio-temporal deep learning models of 3D turbulence with physics informed diagnostics. Journal of Turbulence, pages 1–41, 2020.
[62] P. Moin and K. Mahesh. Direct numerical simulation: a tool in turbulence research. Annual Review of Fluid Mechanics, 30(1):539–578, 1998.
[63] R. D. Moser, S. W. Haering, and G. R. Yalla. Statistical properties of subgrid-scale turbulence models. Annual Review of Fluid Mechanics, 53, 2020.
[64] B. Nadiga. Stochastic vs. deterministic backscatter of potential enstrophy in geostrophic turbulence. Stochastic Physics and Climate Modeling, edited by T. Palmer and P. Williams (Cambridge University Press, Cambridge, England, 2009), 2010.
[65] G. Novati, H. L. de Laroussilhe, and P. Koumoutsakos. Automating turbulence modelling by multi-agent reinforcement learning. Nature Machine Intelligence, pages 1–10, 2021.
[66] J. O’Brien, J. Urzay, M. Ihme, P. Moin, and A. Saghafian. Subgrid-scale backscatter in reacting and inert supersonic hydrogen-air turbulent mixing layers. Journal of Fluid Mechanics, 743:554, 2014.
[67] P. Orlandi. Fluid flow phenomena: a numerical toolkit, volume 55. Springer Science & Business Media, 2012.
[68] S. Pan and K. Duraisamy. Long-time predictive modeling of nonlinear dynamical systems using neural networks.
Complexity, 2018, 2018.
[69] S. Pandey, J. Schumacher, and K. R. Sreenivasan. A perspective on machine learning in turbulent flows. Journal of Turbulence, pages 1–18, 2020.
[70] E. J. Parish and K. Duraisamy. A paradigm for data-driven predictive modeling using field inversion and machine learning. Journal of Computational Physics, 305:758–774, 2016.
[71] E. J. Parish and K. Duraisamy. A dynamic subgrid scale model for large eddy simulations based on the Mori–Zwanzig formalism. Journal of Computational Physics, 349:154–175, 2017.
[72] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120(2):024102, 2018.
[73] J. Pathak, M. Mustafa, K. Kashinath, E. Motheau, T. Kurth, and M. Day. Using machine learning to augment coarse-grid computational fluid dynamics simulations. arXiv preprint arXiv:2010.00072, 2020.
[74] S. Pawar, S. E. Ahmed, and O. San. Interface learning in fluid dynamics: Statistical inference of closures within micro–macro-coupling models. Physics of Fluids, 32(9):091704, 2020.
[75] S. Pawar, O. San, A. Rasheed, and P. Vedula. A priori analysis on deep learning of subgrid-scale parameterizations for Kraichnan turbulence. Theoretical and Computational Fluid Dynamics, pages 1–27, 2020.
[76] C. Peyrard, F. Mamalet, and C. Garcia. A comparison between multi-layer perceptrons and convolutional neural networks for text image super-resolution. In
VISAPP (1), pages 84–91, 2015.
[77] U. Piomelli. Large-eddy simulation: achievements and challenges. Progress in Aerospace Sciences, 35(4):335–362, 1999.
[78] U. Piomelli, W. H. Cabot, P. Moin, and S. Lee. Subgrid-scale backscatter in turbulent and transitional flows. Physics of Fluids A: Fluid Dynamics, 3(7):1766–1771, 1991.
[79] U. Piomelli, T. A. Zang, C. G. Speziale, and M. Y. Hussaini. On the large-eddy simulation of transitional wall-bounded flows. Physics of Fluids A: Fluid Dynamics, 2(2):257–265, 1990.
[80] H. Pitsch. Large-eddy simulation of turbulent combustion. Annual Review of Fluid Mechanics, 38:453–482, 2006.
[81] S. B. Pope. Turbulent Flows. IOP Publishing, 2001.
[82] G. D. Portwood, B. T. Nadiga, J. A. Saenz, and D. Livescu. Interpreting neural network models of residual scalar flux. Journal of Fluid Mechanics, 907, 2021.
[83] K. G. Pressel, S. Mishra, T. Schneider, C. M. Kaul, and Z. Tan. Numerics and subgrid-scale modeling in large eddy simulations of stratocumulus clouds. Journal of Advances in Modeling Earth Systems, 9(2):1342–1365, 2017.
[84] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
Journal of Computational Physics, 378:686–707, 2019.
[85] S. Rasp, M. S. Pritchard, and P. Gentine. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689, 2018.
[86] P. Sagaut. Large eddy simulation for incompressible flows: An introduction. Springer Science & Business Media, 2006.
[87] P. Sagaut, M. Terracol, and S. Deck. Multiscale and multiresolution approaches in turbulence - LES, DES and Hybrid RANS/LES methods: Applications and Guidelines. World Scientific, 2013.
[88] H. Salehipour and W. Peltier. Deep learning of mixing by two “atoms” of stratified turbulence. Journal of Fluid Mechanics, 861, 2019.
[89] F. Sarghini, G. De Felice, and S. Santini. Neural networks based subgrid scale modeling in large eddy simulations. Computers & Fluids, 32(1):97–108, 2003.
[90] H. Sarlak, C. Meneveau, and J. N. Sørensen. Role of subgrid-scale modeling in large eddy simulation of wind turbine wake interactions. Renewable Energy, 77:386–399, 2015.
[91] T. Schneider, S. Lan, A. Stuart, and J. Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12–396, 2017.
[92] V. Shinde. Proper orthogonal decomposition assisted subfilter-scale model of turbulence for large eddy simulation.
Physical Review Fluids, 5(1):014605, 2020.
[93] G. Shutts. A kinetic energy backscatter algorithm for use in ensemble prediction systems. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 131(612):3079–3102, 2005.
[94] J. Sirignano, J. F. MacArt, and J. B. Freund. DPM: A deep learning PDE augmentation method with application to large-eddy simulation. Journal of Computational Physics, 423:109811, 2020.
[95] J. Smagorinsky. General circulation experiments with the primitive equations: I. The basic experiment. Monthly Weather Review, 91(3):99–164, 1963.
[96] A. N. Souza, G. Wagner, A. Ramadhan, B. Allen, V. Churavy, J. Schloss, J. Campin, C. Hill, A. Edelman, J. Marshall, et al. Uncertainty quantification of ocean parameterizations: Application to the k-profile-parameterization for penetrative convection. Journal of Advances in Modeling Earth Systems, 12(12):e2020MS002108, 2020.
[97] R. J. Stevens, L. A. Martínez-Tossas, and C. Meneveau. Comparison of wind farm large eddy simulations using actuator disk and actuator line models with wind tunnel experiments. Renewable Energy, 116:470–478, 2018.
[98] R. Stoffer, C. M. van Leeuwen, D. Podareanu, V. Codreanu, M. A. Veerman, M. Janssens, O. K. Hartogensis, and C. C. van Heerwaarden. Development of a large-eddy simulation subgrid model based on artificial neural networks: A case study of turbulent channel flow. Geoscientific Model Development Discussions, pages 1–29, 2020.
[99] A. Subel, A. Chattopadhyay, Y. Guan, and P. Hassanzadeh. Data-driven subgrid-scale modeling of forced Burgers turbulence using deep learning with generalization to higher Reynolds numbers via transfer learning.
Physics of Fluids (in press), 2021. Available at arXiv:2012.06664.
[100] P. Tabeling. Two-dimensional turbulence: A physicist approach. Physics Reports, 362(1):1–62, 2002.
[101] S. Taghizadeh, F. D. Witherden, and S. S. Girimaji. Turbulence closure modeling with data-driven techniques: Physical compatibility and consistency considerations. arXiv preprint arXiv:2004.03031, 2020.
[102] Z. Tan, T. Schneider, J. Teixeira, and K. G. Pressel. Large-eddy simulation of subtropical cloud-topped boundary layers: 2. Cloud response to climate change. Journal of Advances in Modeling Earth Systems, 9(1):19–38, 2017.
[103] J. Thuburn, J. Kent, and N. Wood. Cascades, backscatter and conservation in numerical models of two-dimensional turbulence. Quarterly Journal of the Royal Meteorological Society, 140(679):626–638, 2014.
[104] G. K. Vallis. Atmospheric and Oceanic Fluid Dynamics. Cambridge University Press, 2017.
[105] W. T. Verkley, C. A. Severijns, and B. A. Zwaal. A maximum entropy approach to the interaction between small and large scales in two-dimensional turbulence. Quarterly Journal of the Royal Meteorological Society, 145(722):2221–2236, 2019.
[106] Z. Y. Wan, P. Vlachas, P. Koumoutsakos, and T. Sapsis. Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLOS One, 13(5):e0197704, 2018.
[107] J.-X. Wang, J.-L. Wu, and H. Xiao. Physics-informed machine learning approach for reconstructing Reynolds stress modeling discrepancies based on DNS data. Physical Review Fluids, 2(3):034603, 2017.
[108] Z. Wang, K. Luo, D. Li, J. Tan, and J. Fan. Investigations of data-driven closure for subgrid-scale stress in large-eddy simulation.
Physics of Fluids, 30(12):125101, 2018.
[109] R. R. Wilcox. Fundamentals of modern statistical methods: Substantially improving power and accuracy. Springer, 2010.
[110] J. Wouters and V. Lucarini. Multi-level dynamical systems: Connecting the Ruelle response theory and the Mori-Zwanzig approach. Journal of Statistical Physics, 151(5):850–860, 2013.
[111] J.-L. Wu, K. Kashinath, A. Albert, D. Chirila, H. Xiao, et al. Enforcing statistical constraints in generative adversarial networks for modeling chaotic dynamical systems. Journal of Computational Physics, 406:109209, 2020.
[112] J.-L. Wu, H. Xiao, and E. Paterson. Physics-informed machine learning approach for augmenting turbulence models: A comprehensive framework. Physical Review Fluids, 3(7):074602, 2018.
[113] C. Xie, J. Wang, H. Li, M. Wan, and S. Chen. Artificial neural network mixed model for large eddy simulation of compressible isotropic turbulence. Physics of Fluids, 31(8):085112, 2019.
[114] C. Xie, J. Wang, K. Li, and C. Ma. Artificial neural network approach to large-eddy simulation of compressible isotropic turbulence. Physical Review E, 99(5):053113, 2019.
[115] C. Xie, J. Wang, and E. Weinan. Modeling subgrid-scale forces by spatial artificial neural networks in large eddy simulation of turbulence. Physical Review Fluids, 5(5):054606, 2020.
[116] C. Xie, X. Xiong, and J. Wang. Artificial neural network approach for turbulence models: A local framework. arXiv preprint arXiv:2101.10528, 2021.
[117] Y. Xie, E. Franz, M. Chu, and N. Thuerey. tempoGAN: A temporally coherent, volumetric GAN for super-resolution fluid flow.
ACM Transactions on Graphics (TOG), 37(4):1–15, 2018.
[118] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
[119] Y. Zang, R. L. Street, and J. R. Koseff. A dynamic mixed subgrid-scale model and its application to turbulent recirculating flows. Physics of Fluids A: Fluid Dynamics, 5(12):3186–3196, 1993.
[120] L. Zanna and T. Bolton. Data-driven equation discovery of ocean mesoscale closures. Geophysical Research Letters, 47(17):e2020GL088376, 2020.
[121] Physical Review A, 43(12):7049, 1991.
[122] Z. Zhou, G. He, S. Wang, and G. Jin. Subgrid-scale model for large-eddy simulation of isotropic turbulent flows using an artificial neural network. Computers & Fluids, 195:104319, 2019.
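The initial-condition procedure of Appendix A can be sketched numerically. The snippet below is a minimal sketch, not the code used in the paper: the grid size N, k_p, and the random seed are illustrative choices, and instead of the quadrant-by-quadrant phase rules it obtains conjugate-symmetric random phases by Fourier-transforming real white noise, which equally guarantees that the sampled vorticity is real at t = 0.

```python
import numpy as np

def initial_vorticity(N=128, kp=10.0, seed=0):
    """Sample a vorticity field with energy spectrum
    E(k) = A k^4 exp(-(k/kp)^2) and random Fourier phases."""
    rng = np.random.default_rng(seed)
    # Integer wavenumbers on the doubly periodic grid.
    k1d = np.fft.fftfreq(N, d=1.0 / N)
    kx, ky = np.meshgrid(k1d, k1d, indexing="ij")
    k = np.hypot(kx, ky)
    # Amplitude A = 4 kp^{-5} / (3 sqrt(pi)) normalizes the total KE.
    A = 4.0 / (3.0 * np.sqrt(np.pi) * kp**5)
    E = A * k**4 * np.exp(-((k / kp) ** 2))
    # |omega_hat(k)| = sqrt(k E(k) / pi), Eq. (20).
    mag = np.sqrt(k * E / np.pi)
    # Hermitian random phases from the FFT of real white noise,
    # so omega_hat(-k) = conj(omega_hat(k)) holds by construction.
    noise_hat = np.fft.fft2(rng.standard_normal((N, N)))
    phase = noise_hat / np.maximum(np.abs(noise_hat), 1e-30)
    omega_hat = mag * phase
    return np.fft.ifft2(omega_hat).real

omega = initial_vorticity(N=64, kp=10.0)
```

Normalization conventions for the discrete Fourier transform are omitted here; a solver would rescale omega_hat to match its own FFT convention before time stepping.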