Turbulence closure modeling with data-driven techniques: Influence of neural network hyperparameters and training procedure on generalizability
S. Taghizadeh,a) F. D. Witherden, Y. A. Hassan, and S. S. Girimaji
Department of Mechanical Engineering, Texas A&M University, College Station, TX 77843, USA
Department of Ocean Engineering, Texas A&M University, College Station, TX 77843, USA
Department of Nuclear Engineering, Texas A&M University, College Station, TX 77843, USA
(Dated: 3 February 2021)
Generalizability of machine-learning (ML) assisted turbulence closure models to unseen flows remains an important challenge. It is well known that the architecture of a neural network and the manner of training have a profound influence on the predictive capability and generalizability of the resulting model [I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Vol. 1 (2016)]. The objective of the present work is to characterize the influence of the choice of (i) the network (in terms of the number of neurons and hidden layers) and (ii) training process elements, such as the type of loss function, on the generalizability of the ML models. A systematic examination of ML model generalizability over a broad parameter regime requires detailed datasets for many flows over the entire domain. The main novelty of the present study is to utilize a simplified proxy-physics surrogate of turbulence to obtain data, rather than prohibitively expensive high-fidelity simulations or experiments. The chosen proxy model exhibits important features, including bifurcation and rapid distortion effects, in different parameter regimes. Using proxy-physics data, training and testing of various neural networks are performed over different regimes. Optimal network architecture and training techniques are identified based on performance in these test studies. The findings provide a clear basis for choosing neural network architecture and training procedure for turbulence applications.

a) Electronic mail: [email protected]

I. INTRODUCTION

Turbulent flows exhibit vastly different characteristics in different parameter regimes depending upon the mean strain-to-rotation rate ratio, mean-to-turbulence time scale ratio, underlying flow instabilities, large-scale unsteadiness, presence or absence of system rotation or streamline curvature, body-force effects and flow geometry. In addition, due to the inherent complexity of the turbulence phenomenon, closure models developed in one parameter regime cannot be presumed to be reasonable or even valid in other regimes. The strong dependence of flow statistics on the various physical parameters is one of the enduring challenges in the field of turbulence closure model development. The degree of difficulty of closure modeling depends upon the level of closure. In the Reynolds-averaged Navier-Stokes (RANS) method, the fundamental governing equations are averaged over all scales of motion, leading to a significant reduction in the computational effort needed for performing a flow simulation. The reduction in computational burden comes at the cost of increased complexity of closure modeling. Ad hoc simplifications or assumptions are typically invoked to close various terms in the RANS equations. While RANS models may perform adequately in flows for which they are calibrated, they can be catastrophically wrong in other complex flows due to lack of generalizability. Despite inherent limitations, RANS is widely used in industrial applications involving complex flows due to ease of computation. At the other extreme of the closure spectrum, in the large-eddy simulation (LES) approach, dynamically important scales are resolved and only the small-scale motions are modeled. The small scales are significantly easier to model as they embody most of the 'universal' aspects of turbulence and, therefore, are more easily amenable to generalizability than their RANS counterpart.
Thus, in LES, the relative simplicity and generalizability of subgrid closure models come at the expense of significantly increased computational cost. The closure modeling challenges of scale-resolving simulations (SRS) are of intermediate degree of difficulty, as they resolve more scales than RANS but significantly fewer than LES.

There is heightened expectation in recent times that ML techniques can be used to significantly improve RANS turbulence closure model performance in complex industrial flows. The rationale is that the shortcomings of the physics-based closures can be circumvented by appropriate data-based training of the models. Toward this end, many authors have used neural networks to model the turbulence constitutive relation. However, ML methods for turbulence closure are not without their own challenges and shortcomings. There is no clear guidance on the architecture of the neural network or the different elements of the training procedure. Ling et al. showed that embedding the Galilean invariance property in the deep neural network architecture can improve the predictive capability of ML turbulence models. They used a deep neural network with 8 hidden layers and 30 neurons per hidden layer. Zhang et al. trained a neural network with 4 hidden layers and 20 neurons per layer to develop a model to predict the Reynolds stress of a channel flow at different Reynolds numbers. Using the same channel flow datasets, Fang et al. trained a neural network with 5 hidden layers and 50 neurons per layer to improve model performance. Jiang et al. used DNS of channel flow at different Reynolds numbers to develop a new algebraic Reynolds stress model by training a deep neural network with 9 hidden layers and a varying number of neurons in each layer. Sotgiu et al. developed a Reynolds stress constitutive model by training a neural network with 8 hidden layers and 8 neurons per layer. They used plane channel and periodic hill channel flow datasets for training the ML algorithm. Geneva and Zabaras trained a neural network with 5 hidden layers and tapered the number of neurons of the last hidden layers in an attempt to improve training performance.

In the field of computer vision, the architecture of a neural network and the manner of training have been demonstrated to have a profound influence on the predictive capability and generalizability of the resulting model. Indeed, neural networks of different architectures trained with the same data can produce significantly different results during testing and prediction. Lacking a formal procedure for neural network selection and training protocol, ML-assisted turbulence closures can be as ad hoc as the traditional models and, more importantly, lack generalizability. In other areas, it has been shown that ML models developed with subject matter expertise have a better chance of succeeding. Thus, it is of much interest to examine generalizability in turbulence-like systems that are much simpler to compute.

The objective of this study is to systematically analyze the influence of neural-network hyperparameters and training procedure on the generalizability of turbulence models to unseen flows. Such understanding can lead to important guidelines and best-practice recommendations to fully realize the advantages offered by ML. Comprehensive examination of the generalizability of ML models over a wide parameter range of turbulent flows is rendered difficult by the lack of high-fidelity data.
Direct numerical simulation (DNS) of a wide range of flows is prohibitively expensive. Experiments can be performed expeditiously only for a small set of flows. While it is preferable to perform this study using data from complex turbulent flows, this study focuses on a simpler approach as a first step.

To enable a reasonably rigorous analysis, we (i) restrict our consideration to statistically two-dimensional homogeneous turbulence, which represents the most elementary non-trivial flows of interest; and (ii) utilize data generated from a proxy-physics turbulence surrogate that incorporates many of the key aspects of real homogeneous flows. The proxy-physics surrogate is expected to provide some level of turbulence subject matter expertise without expensive data. Clearly, the proxy-physics surrogate must be simple to compute, while retaining key complexities of turbulent flows. If the ML models cannot demonstrate generalizability in the simple proxy-physics system, it would be unreasonable to expect ML models to perform well in unseen turbulent flows. Further, the lessons learned for this simpler system can yield valuable insight into the closure modeling of the more complex turbulence phenomenon. The various neural-network hyperparameters and training elements considered in this study are: the number of hidden layers (depth of the network), the number of neurons in each layer (width of the network) and the type of loss function.

The organization of the rest of the paper is as follows. The selected proxy-physics turbulence surrogate is explained in Sec. II. The neural network hyperparameters and training elements are described in Sec. III. Section IV details how data is generated with the proxy-physics turbulence surrogate in different regimes of two-dimensional homogeneous turbulence. The results from different architectures and training methods are presented in Sec. V. We conclude with a summary of findings and recommendations in Sec. VI.

II. METHODOLOGY
In a statistically two-dimensional homogeneous incompressible turbulent flow field without body forces, only two parameters govern the physical behavior, $Sk/\varepsilon$ and $Rk/\varepsilon$, where $S$, $R$, $k$ and $\varepsilon$ are, respectively, the mean strain rate, mean rotation rate, turbulent kinetic energy and turbulent dissipation. This set of flows does not exhibit any large-scale instabilities, coherent structures, complex geometrical features or statistical unsteadiness. Despite the apparent simplicity, homogeneous two-dimensional turbulent flows embody multiple complex phenomena that are strongly dependent upon the parameter values. The flow goes from hyperbolic to rectilinear to elliptic streamline geometry with increasing mean rotation rate. Depending upon the mean-flow-to-turbulence strain rate ratio, the flow physics can range from the rapid distortion limit to decaying anisotropic turbulence.

The algebraic Reynolds stress model (ARSM) does reasonably well in capturing key flow physics in different parts of the flow regime. The ARSM requires the solution of a cubic equation, and the appropriate root must be chosen in different flow regimes to yield the correct behavior. In this study, we use the self-consistent and nonsingular fully explicit algebraic Reynolds stress model (EARSM) proposed by Girimaji as the proxy-physics turbulence surrogate to generate stress-strain (constitutive) relationship data for different normalized strain and rotation rates. As mentioned in the Introduction, generalizability of the neural-network models in this simple system is a prerequisite to generalizability in actual turbulent flows. Further, because EARSM captures many key statistical features of turbulence, this study will yield much valuable insight into RANS ML modeling. The proxy-physics data in different parts of the domain are then used to train neural networks of different architectures. The networks are then tested in other parameter regimes to examine generalizability. Complete knowledge of the proxy-physics solution enables precise assessment of the error incurred during generalization to test flows.

Using representation theory, a three-term expansion of the anisotropy tensor in terms of the normalized strain ($S$) and normalized rotation rate ($R$) tensors for two-dimensional mean flows can be expressed as

$$b_{ij} = G_1 S_{ij} + G_2 \left( S_{ik} R_{kj} - R_{ik} S_{kj} \right) + G_3 \left( S_{ik} S_{kj} - \tfrac{1}{3} \delta_{ij} S_{mn} S_{nm} \right), \quad (1)$$

where the scalar coefficients $G_1$-$G_3$ are constitutive closure coefficients (CCC) that must be modeled. They are functions of the scalar invariants of the strain and rotation-rate tensors ($\eta_1 = S_{ij} S_{ij}$, $\eta_2 = R_{ij} R_{ij}$) and other flow quantities such as the turbulent kinetic energy ($k$), the turbulence frequency ($\omega$) and the coefficients of the pressure-strain model. An implicit algebraic equation for the anisotropy tensor can be obtained in the weak-equilibrium limit of turbulence using the following simplification:

$$\frac{D b_{ij}}{D t} = \frac{\partial b_{ij}}{\partial t} + U_k \frac{\partial b_{ij}}{\partial x_k} \approx 0, \quad (2)$$

where $D/Dt$ is the substantial derivative following the mean flow. The weak-equilibrium assumption is valid for many flows wherein the timescale of anisotropy evolution is rapid compared to the timescales of the mean flow, turbulent kinetic energy and dissipation rate. Using Eqs. (1) and (2), the nonlinear algebraic Reynolds stress equation can be written as the following cubic fixed-point equation for $G_1$:

$$(\eta_1 L_1^1)^2 G_1^3 - (2 \eta_1 L_1^0 L_1^1) G_1^2 + \left[ (L_1^0)^2 + \eta_1 L_1^1 L_2 - \tfrac{2}{3} \eta_1 (L_3)^2 + 2 \eta_2 (L_4)^2 \right] G_1 - L_1^0 L_2 = 0. \quad (3)$$

Then $G_2$ and $G_3$ can be expressed as

$$G_2 = \frac{-L_4 G_1}{L_1^0 - \eta_1 L_1^1 G_1}, \qquad G_3 = \frac{2 L_3 G_1}{L_1^0 - \eta_1 L_1^1 G_1}. \quad (4)$$

The closure coefficients in Eqs. (3) and (4) are as follows:

$$L_1^0 = \frac{C_1^0}{2} - 1, \quad L_1^1 = C_1^1 + 2, \quad L_2 = \frac{C_2}{2} - \frac{2}{3}, \quad L_3 = \frac{C_3}{2} - 1, \quad L_4 = \frac{C_4}{2} - 1, \quad (5)$$

where the $C$'s are numerical constants of the pressure-strain correlation model. In this study a linearized version of the so-called SSG pressure-strain model is employed:

$$C_1^0 = 3.4, \quad C_1^1 = 1.8, \quad C_2 = 0.36, \quad C_3 = 1.25, \quad C_4 = 0.4. \quad (6)$$

The cubic relation in Eq. (3) has multiple real and complex roots, and the selection of the appropriate solution for $G_1$ is not straightforward. By considering two physical selection criteria, (i) continuity of $G_1$ and (ii) $G_1 \le 0$, Girimaji derives a fully explicit solution of the cubic Eq. (3) for the scalar coefficient $G_1$ as follows:

$$G_1 = \begin{cases} \dfrac{L_1^0 L_2}{(L_1^0)^2 + 2 \eta_2 (L_4)^2} & \text{for } \eta_1 = 0, \\[2mm] \dfrac{L_1^0 L_2}{(L_1^0)^2 - \tfrac{2}{3} \eta_1 (L_3)^2 + 2 \eta_2 (L_4)^2} & \text{for } L_1^1 = 0, \\[2mm] -\dfrac{p}{3} + \left( -\dfrac{b}{2} + \sqrt{D} \right)^{1/3} + \left( -\dfrac{b}{2} - \sqrt{D} \right)^{1/3} & \text{for } D > 0, \\[2mm] -\dfrac{p}{3} + 2 \sqrt{-\dfrac{a}{3}} \cos\!\left( \dfrac{\theta}{3} \right) & \text{for } D < 0,\ b < 0, \\[2mm] -\dfrac{p}{3} + 2 \sqrt{-\dfrac{a}{3}} \cos\!\left( \dfrac{\theta}{3} + \dfrac{2\pi}{3} \right) & \text{for } D < 0,\ b > 0. \end{cases} \quad (7)$$

Here the discriminant $D$ of the cubic Eq. (3) is calculated as

$$D = \frac{b^2}{4} + \frac{a^3}{27}; \quad (8)$$

the other parameters of Eqs. (7) and (8) are defined as

$$a = q - \frac{p^2}{3}, \quad b = \frac{1}{27}\left( 2 p^3 - 9 p q + 27 r \right), \quad p = -\frac{2 L_1^0}{\eta_1 L_1^1},$$
$$q = \frac{1}{(\eta_1 L_1^1)^2} \left[ (L_1^0)^2 + \eta_1 L_1^1 L_2 - \tfrac{2}{3} \eta_1 (L_3)^2 + 2 \eta_2 (L_4)^2 \right], \quad r = -\frac{L_1^0 L_2}{(\eta_1 L_1^1)^2}, \quad \cos(\theta) = \frac{-b/2}{\sqrt{-a^3/27}}. \quad (9)$$

The first two cases in Eq. (7) are special limiting cases of the last three, and it can be shown that the limiting behavior can be easily calculated from the general expressions. After the coefficient $G_1$ is calculated, the other CCC can also be obtained using Eq. (4) in the entire parameter space.

FIG. 1: Different states of turbulence.

FIG. 2: CCC contour plots, (a) $G_1$, (b) $G_2$ and (c) $G_3$.
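To make the proxy-physics surrogate concrete, the following is a minimal NumPy sketch of Eqs. (5)-(9) as reconstructed above. It assumes scalar invariant inputs with $\eta_1 > 0$ (the special limiting cases of Eq. (7) would need separate branches), and the log-spaced sampling range at the end is an illustrative assumption: the paper states only that 150 points per parameter are sampled non-uniformly.

```python
import numpy as np

# Linearized SSG pressure-strain constants, Eq. (6)
C10, C11, C2, C3, C4 = 3.4, 1.8, 0.36, 1.25, 0.4

# Closure coefficients, Eq. (5)
L10, L11 = C10 / 2 - 1, C11 + 2
L2, L3, L4 = C2 / 2 - 2 / 3, C3 / 2 - 1, C4 / 2 - 1

def earsm_ccc(eta1, eta2):
    """Return (G1, G2, G3) for eta1 = S_ij S_ij > 0 and eta2 = R_ij R_ij,
    using the explicit root selection of Eq. (7)."""
    # Coefficients of the monic cubic G1^3 + p*G1^2 + q*G1 + r = 0, Eq. (9)
    p = -2 * L10 / (eta1 * L11)
    q = (L10**2 + eta1 * L11 * L2 - (2 / 3) * eta1 * L3**2
         + 2 * eta2 * L4**2) / (eta1 * L11) ** 2
    r = -L10 * L2 / (eta1 * L11) ** 2
    a = q - p**2 / 3
    b = (2 * p**3 - 9 * p * q + 27 * r) / 27
    D = b**2 / 4 + a**3 / 27                  # discriminant, Eq. (8)
    if D > 0:                                 # single real root
        G1 = -p / 3 + np.cbrt(-b / 2 + np.sqrt(D)) + np.cbrt(-b / 2 - np.sqrt(D))
    else:                                     # three real roots: select per Eq. (7)
        theta = np.arccos(-b / 2 / np.sqrt(-a**3 / 27))
        shift = 0.0 if b < 0 else 2 * np.pi / 3
        G1 = -p / 3 + 2 * np.sqrt(-a / 3) * np.cos(theta / 3 + shift)
    denom = L10 - eta1 * L11 * G1             # shared denominator of Eq. (4)
    return G1, -L4 * G1 / denom, 2 * L3 * G1 / denom

# 150 samples per invariant -> 150 x 150 = 22,500 points (Sec. IV);
# the log-spaced range below is assumed for illustration only.
pts = np.logspace(-2, 2, 150)
dataset = np.array([(e1, e2, *earsm_ccc(e1, e2)) for e1 in pts for e2 in pts])
```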
Physics of EARSM. The goal of this study is to use EARSM to provide some turbulence subject matter expertise. For this reason, it is necessary to ensure that the selected proxy model adequately incorporates some of the known features of turbulence. The different turbulence states covered by the proxy-physics surrogate are depicted for the considered parameter space in Fig. 1. In the rapidly strained turbulence state, the strain rate dominates over the rotation rate and the discriminant $D$ of the cubic fixed-point equation for $G_1$ is negative. Realizability violations can occur in this rapidly strained region due to the governing elastic constitutive relationship. The rotation rate dominates over the strain rate in the high mean-vorticity state. Other important turbulence states are also represented in this figure. Hyperbolic streamline flows occur when $S \gg R$, and the streamlines are elliptic when $S \ll R$. When $S \approx R$, rectilinear shear flows are seen. The contour plot of $G_1$ given by Eq. (7) is illustrated in Fig. 2. It is evident that $G_1$, and therefore the effective turbulent viscosity, is well defined over the entire parameter space, wherein the different important turbulence states are covered. It is also seen that the coefficient $G_1$ is a continuous function across the bifurcation line $D = 0$. Overall, the physics underlying the $G_1$ contour is a reasonable facsimile of real turbulent flows. Fig. 2 also shows well-defined values for the $G_2$ and $G_3$ coefficients in the parameter space. Therefore, the analytical, self-consistent EARSM model incorporates some of the known phenomena of turbulence. Overall, EARSM provides reasonable CCC in the entire parameter space and can be used as an appropriate proxy model of turbulence.

The objective of this study is to investigate whether the developed ML turbulence model can reasonably reproduce this simple surrogate of turbulence. If the ML models cannot perform well with this simplified description, they are unlikely to perform successfully in real turbulent flows. We also see from Fig. 2 that the CCC have a large range of values in the parameter space. In particular, the magnitude of the CCC is approximately zero for larger values of the parameters, while it is orders of magnitude larger in the decaying turbulence state. As mentioned earlier, small values of $G_1$ for large $S$ are a consequence of elastic constitutive behaviour and are needed for preserving realizability. This EARSM model is used to generate Reynolds stress data for training and testing ML models in this study.

III. MACHINE LEARNING
Previous investigations.
ML is a general term describing a class of algorithms that use data to generate models. The selection of the modeling strategy depends on the type of problem. Recently, deep neural networks have been widely used for modeling turbulent flows. Ling et al. trained a deep neural network with 8 hidden layers and 30 neurons per hidden layer with available high-fidelity data: duct and channel flow, perpendicular and inclined jets in cross-flow, flow around a square cylinder and flow through a converging-diverging channel. Their results showed that incorporating the Galilean invariance property in the deep neural network architecture can improve the predictive capability of ML turbulence models. Zhang et al. trained a neural network with 4 hidden layers and 20 neurons per layer to develop a model to predict the Reynolds stress of a channel flow at different Reynolds numbers. They obtained well-behaved models by introducing regularization in their training algorithm. Using the same channel flow datasets, Fang et al. trained a neural network with 5 hidden layers and 50 neurons per layer to develop an improved Reynolds stress tensor model. Jiang et al. used DNS of channel flow at different Reynolds numbers to develop a new algebraic Reynolds stress model by training a deep neural network with 9 hidden layers and a varying number of neurons in each layer (12, 18, 21, 27, 32, 35, 30, 28 and 27, respectively). Sotgiu et al. developed a Reynolds stress constitutive model by training a neural network with 8 hidden layers and 8 neurons per layer. They used plane channel and periodic hill channel flow datasets for training the ML algorithm. Geneva and Zabaras trained a neural network with 5 hidden layers and tapered the number of neurons in the last two hidden layers to prevent the weights from being too small and to improve training performance. The required training data was generated by performing LES of different flows: converging-diverging channel, periodic hills, square duct, square cylinder and tandem cylinders.
Objective. Determining a suitable network architecture and the training hyperparameters of a deep neural network is an empirical task, and it has been shown that there is a strong positive correlation between the final performance of trained neural networks and the experience of the user in optimizing the hyperparameters. There is no rigorous guidance on the architecture of the neural network or the different elements of the training procedure in developing ML-assisted turbulence models. The objective of this study is to perform a comparative analysis of different structures of feed-forward neural networks. Based on the physics of the proxy-physics surrogate, three different scenarios are introduced for generating training datasets. Then, for each scenario, a systematic optimization of the neural network architecture is performed to identify optimum network architectures that ensure a low generalization error on different datasets.
Deep neural networks. A deep neural network is a class of model that transforms the input parameters (features) through several layers of units (neurons). Each unit is connected by affine linear maps to units in successive layers, followed by nonlinear (scalar) activation functions within units. Different activation functions, such as the rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid and hyperbolic tangent, are shown in Fig. 3.

FIG. 3: Different activation functions.

The ReLU function is the popular choice in the ML literature:

$$\sigma(z) = \max(z, 0), \quad (10)$$

where $z$ is the input to a neuron. In this study, we selected the ReLU activation function for all the neural networks.

It should be noted that deep neural networks consist of simple functions, and their efficiency emanates from the interactions between a large number of hidden layers. Due to their flexible architecture and superior performance in modeling nonlinear and complicated relationships with high-dimensional data, they have become a popular subset of ML approaches. Although a variety of network structures, such as convolutional neural networks (CNN), recurrent neural networks (RNN) and long short-term memory (LSTM) networks, have been proposed in the ML literature, we restrict ourselves to fully connected architectures. Fig. 4 shows the schematic of a deep feed-forward neural network (also termed a multi-layer perceptron, MLP), which is trained using backpropagation with a gradient descent method. The input layer (invariants $\eta_1$, $\eta_2$), the hidden layers and the output layer (coefficients $G_1$, $G_2$, $G_3$) are shown in this figure.

FIG. 4: Illustration of a deep feed-forward neural network.
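As an illustration of the fully connected architecture just described, the following is a minimal sketch in PyTorch; the paper does not name a framework, so the library choice and all identifiers here are assumptions. It builds an MLP mapping the two invariants to the three CCC for an arbitrary depth and width:

```python
import torch.nn as nn

def make_mlp(n_inputs: int = 2, n_outputs: int = 3,
             depth: int = 7, width: int = 15) -> nn.Sequential:
    """Feed-forward network of the kind sketched in Fig. 4: inputs (eta1, eta2)
    -> `depth` hidden layers of `width` ReLU units each -> (G1, G2, G3)."""
    layers, n_in = [], n_inputs
    for _ in range(depth):
        layers += [nn.Linear(n_in, width), nn.ReLU()]
        n_in = width
    layers.append(nn.Linear(n_in, n_outputs))   # linear output layer
    return nn.Sequential(*layers)

net = make_mlp(depth=7, width=15)   # the 7L-15N configuration of Sec. V
```

With depth=7 and width=15 this reproduces the 7L-15N configuration that Sec. V identifies as optimal.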
Loss function. Deep neural networks are trained by minimizing a loss (objective) function, which measures the difference between the predicted output of the model and the labels (ground truth data). It has been shown that the success of ML models depends on the formulation of the loss function used for optimization of the model coefficients (the weights and biases of the neurons) during the ML training process. The type of loss function is also problem specific and needs to be selected properly. The root mean squared error (RMSE) is the commonly used loss function:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}, \quad (11)$$

where $\hat{y}_i$ denotes the ML prediction, $y_i$ is the true labeled data and $N$ is the number of training data. Alternatively, the mean absolute percentage error (MAPE) is also considered:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{|y_i|}. \quad (12)$$

The RMSE loss works well in most cases, while the MAPE loss is better for cases where the output has a large range of function values. To find the appropriate type of loss function, the models trained with MAPE and RMSE are compared in this study.

Adding a regularization term to the loss function is one of the common ways of avoiding overfitting. Overfitting occurs when an algorithm is tailored to a particular dataset and does not generalize well to unseen datasets. Introducing a regularization term in the loss function shrinks the model coefficients towards zero, decreases the complexity of the model and hence significantly reduces the variance. The $L_1$ norm (Lasso regression) and $L_2$ norm (Ridge regression) are the two common regularization methods:

$$\mathrm{MAPE}(y, \hat{y}) + \lambda \sum_{i} |\omega_i| \quad (L_1\text{-norm}), \qquad \mathrm{MAPE}(y, \hat{y}) + \lambda \sum_{i} \omega_i^2 \quad (L_2\text{-norm}), \quad (13)$$

where $\omega_i$ denotes the weight of each neuron and $\lambda$ is a positive hyperparameter that determines the strength of the regularization. In this work, to control overfitting in some experiments, Ridge regression ($L_2$ norm regularization) is applied to all the hidden layers during the training of the neural networks.
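The two candidate losses and the ridge penalty of Eqs. (11)-(13) translate directly into code. The sketch below, in the same assumed PyTorch setting as above and reusing net from the previous block, shows one Adam update. The stand-in batch, the division guard eps, the learning rate and the $\lambda$ value are illustrative assumptions; the paper's $\lambda$ and learning-rate values are not legible in the source.

```python
import torch

def rmse_loss(pred, target):                    # Eq. (11)
    return torch.sqrt(torch.mean((target - pred) ** 2))

def mape_loss(pred, target, eps=1e-12):         # Eq. (12); eps guards /0, not in the paper
    return torch.mean(torch.abs(target - pred) / (torch.abs(target) + eps))

def ridge_penalty(model, lam):                  # L2 term of Eq. (13), weights only
    return lam * sum((w ** 2).sum()
                     for name, w in model.named_parameters()
                     if name.endswith("weight"))

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)   # Adam per Table I; lr assumed
x = torch.rand(256, 2)       # stand-in batch of (eta1, eta2) inputs
y = torch.rand(256, 3)       # stand-in CCC labels
optimizer.zero_grad()
loss = mape_loss(net(x), y) + ridge_penalty(net, lam=1e-4)  # lambda value illustrative
loss.backward()
optimizer.step()
```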
Network hyperparameters. The architecture and other training hyperparameters of the neural networks need to be suitably specified in order to build a robust and effective model that can generalize to unseen datasets. Hyperparameters include, for instance, the size of the network (number of layers, or depth; width or breadth of each layer), the formulation and type of the loss function, the regularization and optimization algorithm, the batch size, the type of initialization, the learning rate, etc. Hyperparameter optimization is an empirical task, and grid search is usually adopted: many networks with several different combinations of values of each hyperparameter are trained and compared based on their accuracy and generalization ability.

TABLE I: Fixed hyperparameters
  Hyperparameter                                   Value
  Activation function                              ReLU
  Optimization algorithm                           Adam
  $L_2$ norm penalization coefficient ($\lambda$)

In principle, all of the hyperparameters in the ML algorithm can be varied, and they might have a significant impact on model performance. However, the predictive capability and generalization of a neural network are mostly controlled by its architecture: the depth and breadth of the network. Therefore, in this study we only perform a systematic search in the network-size space to identify network architectures that are efficient and ensure a low generalization error on different datasets. We consider a matrix of network sizes by varying the depth over five values (1, 3, 5, 7, 10) and the width over five values (3, 5, 7, 15, 30). The learning rate is also varied over a prescribed range. The other fixed hyperparameters are shown in Table I.
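A grid search over this depth-by-width matrix can be sketched as follows. Here train and evaluate_mape are hypothetical helpers standing in for the training loop and the test-set MAPE evaluation described above (neither is named in the paper), and make_mlp is the constructor from the earlier sketch.

```python
import itertools

depths = [1, 3, 5, 7, 10]     # candidate numbers of hidden layers (Sec. III)
widths = [3, 5, 7, 15, 30]    # candidate neurons per hidden layer

results = {}
for depth, width in itertools.product(depths, widths):
    net = make_mlp(depth=depth, width=width)
    train(net, train_data)                                   # hypothetical helper
    results[(depth, width)] = evaluate_mape(net, test_data)  # hypothetical helper

best_depth, best_width = min(results, key=results.get)       # lowest test MAPE
```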
IV. DATA GENERATION WITH PROXY-PHYSICS TURBULENCE SURROGATE

The data needed for training the ML algorithm is generated using the proxy-physics surrogate discussed in Sec. II. For each input parameter, 150 data points are extracted non-uniformly from its range; the total number of data points in the entire parameter space is therefore 22,500. In this work, different investigations are conducted in order to answer three main questions:

1. What is the optimum neural network architecture when training is performed using data from the entire parameter space?

2. What is the optimum neural network architecture when the data cover only one portion of the known physics?

3. What is the optimum neural network architecture when only a narrow band of data is available in the critical regime of the physics?

These three investigations lead to important inferences about generalizability. We generate the training datasets in three different scenarios as follows.

FIG. 5: Training (green circles) and testing (red circles) datasets for (a) Case-1, (b) Case-2, (c) Case-3.
A. Training data over the entire parameter space
This represents the ideal scenario, in which the neural network is trained with data over the entire parameter space. Thus there is no physics that is unseen by the ML model. In this case, generalizability is expected to be trivially straightforward. Sufficient data is gathered over the strain-, shear- and rotation-dominated flow regimes. It should be noted that providing real turbulence data over the entire parameter space would require expensive DNS of a wide range of flows. The generated data points in the parameter space are randomly split into 75% for training (Training Data) and 25% for validation and testing of the model (Testing Data). The testing data is used for model evaluation during architecture optimization. Fig. 5a shows the randomly distributed data in the entire parameter space for this case.

B. Training data only in the strain-dominated region ($D < 0$)

In this case training data is restricted to a subset of the parameter space. Data points in the strain-dominated region ($D < 0$) are used for training, while the data points in the rest of the parameter space ($D > 0$) are reserved for testing.

C. Limited training data in the shear-dominated region
In many instances training data is available only in a very narrow region of the parameter space. In this regard, we examine the optimum network architecture and the generalizability of the ML model trained with a limited dataset that covers shear-driven turbulence. As can be seen from the EARSM equations, shear flow represents the bifurcation region between strain- and rotation-dominated flows. An arbitrary narrow zone around the bifurcation line $D = 0$, defined by bounds on a combination of the invariants $\eta_1$ and $\eta_2$, is chosen as the training region.

V. RESULTS
1. RMSE loss function vs. MAPE loss function
It has been shown that the appropriate formulation of the loss function, and the optimal choice of the flow statistics contributing to it, impact the success of ML-trained turbulence models. As mentioned in Sec. III, we examine the performance of ML models trained with different loss functions. First, we train two deep neural networks with RMSE and MAPE loss functions, without $L_2$ regularization. For this analysis the randomly generated dataset over the entire parameter space is segmented into 75% for training and 25% for testing of the models. The selected fixed network architecture for both cases has 7 hidden layers with 7 neurons in each layer. All other hyperparameters of the models are as shown in Table I.

The performance of the models trained with the different loss functions is reported with both MAPE and RMSE error metrics on the training and testing datasets in the columns of Table II.

TABLE II: Performance of models trained with different loss functions (columns: RMSE-training, RMSE-testing, MAPE-training, MAPE-testing; rows: RMSE-loss function, MAPE-loss function).

It can be seen that the MAPE loss function is better for training the ML algorithm. As mentioned in Sec. II, the CCC have a large range of values over the parameter space. When RMSE is used as the loss function, the training and back-propagation process is dominated by the CCC with larger magnitudes, as the small-valued coefficients contribute very little to the neural network optimization process. Local error contours of the models trained with the different loss functions are shown in Figs. 6 and 7. It can be seen that the ML algorithm trained with MAPE has smaller local errors in the different turbulence physics regions across the entire parameter space.

FIG. 6: Absolute error contours of CCC for ML models trained with different loss functions, 1st row: RMSE, 2nd row: MAPE.

Comparing the different error metrics (MAPE vs. RMSE for the final performance of the models in Table II, and absolute error vs. MAPE for the local error contours in Figs. 6 and 7) clearly shows that the MAPE metric gives an unbiased, easily interpretable presentation of final model performance. Therefore, MAPE is selected as the loss function for the remainder of the analysis.
2. Case–1: Training data over the entire parameter space
As a first experiment, we train the neural networks with the randomly distributed dataset, i.e., 75% for training and 25% for testing of the models. For this case the number of hidden layers in the neural networks varies among 1, 3, 5 and 7, and the number of neurons in each hidden layer varies among 3, 5, 7 and 15. Fig. 8 shows the MAPE error of all 16 network architectures on the training and testing datasets; the number of hidden layers and the number of neurons in each layer are shown on the horizontal and vertical axes, respectively.

FIG. 8: Training and testing MAPE for neural networks with different architectures in Case-1.

Although $L_2$ regularization is not used during ML training for this case, all 16 networks have similar performance on both the training and testing datasets, and the models test quite well without over/under fitting. It can be seen that, for this case with randomly distributed data over the entire parameter space, neural networks with fewer than 21 total neurons, with any combination of layers and neurons, have the worst training and testing errors. Additionally, as the number of layers and the number of neurons in each layer increase, the training and testing errors decrease, and for this case the network architecture with 7 hidden layers and 15 neurons in each layer (7L-15N) has the best performance on the training and testing datasets.

Local MAPE contours of the networks with 7 hidden layers and different numbers of neurons over the entire parameter space are shown in Fig. 9. It is evident that, by increasing the number of neurons in each layer, the error decreases over the entire parameter space, and for the best model in this case (7L-15N) the maximum error occurs mainly near the bifurcation line $D = 0$.
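Local error maps of this kind can be produced by evaluating the trained network against the surrogate pointwise over the grid. A minimal sketch, reusing earsm_ccc and pts from the Sec. II block and net from the Sec. III block (all of which, as noted there, carry assumed details):

```python
import numpy as np
import torch

# Pointwise absolute percentage error of each CCC over the (eta1, eta2) grid,
# suitable for contour plots in the style of Figs. 9 and 11.
E1, E2 = np.meshgrid(pts, pts)
X = torch.tensor(np.stack([E1.ravel(), E2.ravel()], axis=1), dtype=torch.float32)
with torch.no_grad():
    pred = net(X).numpy()                                 # network CCC predictions
true = np.array([earsm_ccc(e1, e2)
                 for e1, e2 in zip(E1.ravel(), E2.ravel())])  # surrogate truth
local_err = np.abs(true - pred) / np.abs(true)            # one error field per CCC
g1_error_map = local_err[:, 0].reshape(E1.shape)          # e.g. the G1 field
```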
3. Case-2: Training data only in the strain-dominated region ($D < 0$)

In this experiment the data points in the strain-dominated region ($D < 0$) are used for training, while the data points in the rest of the parameter space ($D > 0$) are used for testing. An $L_2$ norm regularization, as defined in Eq. (13), is applied during training. The training and testing MAPE of the different network architectures are compared in Fig. 10, and the 7L-15N network is identified as the optimum architecture in this case.

FIG. 10: Training and testing MAPE for neural networks with different architectures in Case-2.

For the selected networks with 7 hidden layers and different numbers of neurons, local MAPE contours over the entire parameter space are shown in Fig. 11. It is evident that the network with the 7L-15N architecture has the best performance over the entire parameter space for all the CCC.

FIG. 11: Local MAPE contours in the entire parameter space for Case-2, 1st row: 7L-5N, 2nd row: 7L-7N, 3rd row: 7L-15N, 4th row: 7L-30N.
4. Case–3: Limited training data in the shear–dominated region
In this scenario the training dataset covers a limited but important region of the parameter space near the bifurcation line $D = 0$. The data points in the rest of the parameter space are used for testing the models. The network architectures considered for this case have hidden layers and neurons per layer varying in the ranges 3 to 10 and 3 to 30, respectively. The performance of all 20 network architectures on the training and testing datasets is compared in Fig. 12. An $L_2$ norm regularization is again applied during training. The 7L-15N network provides the best performance for all the CCC.

FIG. 12: Training and testing MAPE for neural networks with different architectures in Case-3.

FIG. 13: Local MAPE contours in the entire parameter space for Case-3, 1st row: 7L-5N, 2nd row: 7L-7N, 3rd row: 7L-15N, 4th row: 7L-30N.

The main inferences from the three case studies can be summarized as follows. When training is performed with data from the entire parameter space (Case-1), the accuracy of predicting the test cases improves nearly monotonically with network size. As the accuracy of training improves, so does the accuracy in testing, as there is no unseen flow physics. However, when training and testing are performed in different domains of the parameter space (Cases 2 and 3), larger networks do not necessarily lead to improved testing. Very small networks do not perform well, as there are insufficient degrees of freedom to capture the training and testing physics. With increasing network size, testing accuracy improves up to an optimum size at which the training and testing errors are both reasonably small. Beyond this optimal size, larger networks overfit the training data and therefore perform poorly during testing. The optimal network size for these cases is 7 hidden layers with 15 neurons in each layer.
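The three training scenarios reduce to masks on the discriminant $D$ of Eq. (8). A sketch of how the Case-2 and Case-3 splits could be constructed from the grid of the Sec. II block is given below; the coefficients mirror the earsm_ccc sketch, and the band half-width tol is an illustrative stand-in, since the paper's exact band bounds are not legible in the source.

```python
import numpy as np

def cubic_discriminant(eta1, eta2):
    """Discriminant D of the G1 cubic, Eq. (8); D < 0 marks the strain-dominated
    region and D = 0 the bifurcation (shear) line of Fig. 1."""
    p = -2 * L10 / (eta1 * L11)
    q = (L10**2 + eta1 * L11 * L2 - (2 / 3) * eta1 * L3**2
         + 2 * eta2 * L4**2) / (eta1 * L11) ** 2
    r = -L10 * L2 / (eta1 * L11) ** 2
    a = q - p**2 / 3
    b = (2 * p**3 - 9 * p * q + 27 * r) / 27
    return b**2 / 4 + a**3 / 27

grid = np.array([(e1, e2) for e1 in pts for e2 in pts])   # grid from Sec. II sketch
D = np.array([cubic_discriminant(e1, e2) for e1, e2 in grid])

tol = 1e-3                                 # illustrative band half-width only
case2_train = grid[D < 0]                  # Case-2: strain-dominated region
case2_test = grid[D >= 0]
case3_train = grid[np.abs(D) < tol]        # Case-3: narrow band about D = 0
case3_test = grid[np.abs(D) >= tol]
```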
VI. CONCLUSION
ML techniques can potentially enhance the performance of RANS turbulence closure models significantly. Generalizability of ML-assisted turbulence models is one of the key challenges and requires further investigation. Neural network hyperparameters and training techniques can significantly impact the predictive capability and generalizability of ML turbulence models. In this study, the influence of neural network architecture (number of hidden layers and neurons) and training process elements (type of loss function) on generalizability is systematically investigated over a broad parameter space of two-dimensional homogeneous turbulence states.

The key novelty of this work is the adoption of a simplified proxy-physics turbulence surrogate, incorporating important features of real homogeneous flows, to generate the training data used to assess generalizability characteristics. An important advantage of this approach is that training data for all flows in the parameter space can be generated easily. In contrast, DNS or experiments would be prohibitively expensive and may not even be feasible for all flows in the parameter space. Successful generalizability of ML models in this proxy-physics turbulence system is a necessary but not a sufficient condition for ML-model generalizability in actual turbulent flows. Nevertheless, this study provides valuable insight into the generalizability characteristics of different network architectures in turbulence-like phenomena.

The main inferences from this study are now listed:

1. A loss function for training based on MAPE may be better than RMSE for deep feed-forward neural networks of turbulence CCC. This is due to the fact that MAPE limits the maximum model error throughout the parameter space. Thus, the possibility of catastrophic model performance in some part of the parameter space is minimized. On the other hand, RMSE focuses only on the average error over the parameter space, without limiting the error at any specific location. Therefore, while the model may do well in most regions of the parameter space, significant failure in some parts cannot be ruled out. Clearly, a loss function based on a combination of MAPE and RMSE could be useful and will be investigated in future studies.

2. As expected, if training is performed with data obtained from the entire parameter space, the ML model does well in testing. In such a case, the accuracy of the model in the testing phase increases nearly monotonically with increasing neurons (breadth) and layers (depth) of the network.

3. If training is performed with data obtained from a limited region of the parameter space and testing is done in other regions, the performance does not improve monotonically with network size. Starting from smaller networks, the training and testing accuracy improves with increasing network size. Beyond an optimal network size, the predictive capability deteriorates due to overfitting of the training data.

4. The optimal network size will depend on the number of coefficients in the constitutive relationship. For the three-CCC model used in this study, a seven-layer, fifteen-neuron network performed best. This network also yielded reasonable behavior in the full-parameter-space training and is hence recommended for use in practical applications.

In summary, we establish a proxy-physics approach to determine the optimal network architecture and training procedure for a given form of the turbulence constitutive relationship. We propose that such a preliminary study, identifying the optimal network hyperparameters, be performed as an initial step in developing ML-RANS models. Such practice will provide more rigor to the training procedure and lead to a higher likelihood of success of ML turbulence models.
ACKNOWLEDGMENTS
The authors acknowledge support provided by Texas A&M High Performance Research Com-puting center.
DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
REFERENCES

G. Blaisdell and K. Shariff, "Simulation and modeling of the elliptic streamline flow," in Proceedings of the Summer Program (Citeseer, 1996) p. 1.
S. S. Girimaji, "Pressure-strain correlation modelling of complex turbulent flows," Journal of Fluid Mechanics, 91-123 (2000).
C. G. Speziale, "Analytical methods for the development of Reynolds-stress closures in turbulence," Annu. Rev. Fluid Mech., 57 (1991).
A. A. Mishra and S. S. Girimaji, "Pressure-strain correlation modeling: towards achieving consistency with rapid distortion theory," Flow, Turbulence and Combustion, 593-619 (2010).
A. A. Mishra and S. S. Girimaji, "Toward approximating non-local dynamics in single-point pressure-strain correlation closures," Journal of Fluid Mechanics, 168-188 (2017).
C. G. Speziale, R. Abid, and G. A. Blaisdell, "On the consistency of Reynolds stress turbulence closures with hydrodynamic stability theory," Physics of Fluids, 781-788 (1996).
S. S. Girimaji, "Partially-averaged Navier-Stokes model for turbulence: A Reynolds-averaged Navier-Stokes to direct numerical simulation bridging method," Journal of Applied Mechanics, 413-421 (2006).
S. S. Girimaji, E. Jeong, and R. Srinivasan, "Partially averaged Navier-Stokes method for turbulence: Fixed point analysis and comparison with unsteady partially averaged Navier-Stokes," Journal of Applied Mechanics, 422 (2006).
J. Ling, A. Kurzawski, and J. Templeton, "Reynolds averaged turbulence modelling using deep neural networks with embedded invariance," Journal of Fluid Mechanics, 155-166 (2016).
Z. Zhang, X.-d. Song, S.-r. Ye, Y.-w. Wang, C.-g. Huang, Y.-r. An, and Y.-s. Chen, "Application of deep learning method to Reynolds stress models of channel flow based on reduced-order modeling of DNS data," Journal of Hydrodynamics, 58-65 (2019).
R. Fang, D. Sondak, P. Protopapas, and S. Succi, "Neural network models for the anisotropic Reynolds stress tensor in turbulent channel flow," Journal of Turbulence, 525-543 (2020).
C. Jiang, J. Mi, S. Laima, and H. Li, "A novel algebraic stress model with machine-learning-assisted parameterization," Energies, 258 (2020).
C. Sotgiu, B. Weigand, K. Semmler, and P. Wellinger, "Towards a general data-driven explicit algebraic Reynolds stress prediction framework," International Journal of Heat and Fluid Flow, 108454 (2019).
N. Geneva and N. Zabaras, "Quantifying model form uncertainty in Reynolds-averaged turbulence models with Bayesian deep neural networks," Journal of Computational Physics, 125-147 (2019).
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Vol. 1 (MIT Press, Cambridge, 2016).
K. Anand, Z. Wang, M. Loog, and J. van Gemert, "Black magic in deep learning: How human skill impacts network training," arXiv preprint arXiv:2008.05981 (2020).
T. Gatski, "On explicit algebraic stress models for complex turbulent flows," J. Fluid Mech., 59-78 (1993).
S. S. Girimaji, "Fully explicit and self-consistent algebraic Reynolds stress model," Theoretical and Computational Fluid Dynamics, 387-402 (1996).
S. Taghizadeh, F. D. Witherden, and S. S. Girimaji, "Turbulence closure modeling with data-driven techniques: physical compatibility and consistency considerations," New Journal of Physics, 093023 (2020).
S. S. Girimaji, "Lower-dimensional manifold (algebraic) representation of Reynolds stress closure equations," Theoretical and Computational Fluid Dynamics, 259-281 (2001).
C. A. Gomez and S. S. Girimaji, "Explicit algebraic Reynolds stress model (EARSM) for compressible shear flows," Theoretical and Computational Fluid Dynamics, 171-196 (2014).
C. Speziale, S. Sarkar, and T. Gatski, "Modelling the pressure-strain correlation of turbulence: an invariant dynamical systems approach," Journal of Fluid Mechanics, 245-272 (1991).
S. Cai, Z. Wang, L. Lu, T. A. Zaki, and G. E. Karniadakis, "DeepM&Mnet: Inferring the electroconvection multiphysics fields based on operator approximation by neural networks," arXiv preprint arXiv:2009.12935 (2020).
G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, Vol. 112 (Springer, 2013).
D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014).
X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (JMLR Workshop and Conference Proceedings, 2010) pp. 249-256.
R. L. Thompson, L. E. B. Sampaio, F. A. de Bragança Alves, L. Thais, and G. Mompean, "A methodology to evaluate statistical errors in DNS data of plane channel flows," Computers & Fluids, 1-7 (2016).
S. V. Poroseva, J. D. Colmenares F., and S. M. Murman, "On the accuracy of RANS simulations with DNS data," Physics of Fluids, 115102 (2016).
W. Rodi, "A new algebraic relation for calculating the Reynolds stresses," in Gesellschaft Angewandte Mathematik und Mechanik Workshop, Paris, France, Vol. 56 (1976) p. 219.
D. B. Taulbee, "An improved algebraic Reynolds stress model and corresponding nonlinear stress model," Physics of Fluids A: Fluid Dynamics, 2555-2561 (1992).
S. Pope, "A more general effective-viscosity hypothesis," Journal of Fluid Mechanics 72.