Recovering missing CFD data for high-order discretizations using deep neural networks and dynamics learning
Kevin T. Carlberg^a,∗, Antony Jameson^c, Mykel J. Kochenderfer^b, Jeremy Morton^b, Liqian Peng^a, Freddie D. Witherden^b (authors are listed in alphabetical order)

^a Sandia National Laboratories, ^b Stanford University, ^c Texas A&M University
Abstract
Data I/O poses a significant bottleneck in large-scale CFD simulations; thus, practitioners would like to significantly reduce the number of times the solution is saved to disk, yet retain the ability to recover any field quantity (at any time instance) a posteriori. The objective of this work is therefore to accurately recover missing CFD data a posteriori at any time instance, given that the solution has been written to disk at only a relatively small number of time instances. We consider in particular high-order discretizations (e.g., discontinuous Galerkin), as such techniques are becoming increasingly popular for the simulation of highly separated flows. To satisfy this objective, this work proposes a methodology consisting of two stages: 1) dimensionality reduction and 2) dynamics learning. For dimensionality reduction, we propose a novel hierarchical approach. First, the method reduces the number of degrees of freedom within each element of the high-order discretization by applying autoencoders from deep learning. Second, the methodology applies principal component analysis to compress the global vector of encodings. This leads to a low-dimensional state, which associates with a nonlinear embedding of the original CFD data. For dynamics learning, we propose to apply regression techniques (e.g., kernel methods) to learn the discrete-time velocity characterizing the time evolution of this low-dimensional state. A numerical example on a large-scale CFD problem characterized by nearly 13 million degrees of freedom illustrates the suitability of the proposed method in an industrial setting.
Keywords:
CFD, high-order schemes, deep learning, autoencoders, dynamics learning, machine learning
1. Introduction
Industrial practitioners of computational fluid dynamics (CFD) often desire high-fidelity scale-resolving simulations of transient compressible flows within the vicinity of complex geometries. For example, to improve the design of next-generation aircraft, one must simulate—at flight-relevant Reynolds and Mach numbers—highly separated flow over deployed spoilers/air-brakes; separated flow within serpentine intake ducts; and flow over entire vehicle configurations at off-design conditions. In order to perform these simulations, it is necessary to solve the unsteady compressible Navier–Stokes equations with a high level of fidelity.

Theoretical studies and numerical experiments have shown high-order accurate discontinuous spectral element methods (DSEM) to be particularly well suited for these types of simulations [52, 49, 35]. Indeed, we remark that some of the largest CFD simulations to date—involving hundreds of billions of degrees of freedom—have been performed using these methods [51]. DSEMs work by first decomposing the domain into a set of conforming elements and then representing the solution inside of each element by a polynomial. However, in contrast to classical finite element methods, the solution is permitted to be discontinuous across element boundaries. Examples of DSEMs include discontinuous Galerkin (DG) methods [38, 10, 15], spectral difference (SD) schemes [26, 29, 43], and flux reconstruction (FR) schemes [18].

On account of the large number of spatial degrees of freedom (arising from the large number of discretization elements and high-order polynomials within each element) and the small time steps that are often required for accuracy, such simulations have the potential to generate petabytes of solution data; this is well beyond what can be handled by current I/O and storage sub-systems. Therefore, it is not practical to write out a sequence of finely spaced solution snapshots to disk for offline analysis and post-processing. Instead, the practitioner must instrument the simulation in advance. Such instrumentation typically includes the online accumulation of time averages of various volumetric expressions, the integration of forces on boundaries, and the regular sampling of the flow field at specific points in the volume. However, doing this effectively requires a degree of a priori knowledge about the dynamics and evolution of the system, which negates many of the exploratory advantages inherent to simulation.

A common strategy for reducing the storage overhead associated with CFD simulations is to compress the data before writing to disk [20, 39, 47]. Although these approaches can substantially reduce file sizes, they do not necessarily reduce the overall number of time instances at which a file must be written. In this paper, we instead consider the problem of accurately recovering the full CFD solution at any time instance a posteriori, given that it has been written at only a relatively small number of time instances. Several methods have been proposed to reconstruct missing simulation data, especially in the context of CFD. However, most of these methods either aim to reconstruct missing spatial data (which is not the focus of this work) [7, 55] or are not amenable to scenarios where no data has been written at some time instances [2, 48, 12].

To address this work's objective, we propose a novel two-stage strategy that leverages recent innovations in machine learning: 1) dimensionality reduction and 2) dynamics learning.

The first stage is dimensionality reduction. We propose a novel hierarchical approach to dimensionality reduction that leverages the structure of high-order discretizations. It comprises local compression using autoencoders from deep learning, followed by global compression using principal component analysis (PCA).
This stage computes an accurate, low-dimensional, nonlinear embedding of the original data while remaining computationally tractable for large-scale data.

For local compression, we reduce the large (e.g., hundreds) number of degrees of freedom per element in the mesh via autoencoders; this step leverages the structure of high-order discretizations. An autoencoder is a (typically deep) neural network that performs dimensionality reduction by learning a nonlinear mapping from a high-dimensional feature space to a lower-dimensional encoding [16]. A standard autoencoder consists of two components: an encoder network that performs the mapping between input data and the encodings, and a decoder network that attempts to reconstruct the original input from the encodings. The autoencoder and its variants have previously been used extensively for discovering low-dimensional representations of image data [50, 25, 45]. However, rather than targeting image data, we use autoencoders for local compression of CFD data. Recent work has explored using autoencoder architectures for discovering low-dimensional representations of fluid flow, primarily for the purposes of simulation and flow control [54, 34, 30, 37, 31]. In these studies, entire solution states could be treated as neural-network inputs (i.e., features) due to the relatively small number of degrees of freedom in the systems being simulated.

Unfortunately, memory constraints prohibit training autoencoders on full solution states for many large-scale CFD simulations. Furthermore, due to the large number of parameters that must be optimized, neural networks typically require many training examples; this work assumes that the solution has been written at only a relatively small number of time instances. The proposed method avoids these issues: the features correspond to the degrees of freedom within a single element (and so pose no memory issue), and the number of training examples is the number of time instances at which the solution has been written multiplied by the number of elements in the mesh (which may be very large, e.g., tens of thousands).

Global compression is necessary because, even if autoencoders significantly reduce the number of degrees of freedom in each element, the number of elements in the mesh may still be large. We therefore apply classical principal component analysis (with some modifications to handle large-scale data) to reduce the dimensionality of the vector of local encodings across the entire spatial domain. Although PCA identifies a linear subspace, PCA in combination with autoencoders yields a nonlinear embedding of the global CFD solution. In the past, PCA has been widely applied to global data in the context of CFD, but often under the name proper orthogonal decomposition (POD) [17, 8]. However, to our knowledge, PCA has not yet been applied to a vector of autoencoder codes.

The second stage is dynamics learning. Because this work assumes that the solution has been written to disk at only a relatively small subset of the desired time instances, we must devise an approach for reconstructing the solution at the missing time instances.
For this purpose, we propose learning the dynamics. In particular, we assume that there exists a Markovian discrete-time dynamical system that describes the time evolution of the low-dimensional latent state (obtained after applying both local and global compression) on the time grid of interest; of course, such a dynamical system may not exist due to the closure problem. Next, we aim to approximate the discrete-time velocity characterizing this hypothesized dynamical system. We regress the (possibly nonlinear) mapping from the low-dimensional latent state at the current time instance to the low-dimensional latent state at the next time instance. For this purpose, we investigate the viability of a wide range of regression techniques within a machine-learning framework for model selection, including support vector regression [42], random forests [5], boosted decision trees [13], k-nearest neighbors [1], the vectorial kernel orthogonal greedy algorithm (VKOGA) [58, 59], sparse identification of nonlinear dynamics (SINDy) [6], and dynamic mode decomposition (DMD) [41].

Dynamics learning is an active area of research, and many approaches have been proposed for learning dynamics in multiple fields. However, none has been applied for the purpose of reconstructing missing simulation data, and few of these methods are applicable to the kind of data we consider: very large-scale states, and relatively few time instances at which the state has been recorded. For example, the field of data-driven dynamical systems aims to learn (typically continuous-time) dynamics from observations of the state or velocity. Many of these methods enforce linear dynamics: DMD [41] enforces linear dynamics for the full-system state, while methods based on Koopman theory enforce linear dynamics on a finite number of system observables, which can be specified either manually [57, 56, 22] or learned using autoencoders [44, 34, 30, 31]. Other works have enabled nonlinear dynamics, but a rich library of candidate basis functions must be manually specified [6], and model selection using validation data has been limited. In contrast to these methods, our work aims to model nonlinear dynamics with a large candidate set of functional forms for the velocity, and subsequently performs model selection using validation data. We also emphasize that the aforementioned autoencoder approaches are not applicable to the large-scale data sets we consider, as the entire state is treated as an input to the autoencoder.

Relatedly, the field of state representation learning [28, 3] aims to learn simultaneously both an underlying low-dimensional latent space and discrete-time dynamics on this latent space from a set of high-dimensional observations (e.g., raw pixels from a camera), typically in a control or reinforcement-learning context. The dynamics are typically constrained to be (locally) linear to facilitate control [11, 53, 21]. In contrast to these approaches, we are not interested in control and thus need not restrict the dynamics to be linear.
Finally, these methods would encounter difficulties when applied to the data sets we consider (i.e., a high-dimensional observation space, but data at relatively few time instances), as simultaneously learning an autoencoder and dynamics on an observation space of dimension $\sim 10^7$ requires a substantial amount of computation and data.

Another contribution of this work is the application of the proposed methodology to a large-scale industrial CFD application characterized by a high-order discretization with 320 degrees of freedom per element and 40 584 elements, yielding nearly $1.3 \times 10^7$ degrees of freedom. These numerical experiments demonstrate the viability of the proposed method, as it significantly reduces the dimension of the solution to only 500 (for a compression ratio of 26 000 : 1), yet accurately recovers CFD data, as it incurs sub-1% relative errors in the global solution and drag over all desired time instances. The numerical experiments show that the VKOGA algorithm yields the best performance for dynamics learning, due both to its low regression test error and its boundedness. In contrast, while SINDy yielded low regression test error, it yielded an unstable dynamical system due to the unboundedness of the polynomial basis functions it employs.

The remainder of the paper is organized as follows. Section 2 provides the problem formulation. Section 3 describes the two-stage dimensionality-reduction process; here, Section 3.1 describes local compression using autoencoders, and Section 3.2 describes global compression using PCA. Section 4 describes dynamics learning, wherein Section 4.1 precisely describes the regression setting, Section 4.2 describes the required training data, and Section 4.3 provides a description of all considered regression models. Section 4.4 provides some analysis related to boundedness of the learned discrete-time velocity, which has implications for the stability of the resulting dynamical-system model. Section 5 reports the numerical experiments. Finally, Section 6 concludes the paper.
2. Problem formulation: recovering a state sequence for high-order discretizations
The objective of this work is to recover a sequence of states $\{w^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_w}$ given only the initial state $w^0 \in \mathbb{R}^{N_w}$ and a subset of the elements of the sequence $\{w^n\}_{n \in \mathcal{T}_{\text{sample}}}$ with $\mathcal{T}_{\text{sample}} \subset \{1, \dots, N_t\}$. This scenario often arises in CFD applications, where the practitioner is interested in having access to the fluid solution at many time instances, but it is prohibitively expensive to write the solution to disk for all time instances of interest. In this case, the state corresponds to the vector of fluid variables over the spatial domain, and the sequence corresponds to the value of these degrees of freedom at time instances of interest (which need not correspond to the time steps taken by the time integrator).

This work aims to satisfy this objective in the context of high-order spatial discretizations (e.g., discontinuous Galerkin) of partial differential equations (PDEs), which also often arise in CFD applications. Such discretizations are characterized by $N_{\text{el}}$ elements, each of which has $N_{w,\text{el}} \gg 1$ local degrees of freedom. Mathematically, we can characterize the state as the vectorization of local states as
$$w := \operatorname{vec}(w_1, \dots, w_{N_{\text{el}}}), \qquad (2.1)$$
where $w_i \in \mathbb{R}^{N_{w,\text{el}}}$, $i = 1, \dots, N_{\text{el}}$ denotes the local state associated with the $i$th element, and $w \in \mathbb{R}^{N_w}$ denotes the global state with $N_w = N_{\text{el}} N_{w,\text{el}}$.

To achieve the stated objective, we adopt a two-stage process:

1. Dimensionality reduction. We apply hierarchical dimensionality reduction to the state $w$ in two steps. First, we apply autoencoders to reduce the local dimensionality, i.e., the dimensionality of the local states $w_i$, $i = 1, \dots, N_{\text{el}}$ (Section 3.1). Second, we apply principal component analysis (PCA) to globally reduce the dimensionality of the set of autoencoder-compressed local states (Section 3.2). This results in a low-dimensional state $x = r(w) \in \mathbb{R}^{N_x}$ with $N_x \ll N_w$, from which the full state can be approximated as $w \approx p(x) \in \mathbb{R}^{N_w}$. Here, $r : \mathbb{R}^{N_w} \to \mathbb{R}^{N_x}$ is the restriction operator associated with dimensionality reduction, and $p : \mathbb{R}^{N_x} \to \mathbb{R}^{N_w}$ is the prolongation operator.

2. Dynamics learning. We regress the discrete-time dynamics of the low-dimensional state $x$ (Section 4) under the assumption that the sequence of low-dimensional states $\{x^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_x}$ can be recovered from the Markovian dynamical system
$$x^{n+1} = f(x^n), \quad n = 0, \dots, N_t - 1 \qquad (2.2)$$
with $x^0 = r(w^0)$ for some unknown velocity $f$. Then, we construct a regression approximation to the discrete-time velocity $\tilde{f} \approx f$. This allows the desired sequence $\{w^n\}_{n=0}^{N_t}$ to be approximated as $\{\tilde{w}^n\}_{n=0}^{N_t}$, where $\tilde{w}^0 = w^0$ and $\tilde{w}^n = p(\tilde{x}^n)$, $n = 1, \dots, N_t$. Here, the approximated low-dimensional state evolves according to the approximated dynamics
$$\tilde{x}^{n+1} = \tilde{f}(\tilde{x}^n), \quad n = 0, \dots, N_t - 1 \qquad (2.3)$$
with $\tilde{x}^0 = r(w^0)$.

We note that while stage 2 could be executed without stage 1, learning discrete-time dynamics of a low-dimensional state requires substantially less data and computational effort than doing so for a high-dimensional state.
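To make the two-stage recovery concrete, the following minimal Python sketch implements the online phase of Eqs. (2.2)–(2.3) under the assumption that the restriction operator $r$, prolongation operator $p$, and learned velocity $\tilde{f}$ are available as callables; the names `restrict`, `prolong`, and `velocity` are hypothetical placeholders, not part of the paper's implementation.

```python
def recover_states(w0, restrict, prolong, velocity, n_steps):
    """Approximate the sequence {w^n} given only the initial state w^0.

    restrict : callable, r : R^{N_w} -> R^{N_x}  (encoder + PCA projection)
    prolong  : callable, p : R^{N_x} -> R^{N_w}  (PCA lift + decoder)
    velocity : callable, learned discrete-time velocity R^{N_x} -> R^{N_x}
    """
    x = restrict(w0)                  # x~^0 = r(w^0)
    recovered = [w0]                  # w~^0 = w^0 is known exactly
    for _ in range(n_steps):
        x = velocity(x)               # x~^{n+1} = f~(x~^n), Eq. (2.3)
        recovered.append(prolong(x))  # w~^{n+1} = p(x~^{n+1})
    return recovered
```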
3. Hierarchical dimensionality reduction for high-order discretizations
This section describes the first stage of the proposed methodology: dimensionality reduction. Section 3.1describes local compression using autoencoders, and Section 3.2 describes global compression using PCA.
3.1. Local compression using autoencoders

The first step of the proposed approach is to reduce the dimensionality of the local states. To achieve this, we make the observation that the total number of available training examples is equal to the number of samples $|\mathcal{T}_{\text{sample}}|$ multiplied by the number of elements $N_{\text{el}}$, where $|\cdot|$ denotes the cardinality of a set. For fine spatial discretizations, $N_{\text{el}}$ is large, which implies that we may have access to a large amount of training data, even for a single simulation with relatively few sampled time instances. Due to this fact, we apply autoencoders, as they often enable very accurate nonlinear embeddings, yet require a large amount of training data.

In the present context, autoencoders aim to find two mappings: the encoder $\phi : \mathbb{R}^{N_{w,\text{el}}} \to \mathbb{R}^{N_{\hat{w},\text{el}}}$ that maps a local state $w_i \in \mathbb{R}^{N_{w,\text{el}}}$ to a lower-dimensional representation $\hat{w}_i \in \mathbb{R}^{N_{\hat{w},\text{el}}}$, i.e., $\phi : w_i \mapsto \hat{w}_i$, $i = 1, \dots, N_{\text{el}}$; and the decoder $\psi : \mathbb{R}^{N_{\hat{w},\text{el}}} \to \mathbb{R}^{N_{w,\text{el}}}$. Here, $N_{\hat{w},\text{el}}\ (\ll N_{w,\text{el}})$ denotes the reduced dimension of the local state vector, or code. Its value dictates the tradeoff between compression and reconstruction accuracy.

The encoder $\phi$ takes the form of a feed-forward neural network with $N_{\text{layers}}$ layers constructed such that
$$\phi(w_i; \theta) = h_{N_{\text{layers}}}(\,\cdot\,; \Theta_{N_{\text{layers}}}) \circ h_{N_{\text{layers}}-1}(\,\cdot\,; \Theta_{N_{\text{layers}}-1}) \circ \cdots \circ h_1(w_i; \Theta_1), \qquad (3.1)$$
where $h_i(\,\cdot\,; \Theta_i) : \mathbb{R}^{p_{i-1}} \to \mathbb{R}^{p_i}$, $i = 1, \dots, N_{\text{layers}}$ denotes the function applied at layer $i$ of the neural network; $\Theta_i$, $i = 1, \dots, N_{\text{layers}}$ denote the weights employed at layer $i$; $p_i$ denotes the dimensionality of the output at layer $i$; and $\theta \equiv (\Theta_1, \dots, \Theta_{N_{\text{layers}}})$. The input is of dimension $p_0 = N_{w,\text{el}}$ and the final (output) layer ($i = N_{\text{layers}}$) produces the code $\hat{w}_i = \phi(w_i)$ such that $p_{N_{\text{layers}}} = N_{\hat{w},\text{el}}$. An activation function $h$ is applied in layers 1 to $N_{\text{layers}}$ to some function of the weights and the outputs from the previous layer $y_{i-1} \in \mathbb{R}^{p_{i-1}}$ such that
$$h_i(y_{i-1}; \Theta_i) = h(g(\Theta_i, y_{i-1})), \qquad (3.2)$$
where $g(\Theta_i, y_{i-1}) = \Theta_i [1; y_{i-1}]$ with $\Theta_i$ a real-valued matrix for a traditional multilayer perceptron (MLP), while $g$ corresponds to a convolution operator for a convolutional neural network with $\Theta_i$ providing the convolutional-filter weights. Note that $y_0 := w_i$. The activation function is applied element-wise to the vector argument; common choices include the rectified linear unit (ReLU), the logistic sigmoid, and the hyperbolic tangent.

Analogously, the decoder $\psi$ also takes the form of a feed-forward neural network with $\bar{N}_{\text{layers}}$ layers such that
$$\psi(\hat{w}_i; \bar{\theta}) = \bar{h}_{\bar{N}_{\text{layers}}}(\,\cdot\,; \bar{\Theta}_{\bar{N}_{\text{layers}}}) \circ \bar{h}_{\bar{N}_{\text{layers}}-1}(\,\cdot\,; \bar{\Theta}_{\bar{N}_{\text{layers}}-1}) \circ \cdots \circ \bar{h}_1(\hat{w}_i; \bar{\Theta}_1), \qquad (3.3)$$
with $\bar{h}_i(\,\cdot\,; \bar{\Theta}_i) : \mathbb{R}^{\bar{p}_{i-1}} \to \mathbb{R}^{\bar{p}_i}$, $i = 1, \dots, \bar{N}_{\text{layers}}$, and $\bar{\theta} \equiv [\operatorname{vec}(\bar{\Theta}_1); \dots; \operatorname{vec}(\bar{\Theta}_{\bar{N}_{\text{layers}}})]$. The input is of dimension $\bar{p}_0 = N_{\hat{w},\text{el}}$ and the final (output) layer ($i = \bar{N}_{\text{layers}}$) produces the high-dimensional representation $\tilde{w}_i = \psi(\hat{w}_i)$ such that $\tilde{w}_i \approx w_i$ and $\bar{p}_{\bar{N}_{\text{layers}}} = N_{w,\text{el}}$. As before,
$$\bar{h}_i(y_{i-1}; \bar{\Theta}_i) = h(g(\bar{\Theta}_i, y_{i-1})), \qquad (3.4)$$
where $y_{i-1} \in \mathbb{R}^{\bar{p}_{i-1}}$ is the output from the previous layer, and $y_0 := \hat{w}_i$.
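To make the compositional definitions (3.1)–(3.4) concrete, the following NumPy sketch assembles a two-layer MLP encoder as a composition of layers of the form $h(\Theta_i [1; y])$ with ReLU activations; the layer widths are illustrative assumptions, and the decoder would mirror the same construction with weights $\bar{\Theta}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
N_W_EL, P1, N_CODE = 320, 64, 24   # p0 = N_w,el, hidden width, p2 = N_w-hat,el

def layer(Theta):
    """h_i(y; Theta) = h(Theta [1; y]) with ReLU activation, Eq. (3.2)."""
    return lambda y: np.maximum(Theta @ np.concatenate(([1.0], y)), 0.0)

Theta1 = rng.standard_normal((P1, N_W_EL + 1)) * 0.01
Theta2 = rng.standard_normal((N_CODE, P1 + 1)) * 0.01

def encoder(w):
    """phi(w; theta) = h2(.; Theta2) o h1(w; Theta1), as in Eq. (3.1)."""
    return layer(Theta2)(layer(Theta1)(w))

code = encoder(rng.standard_normal(N_W_EL))  # maps R^320 -> R^24
```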
To train the autoencoder, we use data corresponding to the local states at a subset of time instances $\mathcal{T}_{\text{autoencoder}} \subseteq \mathcal{T}_{\text{sample}}$, i.e., the training data corresponds to $\{w_i^n\}_{i=1,\dots,N_{\text{el}}}^{n \in \mathcal{T}_{\text{autoencoder}}}$. We then compute the weights $(\theta^\star, \bar{\theta}^\star)$ by solving the minimization problem
$$\underset{(\theta, \bar{\theta})}{\operatorname{minimize}} \sum_{n \in \mathcal{T}_{\text{autoencoder}}} \sum_{i=1}^{N_{\text{el}}} \| w_i^n - \psi(\,\cdot\,; \bar{\theta}) \circ \phi(w_i^n; \theta) \|_2^2 + \Omega(\theta, \bar{\theta}), \qquad (3.5)$$
where $\Omega$ denotes a regularization function, e.g., ridge, lasso, elastic-net [61]. We note that this optimization problem is often solved approximately to improve generalization behavior; for instance, optimization iterations are often terminated when the error on an independent validation set begins to increase; see Ref. [4] for a review of training deep neural networks. The encoder and decoder are then set to $\phi = \phi(\,\cdot\,; \theta^\star)$ and $\psi = \psi(\,\cdot\,; \bar{\theta}^\star)$, respectively.

After applying the encoder, the global reduced state takes the form
$$\hat{w} := \operatorname{vec}(\hat{w}_1, \dots, \hat{w}_{N_{\text{el}}}) \in \mathbb{R}^{N_{\text{el}} N_{\hat{w},\text{el}}}, \qquad (3.6)$$
where $\hat{w}_i = \phi(w_i)$, $i = 1, \dots, N_{\text{el}}$. Note that we can also write $\hat{w} = \phi_{\text{global}}(w)$, where
$$\phi_{\text{global}} : w \mapsto \operatorname{vec}(\phi(w_1), \dots, \phi(w_{N_{\text{el}}})) : \mathbb{R}^{N_w} \to \mathbb{R}^{N_{\text{el}} N_{\hat{w},\text{el}}}. \qquad (3.7)$$
The dimension of the global reduced state $\hat{w}$ is $N_{\text{el}} N_{\hat{w},\text{el}}$; while this is significantly smaller than the dimension $N_w = N_{\text{el}} N_{w,\text{el}}$ of the full state $w$, it may still be large if the number of elements $N_{\text{el}}$ is large, as is the case for fine spatial discretizations. To address this, Section 3.2 describes how PCA can be applied to reduce the dimensionality of the global reduced state $\hat{w}$.
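The training problem (3.5) can be sketched as follows in PyTorch, assuming the snapshots $\{w_i^n\}$ have been gathered into a single array with one row per (time instance, element) pair; the fully connected architecture, learning rate, and `weight_decay` value (which plays the role of a ridge penalty $\Omega$) are illustrative assumptions rather than the convolutional architecture used in Section 5.

```python
import torch
import torch.nn as nn

N_W_EL, N_CODE = 320, 24  # local-state and code dimensions (values from Section 5)

encoder = nn.Sequential(nn.Linear(N_W_EL, 128), nn.ReLU(),
                        nn.Linear(128, N_CODE))
decoder = nn.Sequential(nn.Linear(N_CODE, 128), nn.ReLU(),
                        nn.Linear(128, N_W_EL))

# snapshots: one row per (time instance, element) pair; random stand-in data
snapshots = torch.randn(10_000, N_W_EL)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3, weight_decay=1e-6)  # weight_decay ~ ridge Omega
loss_fn = nn.MSELoss()

for epoch in range(100):
    for batch in snapshots.split(256):
        opt.zero_grad()
        recon = decoder(encoder(batch))  # psi(phi(w_i^n))
        loss = loss_fn(recon, batch)     # ||w_i^n - psi(phi(w_i^n))||^2 term
        loss.backward()
        opt.step()
```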
3.2. Global compression using principal component analysis

Because the dimensionality of the global reduced state $\hat{w} = \phi_{\text{global}}(w) \in \mathbb{R}^{N_{\text{el}} N_{\hat{w},\text{el}}}$ may still be large, we proceed by applying dimensionality reduction to this global quantity. However, in this case we have many fewer training examples than in the case of local compression. This arises from the fact that local compression employs a training set whose cardinality is the product of the number of training time instances and the number of elements (which is often large); the size of the global-compression training set is simply the number of training time instances. Thus, for global compression, we use PCA, a dimensionality-reduction method that does not rely on access to a large amount of data. Furthermore, because we are usually considering large-scale data, it may be computationally costly to compute global principal components to represent the global reduced state vector $\hat{w}$; thus, we consider piecewise PCA for this purpose.

To achieve this, we first define a set of training instances $\mathcal{T}_{\text{PCA}} \equiv \{n_{\text{PCA},j}\}_j \subseteq \mathcal{T}_{\text{sample}}$ and associated global reduced states $\{\hat{w}^n\}_{n \in \mathcal{T}_{\text{PCA}}}$ that will be employed to compute the principal components. Then, we decompose the global reduced state into $n_{\hat{w}}$ components $\hat{w}_i$, $i = 1, \dots, n_{\hat{w}}$ such that $\hat{w} \equiv [\hat{w}_1^T \cdots \hat{w}_{n_{\hat{w}}}^T]^T$. Next, we compute the singular value decompositions
$$[(\hat{w}_i^{n_{\text{PCA},1}} - \bar{w}_i) \cdots (\hat{w}_i^{n_{\text{PCA},|\mathcal{T}_{\text{PCA}}|}} - \bar{w}_i)] = U_i \Sigma_i V_i^T, \quad i = 1, \dots, n_{\hat{w}}, \qquad (3.8)$$
where
$$\bar{w}_i := \frac{1}{|\mathcal{T}_{\text{PCA}}|} \sum_{n \in \mathcal{T}_{\text{PCA}}} \hat{w}_i^n, \quad i = 1, \dots, n_{\hat{w}} \qquad (3.9)$$
denote the sample means, and
$$U_i \equiv [u_{i,1} \cdots u_{i,|\mathcal{T}_{\text{PCA}}|}], \quad \Sigma_i = \operatorname{diag}(\sigma_{i,j})_{j=1}^{|\mathcal{T}_{\text{PCA}}|} \ \text{with} \ \sigma_{i,1} \geq \cdots \geq \sigma_{i,|\mathcal{T}_{\text{PCA}}|} \geq 0. \qquad (3.10)$$
Subsequently, PCA truncates the left singular vectors to obtain the basis matrix
$$\Phi_i = \left[ u_{i,1} \cdots u_{i,N_{x,i}} \right], \qquad (3.11)$$
where $N_{x,i}$ can be set according to a statistical energy criterion, for example. We note that the resulting basis matrix satisfies the minimization problem
$$\operatorname{Ran}(\Phi_i) = \underset{\mathcal{S},\ \dim(\mathcal{S}) = N_{x,i}}{\arg\min} \sum_{j=1}^{|\mathcal{T}_{\text{PCA}}|} \| (I - P_{\mathcal{S}})(\hat{w}_i^{n_{\text{PCA},j}} - \bar{w}_i) \|_2^2, \qquad (3.12)$$
where $P_{\mathcal{S}}$ denotes the orthogonal projector (in the Euclidean norm) onto the subspace $\mathcal{S}$, and minimization is taken over all subspaces of dimension $N_{x,i}$ (i.e., the Grassmannian).

We then compute the singular value decomposition
$$[\operatorname{diag}(\Phi_i)]^T [(\hat{w}^{n_{\text{PCA},1}} - \bar{w}) \cdots (\hat{w}^{n_{\text{PCA},|\mathcal{T}_{\text{PCA}}|}} - \bar{w})] = U \Sigma V^T, \qquad (3.13)$$
with $U \equiv [u_1 \cdots u_{|\mathcal{T}_{\text{PCA}}|}]$ and
$$\bar{w} := \frac{1}{|\mathcal{T}_{\text{PCA}}|} \sum_{n \in \mathcal{T}_{\text{PCA}}} \hat{w}^n. \qquad (3.14)$$
We define the global basis matrix $\Phi := \operatorname{diag}(\Phi_i)[u_1 \cdots u_{N_x}] \in \mathbb{R}^{N_{\text{el}} N_{\hat{w},\text{el}} \times N_x}$ with $N_x \leq \sum_i N_{x,i} \ll N_{\text{el}} N_{\hat{w},\text{el}} \ll N_{\text{el}} N_{w,\text{el}}$ determined from an energy criterion, for example.

Now, we can define the reduced global state:
$$x^n = r(w^n), \quad n = 0, \dots, N_t, \qquad (3.15)$$
where $r : w \mapsto \Phi^T (\phi_{\text{global}}(w) - \bar{w})$. The mapping from the reduced state to the approximated global state is then $p : x \mapsto \psi_{\text{global}}(\Phi x + \bar{w})$, where
$$\psi_{\text{global}} : \hat{w} \mapsto \operatorname{vec}(\psi(\hat{w}_1), \dots, \psi(\hat{w}_{N_{\text{el}}})) : \mathbb{R}^{N_{\text{el}} N_{\hat{w},\text{el}}} \to \mathbb{R}^{N_w}. \qquad (3.16)$$
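A minimal NumPy sketch of the two-level (piecewise) PCA of Eqs. (3.8)–(3.13) follows; it assumes the encodings fit in memory as a matrix with one column per snapshot, uses an arbitrary block count, and omits the energy-criterion logic for choosing the truncation ranks.

```python
import numpy as np

def piecewise_pca(W_hat, n_local, n_global):
    """Two-level PCA of Eqs. (3.8)-(3.13).

    W_hat : (d, n_snap) matrix of global encodings, split into equal blocks.
    Returns per-block bases Phi_i, block means, and the global basis U.
    """
    blocks = np.array_split(W_hat, 4, axis=0)   # n_w-hat = 4 blocks (illustrative)
    bases, means, reduced = [], [], []
    for B in blocks:
        mu = B.mean(axis=1, keepdims=True)                     # sample mean (3.9)
        U, S, Vt = np.linalg.svd(B - mu, full_matrices=False)  # block SVD (3.8)
        Phi_i = U[:, :n_local]                                 # truncation (3.11)
        bases.append(Phi_i)
        means.append(mu)
        reduced.append(Phi_i.T @ (B - mu))      # block coordinates
    # second-level SVD on the stacked block coordinates, Eq. (3.13)
    U, S, Vt = np.linalg.svd(np.vstack(reduced), full_matrices=False)
    return bases, means, U[:, :n_global]
```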
4. Dynamics learning using regression
This section describes dynamics learning. In particular, we assume that the sequence of low-dimensional states $\{x^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_x}$ with $x^n$ defined in Eq. (3.15) can be recovered from the Markovian discrete-time dynamical system
$$x^{n+1} = f(x^n), \quad n = 0, \dots, N_t - 1 \qquad (4.1)$$
for some (unknown) discrete-time velocity $f : \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$. Then, if such a velocity exists and can be computed, the entire sequence of low-dimensional states $\{x^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_x}$ can be recovered from Eq. (4.1) by specifying the initial state $x^0$ only; this allows an approximation to the sequence of states to be computed as $\{w^n\}_{n=1}^{N_t} \approx \{p(x^n)\}_{n=1}^{N_t}$ (note that $w^0$ is given).

Because we do not have direct access to this discrete-time velocity $f$, we must generate an approximation $\tilde{f} \approx f$ with $\tilde{f} : \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$. This work constructs this approximation by regressing each component of the velocity. Once the approximated velocity $\tilde{f}$ is constructed, the sequence of low-dimensional states $\{x^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_x}$ can be approximated as $\{\tilde{x}^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_x}$, where $\tilde{x}^0 = x^0$ and $\tilde{x}^n$, $n = 1, \dots, N_t$ satisfies the approximated discrete-time dynamics
$$\tilde{x}^{n+1} = \tilde{f}(\tilde{x}^n), \quad n = 0, \dots, N_t - 1. \qquad (4.2)$$
Then, the desired sequence of states $\{w^n\}_{n=0}^{N_t}$ can be approximated as $\{\tilde{w}^n\}_{n=0}^{N_t}$, where $\tilde{w}^0 = w^0$ and $\tilde{w}^n = p(\tilde{x}^n)$, $n = 1, \dots, N_t$.

4.1. Regression setting

To cast this as a regression problem, we consider the $i$th equation of (4.1):
$$x_i^{n+1} = f_i(x^n), \quad n = 0, \dots, N_t - 1, \quad i = 1, \dots, N_x, \qquad (4.3)$$
where $x \equiv [x_1 \cdots x_{N_x}]^T$ and $f \equiv [f_1 \cdots f_{N_x}]^T$. This expression shows that the $i$th component of the velocity $f_i$ can be interpreted as a function that maps the reduced state at the current time step $x^n$ to the $i$th element of the state at the next time step $x_i^{n+1}$ for $n = 0, \dots, N_t - 1$.

Based on this observation, we construct independent regression models $\tilde{f}_i \approx f_i$, $i = 1, \dots, N_x$ whose features (i.e., regression-model inputs) correspond to the state at the current time step $x^n$, and whose response (i.e., regression-model output) corresponds to the $i$th element of the state at the next time step $x_i^{n+1}$ for $n = 0, \dots, N_t - 1$. The approximated velocity is then $\tilde{f} \equiv [\tilde{f}_1 \cdots \tilde{f}_{N_x}]^T$. Section 4.2 describes the training data used to construct the regression models, and Section 4.3 describes the proposed regression models.

4.2. Training data

Based on the regression setting described in Section 4.1, it is clear that the appropriate training data for constructing the $i$th regression model $\tilde{f}_i$ corresponds to the response–feature pairs $\mathcal{T}_{\text{train},i} := \{(x_i^{n+1}, x^n)\}_{n \in \mathcal{T}_{\text{reg}}}$, where $\mathcal{T}_{\text{reg}} \equiv \{n_{\text{reg},j}\}_j \subset \mathcal{T}_{\text{sample}}$ satisfies $n_{\text{reg},j} + 1 \in \mathcal{T}_{\text{sample}}$, $j = 1, \dots, n_{\text{train}}$, with $n_{\text{train}} := |\mathcal{T}_{\text{reg}}| = |\mathcal{T}_{\text{train},i}|$, $i = 1, \dots, N_x$, such that sequential states (required for accessing both the features and the response) are accessible from the available sample set $\mathcal{T}_{\text{sample}}$.

4.3. Regression models

In this work, we consider a wide range of model types for constructing the regression models $\tilde{f}_i$, $i = 1, \dots, N_x$. We denote a generic regression model as $\tilde{f}$ and its training data as
$$\mathcal{T}_{\text{train}} \equiv \{(\bar{y}_i, \bar{x}_i)\}_{i=1}^{n_{\text{train}}}. \qquad (4.4)$$
For all candidate models, we employ validation for model selection; this process chooses values of the hyperparameters characterizing each model.
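As a concrete illustration of the training data of Section 4.2, the following sketch assembles the response–feature pairs from stored reduced states; `x_of` is a hypothetical dictionary mapping a time index $n$ to the reduced state $x^n$.

```python
import numpy as np

def build_training_pairs(x_of, sample_indices):
    """Assemble features X (rows x^n) and responses Y (rows x^{n+1})
    for all n such that both n and n+1 lie in the sample set T_sample."""
    t_reg = [n for n in sample_indices if (n + 1) in sample_indices]
    X = np.stack([x_of[n] for n in t_reg])      # features: x^n
    Y = np.stack([x_of[n + 1] for n in t_reg])  # responses: x^{n+1}
    return X, Y  # column i of Y trains the i-th scalar model f~_i
```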
4.3.1. Support vector regression

Support vector regression (SVR) [42] employs a model
$$\tilde{f}(x; \theta) = \langle \theta, \psi(x) \rangle + b, \qquad (4.5)$$
where $\psi : \mathbb{R}^{N_x} \to \mathcal{H}$, $\theta \in \mathcal{H}$, and $\mathcal{H}$ is a (potentially unknown) feature space equipped with inner product $\langle \cdot, \cdot \rangle$. SVR aims to compute a 'flat' function (i.e., $\langle \theta, \theta \rangle$ small) that penalizes prediction errors that exceed a specified threshold $\epsilon$ (i.e., soft-margin loss). SVR uses slack variables $\xi$ and $\xi^\star$ to address deviations exceeding the margin $\epsilon$ and employs the box constraint $C$ to penalize these deviations, leading to the primal problem [9]
$$\begin{aligned} \underset{\bar{\theta}, b, \xi, \xi^\star}{\operatorname{minimize}} \quad & \tfrac{1}{2}\langle \bar{\theta}, \bar{\theta} \rangle + C \sum_{i=1}^{n_{\text{train}}} (\xi_i + \xi_i^\star), \\ \text{subject to} \quad & \bar{y}_i - \langle \bar{\theta}, \psi(\bar{x}_i) \rangle - b \leq \epsilon + \xi_i, \quad i = 1, \dots, n_{\text{train}}, \\ & \langle \bar{\theta}, \psi(\bar{x}_i) \rangle + b - \bar{y}_i \leq \epsilon + \xi_i^\star, \quad i = 1, \dots, n_{\text{train}}, \\ & \xi, \xi^\star \geq 0, \end{aligned} \qquad (4.6)$$
whose corresponding dual problem is
$$\begin{aligned} \underset{\alpha, \alpha^\star}{\operatorname{minimize}} \quad & \tfrac{1}{2} (\alpha - \alpha^\star)^T Q (\alpha - \alpha^\star) + \epsilon \mathbf{1}^T (\alpha + \alpha^\star) - [\bar{y}_1 \cdots \bar{y}_{n_{\text{train}}}]^T (\alpha - \alpha^\star), \\ \text{subject to} \quad & \mathbf{1}^T (\alpha - \alpha^\star) = 0, \\ & 0 \leq \alpha_i, \alpha_i^\star \leq C, \quad i = 1, \dots, n_{\text{train}}. \end{aligned}$$
Here, $Q_{ij} := K(\bar{x}_i, \bar{x}_j) := \langle \psi(\bar{x}_i), \psi(\bar{x}_j) \rangle$ and $\theta = \sum_{i=1}^{n_{\text{train}}} (\alpha_i - \alpha_i^\star) \psi(\bar{x}_i)$. The resulting model (4.5) can be equivalently expressed as
$$\tilde{f}(x; \theta) = \sum_{i=1}^{n_{\text{train}}} (\alpha_i - \alpha_i^\star) K(\bar{x}_i, x) + b. \qquad (4.7)$$
There exist many kernel functions $K(\bar{x}_i, \bar{x}_j)$ that correspond to an inner product in a feature space $\mathcal{H}$. We consider both a Gaussian radial-basis-function (RBF) kernel $K(\bar{x}_i, \bar{x}_j) = \exp(-\gamma \|\bar{x}_i - \bar{x}_j\|^2)$ and a polynomial kernel $K(\bar{x}_i, \bar{x}_j) = (1 + [\bar{x}_i]^T \bar{x}_j)^q$. When the RBF kernel is used, we refer to the method as SVRrbf; when using the polynomial kernel with $q = 2$ or $q = 3$, we refer to the method as SVR2 or SVR3, respectively. In the numerical experiments, we apply a box constraint $C = 1$ and select the sensitivity margin $\epsilon$ using a validation set.
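For illustration, scalar-valued kernel regressors of this form can be fit per component with scikit-learn; the hyperparameter values below are placeholders to be selected by validation, and the random arrays merely stand in for the training data of Section 4.2.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# X: (n_train, N_x) features x^n; Y: (n_train, N_x) responses x^{n+1}
X = np.random.randn(200, 10)
Y = np.random.randn(200, 10)

scaler = StandardScaler().fit(X)
X_std = scaler.transform(X)

# one independent scalar-valued model f~_i per component, as in Section 4.1
models = [SVR(kernel='rbf', C=1.0, epsilon=1e-3).fit(X_std, Y[:, i])
          for i in range(Y.shape[1])]

def f_tilde(x):
    """Evaluate the approximated velocity f~ componentwise."""
    x_std = scaler.transform(x.reshape(1, -1))
    return np.array([m.predict(x_std)[0] for m in models])
```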
4.3.2. Random forests

Random forests (RF) [5] use decision trees constructed by decomposing the feature space (in this case $\mathbb{R}^{N_x}$) along canonical directions in a way that sequentially minimizes the mean-squared prediction error over the training data. The prediction generated by a decision tree corresponds to the average value of the response over the training data that reside within the same feature-space region as the prediction point.

Decision trees are low bias and high variance; thus, random forests employ a variance-reduction method, namely bootstrap aggregating (i.e., bagging), to reduce the prediction variance. Here, $N_{\text{tree}}$ different data sets are generated by sampling the original training set $\mathcal{T}_{\text{train}}$ with replacement. The method then constructs a decision tree from each of these training sets, yielding $N_{\text{tree}}$ regression functions $\tilde{f}^i$, $i = 1, \dots, N_{\text{tree}}$. The ultimate regression function corresponds to the average prediction across the ensemble:
$$\tilde{f}(x) = \frac{1}{N_{\text{tree}}} \sum_{i=1}^{N_{\text{tree}}} \tilde{f}^i(x). \qquad (4.8)$$
To decorrelate the decision trees and further reduce the prediction variance, random forests introduce an additional source of randomness: when training each tree, a random subset of $N_{\text{split}} \ll N_x$ features is considered when performing the feature-space split at each node in the tree.

The hyperparameter we consider for this approach corresponds to the number of trees in the ensemble, $N_{\text{tree}}$. We set $N_{\text{split}} = N_x / 3$ for splitting during training.

4.3.3. Boosted decision trees

Boosted decision trees [13] combine weak learners (decision trees) into a single strong learner in an iterative fashion. The algorithm adds one tree at each stage. Let $F_m$ be the aggregate model at stage $m$, where $m = 1, \dots, N_{\text{tree}}$. The gradient-boosting algorithm improves upon $F_m$ by adding a weak learner $h_m$ to yield a more accurate model $F_{m+1}(x) = F_m(x) + \alpha_m h_m(x)$, where $\alpha_m$ is a constant.

The numerical experiments employ LSBoost, where the goal is to minimize the mean squared error $\sum_{i=1}^{n_{\text{train}}} |\tilde{f}(\bar{x}_i) - \bar{y}_i|^2$. Decision stumps are applied as the weak learners $h_m(x)$. Initially, $F_1 = h_1(x)$. At iteration $m$ (for $m = 1, \dots, N_{\text{tree}}$), compute the residuals $r_i = \bar{y}_i - F_{m-1}(\bar{x}_i)$, $i = 1, \dots, n_{\text{train}}$, and solve the optimization problem
$$(\alpha_m, h_m) = \underset{\alpha, h}{\arg\min} \sum_{i=1}^{n_{\text{train}}} |r_i - \alpha h(\bar{x}_i)|^2.$$
The final model is $\tilde{f}(x) = F_{N_{\text{tree}}}(x)$. The hyperparameter we consider for this approach corresponds to the number of trees $N_{\text{tree}}$ in the ensemble.

4.3.4. $k$-nearest neighbors

The $k$-nearest neighbors ($k$-NN) method [1] produces predictions corresponding to a weighted average of the responses associated with the $k$ nearest training points in feature space, i.e.,
$$\tilde{f}(x) = \sum_{i \in \mathcal{I}(k)} \tau(\bar{x}_i, x)\, \bar{y}_i.$$
Here, $\mathcal{I}(k) \subseteq \{1, \dots, n_{\text{train}}\}$ with $|\mathcal{I}(k)| = k\ (\leq n_{\text{train}})$ satisfies $\|x - \bar{x}_i\| \leq \|x - \bar{x}_j\|$ for all $i \in \mathcal{I}(k)$, $j \in \{1, \dots, n_{\text{train}}\} \setminus \mathcal{I}(k)$. In the numerical experiments, we consider only uniform weights $\tau(\bar{x}_j, x) := 1/k$. The hyperparameter we consider for this method corresponds to the number of nearest neighbors $k$.

4.3.5. Vectorial kernel orthogonal greedy algorithm (VKOGA)

In support vector machines, kernel-based interpolation is applied to scalar-valued functions (as in Eq. (4.7)). If each individual regression model $\tilde{f}_i$, $i = 1, \dots, N_x$ requires $n_{\text{train}}$ kernels to construct its approximation, the resulting regression model for the vector of responses $\tilde{f}$ would require $N_x n_{\text{train}}$ independent kernels. This is expensive for both training and evaluation when $N_x \gg 1$. To reduce the overall number of kernels, one can impose the restriction that a common subspace be used for every component of the vector-valued response. In particular, the vectorial kernel orthogonal greedy algorithm (VKOGA) [58, 59] yields a regression model of the form
$$\tilde{f}(x) = \sum_{i=1}^{n_{\text{VKOGA}}} \alpha_i K(\bar{x}_i, x), \qquad (4.9)$$
where $K : \mathbb{R}^{N_x} \times \mathbb{R}^{N_x} \to \mathbb{R}$ denotes a kernel function, $\bar{x}_i \in \mathbb{R}^{N_x}$, $i = 1, \dots, n_{\text{VKOGA}}$ denote the kernel centers, and $\alpha_i \in \mathbb{R}^{N_x}$, $i = 1, \dots, n_{\text{VKOGA}}$ denote vector-valued coefficients.

VKOGA first computes the kernel functions $K(\bar{x}_i, \cdot)$ by a greedy algorithm that determines kernel centers from $\Omega = \{\bar{x}_1, \dots, \bar{x}_{n_{\text{train}}}\}$. Initially, let $\Omega_0 = \emptyset$. At stage $m$, choose
$$x_m := \underset{x \in \Omega \setminus \Omega_{m-1}}{\arg\max} |\langle f, \phi_x^{m-1} \rangle|,$$
where $\phi_x^{m-1}$ denotes the orthogonal remainder of $K(x, \cdot)$ with respect to the reproducing kernel Hilbert space spanned by $\{K(x_1, \cdot), \dots, K(x_{m-1}, \cdot)\}$; then $\Omega_m = \Omega_{m-1} \cup \{x_m\}$. After the kernel centers have been computed, the coefficients $\alpha_i$, $i = 1, \dots, n_{\text{VKOGA}}$ are determined by a least-squares fit to the training data.

In the numerical experiments, we apply the Gaussian RBF kernel $K(\bar{x}_i, \bar{x}_j) = \exp(-\gamma \|\bar{x}_i - \bar{x}_j\|^2)$. The hyperparameter we consider for this method corresponds to the number of kernel functions $n_{\text{VKOGA}}$.

4.3.6. Sparse identification of nonlinear dynamics (SINDy)

The sparse identification of nonlinear dynamics (SINDy) method [6] is an application of linear regression to learning dynamics. Although SINDy was originally devised for continuous-time dynamics, it can be easily extended to the discrete-time case. In this context, it constructs a model
$$\tilde{f}(x; \theta) = \sum_{i=1}^{n_{\text{SINDy}}} g_i(x)\, \theta_i, \qquad (4.10)$$
where $g_i : \mathbb{R}^{N_x} \to \mathbb{R}$, $i = 1, \dots, n_{\text{SINDy}}$ denote a 'library' of prescribed basis functions. The least absolute shrinkage and selection operator (LASSO) [46] is used to determine a sparse set of coefficients $(\theta_1, \dots, \theta_{n_{\text{SINDy}}})$. Hyperparameters in this approach consist of the selected basis functions. For simplicity, we consider only linear and quadratic functions in the library, i.e., $g_i \in \{x_1, \dots, x_{N_x}, x_1 x_1, x_1 x_2, \dots, x_{N_x} x_{N_x}\}$.
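A discrete-time SINDy-style fit of Eq. (4.10) can be sketched with a quadratic feature library and LASSO; this is a minimal illustration, not the exact implementation used in the experiments.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# X: (n_train, N_x) states x^n; Y: (n_train, N_x) next states x^{n+1}
X = np.random.randn(200, 5)
Y = np.random.randn(200, 5)

# library of linear and quadratic monomials (no bias term), as in Section 4.3.6
library = PolynomialFeatures(degree=2, include_bias=False)
G = library.fit_transform(X)              # columns are g_i(x^n)

# LASSO promotes a sparse coefficient set theta for each velocity component
sindy = Lasso(alpha=1e-3, max_iter=50_000).fit(G, Y)

def f_tilde(x):
    """Evaluate the sparse polynomial velocity model."""
    return sindy.predict(library.transform(x.reshape(1, -1)))[0]
```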
4.3.7. Dynamic mode decomposition (DMD)

Dynamic mode decomposition (DMD) [41] computes the approximated velocity $\tilde{f}$ as the linear operator
$$\tilde{f} : x \mapsto [x^{n_{\text{reg},1}+1} \cdots x^{n_{\text{reg},n_{\text{train}}}+1}]\,[x^{n_{\text{reg},1}} \cdots x^{n_{\text{reg},n_{\text{train}}}}]^{+}\, x, \qquad (4.11)$$
where the superscript $+$ denotes the Moore–Penrose pseudoinverse. Critically, note that this approach ensures that $x^{n+1} = \tilde{f}(x^n)$, $n \in \mathcal{T}_{\text{reg}}$ if $N_x \geq n_{\text{train}}$.

4.4. Boundedness of the learned velocity

This section provides analysis related to the stability of the approximated discrete-time dynamical system (4.2) when the different regression methods described in Section 4.3 are employed to generate the approximated velocity $\tilde{f}$.

First, we note that a function $f : \mathcal{X} \to \mathbb{R}^{N_x}$ is bounded if the set of its values is bounded, i.e., there exists a real number $M$ such that $\|f(x)\| \leq M$ for all $x \in \mathcal{X}$. An important special case is a bounded sequence, where $\mathcal{X}$ is taken to be the set $\mathbb{N}$ of natural numbers: a sequence $(x^0, x^1, x^2, \dots)$ is bounded if there exists a real number $M$ such that $\|x^n\| \leq M$ for every natural number $n$. With $\tilde{x}^{n+1} = \tilde{f}(\tilde{x}^n)$, boundedness of $\tilde{f} : \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$ yields boundedness of the sequence $(\tilde{x}^0, \tilde{x}^1, \tilde{x}^2, \dots)$.

Lemma 4.1.
Let $I$ be an index set and let $\{\rho_i\}_{i \in I}$, with $\rho_i : \mathbb{R}^{N_x} \to \mathbb{R}$, be a set of scalar basis functions. Suppose $\tilde{f} \in \operatorname{span}(\{\rho_i\}_{i \in I})$, i.e., there exist $\alpha_i \in \mathbb{R}^{N_x}$, $i = 1, \dots, n$ (with $n$ finite) such that
$$\tilde{f}(x) = \sum_{i=1}^{n} \alpha_i \rho_i(x).$$
If each basis function $\rho_i$ is bounded, then $\tilde{f}$ is bounded.

Proof.
Let $\alpha = \max_{i=1}^{n} \|\alpha_i\|$. Since each $\rho_i$ is bounded, there exists $M_i$ such that $|\rho_i(x)| \leq M_i$ for all $x \in \mathbb{R}^{N_x}$; let $M = \max_{i=1}^{n} M_i$. Thus,
$$\|\tilde{f}(x)\| = \left\| \sum_{i=1}^{n} \alpha_i \rho_i(x) \right\| \leq \sum_{i=1}^{n} \|\alpha_i \rho_i(x)\| \leq \sum_{i=1}^{n} \|\alpha_i\| \, |\rho_i(x)| \leq n \alpha M.$$
The last expression provides an upper bound for $\|\tilde{f}(x)\|$. $\square$

Corollary 4.1.1.
If the approximated velocity $\tilde{f}$ is constructed using SVRrbf (as described in Section 4.3.1) or VKOGA (as described in Section 4.3.5), then the approximated velocity is bounded.

Proof.
Since a Gaussian kernel satisfies $K(\bar{x}_i, x) = \exp(-\gamma \|\bar{x}_i - x\|^2) \leq 1$ for all $x \in \mathbb{R}^{N_x}$, it is bounded. By the previous lemma, with $\rho_i(x) = K(\bar{x}_i, x)$, the mapping $\tilde{f}$ is also bounded. $\square$

By the previous lemma, as long as the basis functions are bounded, the function $\tilde{f}$ is also bounded. Thus, $k$-NN, random forests, and boosting also yield a bounded sequence $(\tilde{x}^0, \tilde{x}^1, \tilde{x}^2, \dots)$. In contrast, the basis functions associated with DMD are linear, the basis functions associated with SINDy are polynomial, and the basis functions associated with SVR with polynomial kernels are $K(\bar{x}_i, \bar{x}_j) = (1 + [\bar{x}_i]^T \bar{x}_j)^q$; none of these is bounded.
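For completeness, the linear DMD velocity of Eq. (4.11), whose unbounded (linear) basis the preceding discussion highlights, reduces to a single least-squares solve; the following NumPy sketch illustrates this.

```python
import numpy as np

def fit_dmd_operator(X, Y):
    """Fit the linear map A with Y ~ A X, i.e. f~(x) = A x, Eq. (4.11).

    X : (N_x, n_train) matrix whose columns are states x^n
    Y : (N_x, n_train) matrix whose columns are successors x^{n+1}
    """
    # A = Y X^+ computed via least squares on the transposed system
    A_T, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return A_T.T

# usage: propagate the reduced state with the (unbounded) linear model
# A = fit_dmd_operator(X, Y); x_next = A @ x
```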
5. Numerical experiments
To assess the proposed methodology, we consider a large-scale CFD problem with a high-order discontinuous Galerkin discretization.

Figure 1: The half-cylinder mesh characterized by 40 584 elements.
5.1. Test case

Our chosen test case is the flow over an extruded half-cylinder at Reynolds number $Re = 350$ and an effectively incompressible Mach number of $M = 0.2$. This case contains several complex flow features, including separated shear layers, turbulent transition, and a fully turbulent wake. When compared with the circular cylinder, the dynamics of the half-cylinder are noteworthy because the point of separation is essentially fixed. Flow over such a cylinder has been the subject of several numerical and experimental studies [33, 27, 40].

In the present study, the cylinder is taken to have a diameter $D$ along its major axis. The domain spans several diameters in the stream- and cross-wise directions and $[0, \pi D]$ in the span-wise direction, with the cylinder positioned such that its back surface is centered at the origin. The stream-wise and cross-wise dimensions are comparable to those used in studies of the circular cylinder [36, 60]. The domain is periodic in the span-wise direction. We remark here that, at this Reynolds number, the dynamics are strongly influenced by the span-wise extent of the domain; indeed, for sufficiently short extrusions, the wake is coherent. We apply a no-slip adiabatic wall boundary condition at the surface of the cylinder and Riemann-invariant boundary conditions at the far field. To non-dimensionalize the system, we take the cylinder diameter to be $D = 1$, the free-stream density to be one, and the free-stream pressure to be one. The free-stream velocity is thus $v_\infty = \sqrt{\gamma}\, M \simeq 0.24$, where $\gamma = 1.4$ is the ratio of specific heats.

We mesh the domain using quadratically curved hexahedral elements. In total the mesh has $N_{\text{el}} = 40\,584$ elements. An overview of the mesh can be seen in Figure 1. We solve the compressible Navier–Stokes equations on this mesh with viscosity fixed such that the Reynolds number based on the diameter of the cylinder is $Re = 350$. For the spatial discretization, we employ a high-order nodal DG scheme with third-order solution polynomials and Gauss–Legendre solution points. Due to the third-order solution polynomials, the local state within each element contains the density, $x$-momentum, $y$-momentum, $z$-momentum, and energy at 64 points, leading to $N_{w,\text{el}} = 5 \times 64 = 320$. Thus, the resulting dimension of the global state is $N_w \approx 1.3 \times 10^7$. This implies that saving a given solution snapshot to disk consumes roughly 100 MiB of storage using double-precision arithmetic. We calculate inviscid fluxes between elements using a Rusanov-type Riemann solver, and we take the Prandtl number to be $Pr = 0.72$. For time integration, we employ the explicit five-stage fourth-order RK45[2R+] scheme of Carpenter and Kennedy [23], which manages the local temporal error by utilizing a PI-type controller to adaptively modify the time step.

We start the simulation cold with a constant initial condition corresponding to the free-stream at $t = 0$ and terminate the simulation at $t = 600$; this corresponds to a large number of stream-wise passes over the half-cylinder. Our time grid of interest comprises $N_t = 8\,600$ time instances with uniform spacing $0.05$, i.e., $t \in \{170 + 0.05\,i\}_{i=1}^{N_t}$. Thus, the desired sequence of states $\{w^n\}_{n=0}^{N_t} \subset \mathbb{R}^{N_w}$ corresponds to the CFD solution at these points in time.

5.2. Dimensionality reduction

As described in Section 3, the first stage of our proposed methodology is dimensionality reduction, which proceeds in two steps: local compression using autoencoders, and global compression using PCA.
5.2.1. Local compression using autoencoders

Because convolutional neural networks were originally devised for image data, they require the local degrees of freedom (i.e., the five conserved variables) to be equispaced within each element. However, within the simulation, these variables are defined on the tensor product of a 4-node Gauss–Legendre quadrature rule in order to minimize aliasing errors when projecting the non-linear flux into the polynomial space of the solution. Thus, before applying the autoencoder, we transform the solution by interpolating it from these Gauss–Legendre points to a $4 \times 4 \times 4$ grid of equispaced points. The dimension of the network inputs is therefore $4 \times 4 \times 4 \times 5$, where the first three dimensions correspond to spatial dimensions and the last dimension corresponds to the flow quantities. We do not account for the fact that the mesh elements can vary in size, and hence the spacing between points is not uniform across all training examples. Since solution values at both sets of points define the same polynomial interpolant, the solution inside of each element is unchanged by this transformation.

For the autoencoder, we apply convolutional layers with ResNet skip connections, as they have previously been shown to achieve state-of-the-art performance on image classification and related tasks [14]. Between convolutional layers, we apply batch normalization [19] and ReLU activations [32]. The encoder network has $N_{\text{layers}} = 5$, where each layer contains parameters associated with a set of three-dimensional convolutional filters $\Theta_i$, with $f_i$ denoting the number of filters contained in layer $i$. There are 512, 128, 32, 16, and 8 filters in the five layers of the encoder. This architecture is representative of a standard autoencoder [16], where the dimensionality of the transformed input is initially large to allow for adequate feature extraction, and is subsequently decreased gradually through the remaining hidden layers of the encoder. While we found that this autoencoder architecture allowed for a high level of compression and reconstruction accuracy, it is possible that other autoencoder architectures may achieve satisfactory performance. In all layers, the size of the final output dimension is equal to $f_i$. In the first four layers of the encoder, the convolutions are performed with a stride of one, meaning that the first three dimensions of the layer output $h_i(y_{i-1}; \Theta_i)$ are equal in size to the first three dimensions of the layer input $y_{i-1}$. In the final layer of the encoder, the convolutions are performed with a stride of two, meaning that the first three dimensions of the layer output $h_5(y_4; \Theta_5)$ are half the size of the first three dimensions of the layer input $y_4$. Given all of these transformations, we have $p_0 = 320$, $p_1 = 32\,768$, $p_2 = 8\,192$, $p_3 = 2\,048$, $p_4 = 1\,024$, and $p_5 = 64$.

The output of the final convolutional layer is reshaped into a vector and subsequently mapped to the encoding $\hat{w}_i$ with an affine transformation. We set the reduced dimension of the local state vector (i.e., the code dimensionality) to $N_{\hat{w},\text{el}} = 24$. We then use an affine transformation and reshaping operation to create an input to the decoder that is the same size as the output of the encoder. We construct the decoder network to invert all operations performed by the encoder in order to obtain reconstructed solutions; this implies that $\bar{N}_{\text{layers}} = 5$, $\bar{p}_0 = 64$, $\bar{p}_1 = 1\,024$, $\bar{p}_2 = 2\,048$, $\bar{p}_3 = 8\,192$, $\bar{p}_4 = 32\,768$, and $\bar{p}_5 = 320$.
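The encoder just described can be sketched in PyTorch as follows; the filter counts, strides, and code dimension follow the text, while the kernel size and padding are assumptions, and the ResNet skip connections and batch normalization are omitted for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Five-layer 3-D convolutional encoder with filter counts 512, 128, 32,
    16, 8 as in the text; kernel size 3 and padding 1 are assumptions."""
    def __init__(self, code_dim=24):
        super().__init__()
        chans = [5, 512, 128, 32, 16]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU()]                     # stride 1: 4x4x4 preserved
        layers += [nn.Conv3d(16, 8, kernel_size=3, stride=2, padding=1),
                   nn.ReLU()]                         # stride 2: 4^3 -> 2^3
        self.conv = nn.Sequential(*layers)
        self.to_code = nn.Linear(8 * 2 * 2 * 2, code_dim)  # affine map to code

    def forward(self, w):                # w: (batch, 5, 4, 4, 4)
        z = self.conv(w).flatten(1)      # (batch, 64), i.e., p_5 = 64
        return self.to_code(z)           # (batch, 24), i.e., N_w-hat,el = 24

code = Encoder()(torch.randn(2, 5, 4, 4, 4))  # -> shape (2, 24)
```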
Figure 2 provides an illustration of the network architecture.

The training data for the autoencoder consists of 90 randomly selected values of a set of 100 time instances, such that $|\mathcal{T}_{\text{autoencoder}}| = 90$; the remaining 10 points of the set comprise validation data that is used to check for overfitting. To solve the minimization problem (3.5), we employ stochastic gradient descent with the Adam optimizer [24]. We apply $L^2$ regularization to the network weights:
$$\Omega(\theta, \bar{\theta}) = \lambda \left( \|\theta\|_2^2 + \|\bar{\theta}\|_2^2 \right), \qquad (5.1)$$
where $\lambda$ is a hyperparameter that controls the level of regularization. When an increase in the validation error is observed across training epochs, the learning rate is cut in half. Training is terminated once the loss on the validation set fails to decrease across three training epochs, which typically occurs after the learning rate has been decayed many times.

Figure 2: Illustration of the network architecture. In the encoder, the input is transformed by a series of convolutional layers with a decreasing number of filters at each layer. The code is obtained by reshaping the output of the final encoder layer into a vector and performing an affine transformation. The decoder outputs a reconstruction of the original input $w_i$ by performing the inverse of all operations performed by the encoder.
Figure 3: Decay of the relative singular values for the global-compression step. The global encoding is decomposed into components $\hat{w}_i \in \mathbb{R}^{25\,632}$, $i = 1, \dots, 38$; we truncate the component principal components such that $N_{x,i} = 500$, $i = 1, \dots, 38$.

5.2.2. Global compression using PCA

Figure 3 reports the decay of the first 500 singular values (i.e., the first 500 diagonal elements of $\Sigma$ in Eq. (3.13)). In this case, the first 100 principal components capture approximately 95% of the statistical energy (as measured by the sum of squares of all singular values) in the data, while the first 500 principal components capture approximately 98% of the statistical energy. We first set $N_x = 100$.

5.3. Dynamics learning

We now apply the approach described in Section 4 for dynamics learning. We employ a maximum set of training time instances $\mathcal{T}_{\text{reg}}$ with $|\mathcal{T}_{\text{reg}}| = 6\,450$. (Note that PCA is robust with respect to missing data points; thus, we expect the principal components computed from this set to be close to those computed on a smaller training set.) Before applying regression, we standardize the features by applying an affine transformation (i.e., we subtract the sample mean and divide by the sample standard deviation).

5.3.1. Hyperparameter selection

We now present validation results for the regression methods with tunable hyperparameters, i.e., SVR2, SVR3, SVRrbf, random forest, boosting, $k$-NN, and VKOGA. We set the validation set to be the time instances not included in the maximum training set.

Figure 4 reports these results, wherein the models are trained using the maximum training set $\mathcal{T}_{\text{reg}}$. Here, the relative mean-squared error (MSE) over a set of time indices $\mathcal{T} \subseteq \{0, \dots, N_t\}$ is defined as
$$\frac{\sum_{n \in \mathcal{T}} \|\tilde{f}(x^n) - x^{n+1}\|_2^2}{\sum_{n \in \mathcal{T}} \|x^{n+1}\|_2^2}. \qquad (5.2)$$
Based on these results, we select the following hyperparameters: the validation-optimal sensitivity margin $\epsilon$ for SVR2, SVR3, and SVRrbf; $N_{\text{tree}} = 30$ for random forest; $N_{\text{tree}} = 100$ for boosted decision trees; $k = 6$ for $k$-nearest neighbors; and 200 kernel functions for VKOGA.

Figure 4: Hyperparameter selection for the different regression methods: (a) SVR2, (b) SVR3, (c) SVRrbf, (d) random forest, (e) boosted decision trees, (f) $k$-nearest neighbors, (g) VKOGA. Reported errors correspond to the relative MSE values on the training set (blue) and the validation set (red). The training set corresponds to the maximum training set $\mathcal{T}_{\text{reg}}$, and the validation set corresponds to the remaining time instances.

5.3.2. Training and testing error

We now analyze the training and testing errors of the various regression methods, where the hyperparameters have been fixed according to the values selected using validation in Section 5.3.1. Figure 5 reports these results. Here, three different training sets are considered to construct the regression models: one with the maximum training set $\mathcal{T}_{\text{reg}}$ with $|\mathcal{T}_{\text{reg}}| = 6\,450$ (yellow bars), one with 75% of the maximum training data selected at random (red bars), and one with 50% of the maximum training data selected at random (blue bars). The reported errors correspond to the relative MSE as defined in Eq. (5.2).

Figure 5: Training and testing relative MSE errors for all regression methods. We consider models constructed using three training sets: one with the maximum training set $\mathcal{T}_{\text{reg}}$ with $|\mathcal{T}_{\text{reg}}| = 6\,450$ (yellow bars), one with 75% of the maximum training data selected at random (red bars), and one with 50% of the maximum training data selected at random (blue bars).
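For reference, the relative MSE of Eq. (5.2) is straightforward to compute; in the sketch below, `F_pred` and `X_next` are hypothetical arrays stacking the predictions $\tilde{f}(x^n)$ and targets $x^{n+1}$ row-wise.

```python
import numpy as np

def relative_mse(F_pred, X_next):
    """Relative MSE, Eq. (5.2): sum ||f~(x^n) - x^{n+1}||^2 / sum ||x^{n+1}||^2."""
    return np.sum((F_pred - X_next) ** 2) / np.sum(X_next ** 2)
```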
First, note that the training and testing errors are extremely similar; this suggests that the training data are indeed representative of the testing data, indicating that we have not overfit our models.

Second, note that even with the smallest training set, characterized by only 50% of the maximum number of training time instances, the training and test errors are quite similar to those achieved with the maximum training set. Indeed, the only regression methods where doubling the training set yields improvements are random forest and $k$-NN, and even in these cases the additional training data provide modest gains. In the case of SVR2, including additional training data actually degrades both the training and testing errors. This likely occurs because the SVR2 model is trained using the hinge loss (see Eq. (4.6)), which is different from the relative MSE loss reported here; the hinge loss does not penalize small errors, which do influence the relative MSE.

Third, note that the smallest relative MSE values on the test set are achieved by SVRrbf, VKOGA, and SINDy. We expect these techniques to perform best when attempting to recover the desired sequence of states by simulating the approximated discrete-time dynamics (4.2). However, as discussed in Section 4.4, only VKOGA and SVRrbf are bounded; SINDy admits unbounded dynamics.

We now turn to the final stage of the method: computing the sequence of low-dimensional approximated states $\{\tilde{x}^n\}_{n=0}^{N_t}$ according to the approximated discrete-time dynamics (4.2), and subsequently computing the approximation to the desired sequence of states, i.e., $\{\tilde{w}^n\}_{n=0}^{N_t}$, where $\tilde{w}^0 = w^0$ and $\tilde{w}^n = p(\tilde{x}^n)$, $n = 1, \dots, N_t$.

Figure 6 reports these results. For each regression model (constructed with the maximum training set $\mathcal{T}_{\text{reg}}$), this figure reports the time-instantaneous relative error $\|\tilde{x}^n - x^n\|_2 / \|x^n\|_2$ over all time instances, i.e., for $n = 0, \dots, N_t$. These results show that both VKOGA and SVRrbf yield accurate responses over time, as the time-instantaneous relative error is less than 5% and 3% for these models, respectively, at all time instances. These results are sensible, as their training and testing errors were also small, as reported in Figure 5. Somewhat surprisingly, note that SINDy—despite yielding small training and testing errors in Figure 5—produces an unstable response when integrated into the approximated discrete-time dynamics (4.2). This can be explained by the boundedness analysis in Section 4.4: the polynomial basis functions employed by the SINDy method are unbounded. Thus, this approach does not ensure a bounded sequence of approximated states and, in this case, produces unstable discrete-time dynamics as a result. This highlights that the stability properties of the ultimate dynamical system should play an important role in deciding on the appropriate regression model; low training and testing errors are not sufficient for an accurate model in this context.

Figure 6: The time-instantaneous relative error $\|\tilde{x}^n - x^n\|_2 / \|x^n\|_2$, $n = 1, \dots, N_t$ over time for each considered regression model. Regression models are trained using the maximum training set $\mathcal{T}_{\text{reg}}$.
The low-dimensional state dimension is N x = 100 .Note that both SVRrbf and VKOGA yield accurate responses, which are consistent with their small training and testing errors.In contrast, SINDy produces an unstable response despite its small training and testing errors; this is due to the fact that ityields an unbounded approximated velocity. We emphasize the promise of these results: the original sequence of data—wherein each vector hasdimension N w ≈ × —has been accurately approximated as dynamical system of dimension N x = 100 ,which corresponds to a compression ratio of
This implies that, with access only to the initial state $w^0$, the restriction operator $r$, and the approximated velocity $\tilde{f}$, the proposed method can recover the desired sequence of states to within 5% accuracy in the low-dimensional state.

We now consider increasing the dimension of the low-dimensional state to gauge the effect of this parameter on accuracy. To achieve this, after computing the principal components as described in Section 5.2.2, we now preserve the first $N_x = 500$ principal components.
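The truncation level can be tied directly to the captured statistical energy. Below is a minimal sketch, assuming the statistical energy is measured by cumulative squared singular values of the centered snapshot matrix of global encodings; the function name choose_num_components is hypothetical, and a direct SVD at this scale may require randomized or incremental variants in practice.

import numpy as np

def choose_num_components(encodings, energy=0.98):
    # encodings: (num_snapshots, global_encoding_dim) matrix whose rows are
    # the global vectors of element-local encodings.
    centered = encodings - encodings.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)   # singular values
    captured = np.cumsum(s**2) / np.sum(s**2)       # cumulative statistical energy
    return int(np.searchsorted(captured, energy) + 1)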
This yields a compression ratio of 26 000:1, wherein we have captured 98% of the statistical energy. Rather than repeat the experiment for all considered regression methods, we now consider only the VKOGA method, as it (along with SVRrbf) yielded the best results in the case of $N_x = 100$. Figure 7 reports these results; comparing this figure with Figure 6 shows that increasing the dimension of the low-dimensional state from $N_x = 100$ to $N_x = 500$ (and thus increasing the captured statistical energy in the global PCA step from 95% to 98%) has reduced the relative error by nearly an order of magnitude, such that the response associated with $N_x = 500$ exhibits small time-instantaneous relative errors over all time.

Figure 7: The time-instantaneous relative error $\|\tilde{x}^n - x^n\| / \|x^n\|$, $n = 1, \ldots, N_t$, for VKOGA trained using the maximum training set $T_{\text{reg}}$. We now employ a low-dimensional state dimension of $N_x = 500$. Note that the errors have decreased by roughly one order of magnitude relative to the case of $N_x = 100$ (compare with Figure 6).
Figure 8: The time-instantaneous relative error for the autoencoder alone, $\|w^n - \psi_{\text{global}} \circ \phi_{\text{global}}(w^n)\| / \|w^n\|$, $n = 1, \ldots, N_t$, as well as for the approximated solutions, $\|w^n - \tilde{w}^n\| / \|w^n\|$, $n = 1, \ldots, N_t$. The $L$ norm is used because numerical roundoff errors cause spikes at time instances when $\|w^n\|$ is small.

The approximated solutions $\tilde{w}^n$ are obtained once the reduced state values are passed through the prolongation operator $p$. The prediction error across all time instances for the approximated solutions $\tilde{w}^n$ can be found in Figure 8. To put these results in context, we compare against the error obtained by passing each global state through the encoder and decoder without performing the PCA and dynamics-learning procedures. We note that compressing and reconstructing the global states through the autoencoder introduces very little error across all time instances. More critically, we can see that the relative error for the approximated states is nearly identical to the error from the autoencoder reconstructions, meaning that very little error is introduced by the PCA and dynamics-learning steps.

Beyond solely considering the global error in reconstructions, we are also interested in whether local properties of the fluid flow are preserved during the various stages of dimensionality reduction, dynamics propagation, and reconstruction. One manner of determining whether flow properties are preserved is to consider the aggregate lift and drag forces that act on the surface of the half-cylinder over time. By extracting pressure values on the surface of the cylinder, we can resolve the distribution of forces acting on the cylinder in the downstream and cross-stream directions; integrating over the surface of the half-cylinder, we obtain the lift and drag coefficients for the cylinder (a sketch of this computation appears after this paragraph). The left column of Figure 9 reports these lift and drag forces across all time instances for the CFD solutions $w^n$, autoencoder reconstructions $\psi_{\text{global}} \circ \phi_{\text{global}}(w^n)$, and approximated solutions $\tilde{w}^n$, $n = 1, \ldots, N_t$. The right column of Figure 9 reports the error in lift and drag forces associated with the reconstructed solutions generated by the autoencoder and the approximated solutions; the absolute error is provided for lift rather than the relative error because some lift values are close to zero. From these results, we first note that the error in lift and drag introduced by encoding and decoding solutions with the autoencoder is quite low across all time steps for both quantities. Furthermore, we note that the principal component analysis and dynamics-learning procedures introduce very little additional error in these quantities beyond what is introduced by the autoencoder.
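A minimal sketch of this force aggregation is given below. It assumes pressure-dominated loads on a two-dimensional surface discretization and neglects viscous stresses; the function name and arguments are illustrative, not the paper's post-processing code.

import numpy as np

def lift_drag_coefficients(p_surf, normals, areas, rho_inf, u_inf, chord):
    # p_surf  : pressure at each surface element, shape (n_faces,)
    # normals : outward unit normals, shape (n_faces, 2);
    #           x = downstream, y = cross-stream
    # areas   : element areas (lengths per unit span), shape (n_faces,)
    forces = -(p_surf * areas)[:, None] * normals   # pressure acts along the inward normal
    drag, lift = forces.sum(axis=0)
    q_inf = 0.5 * rho_inf * u_inf**2                # freestream dynamic pressure
    return lift / (q_inf * chord), drag / (q_inf * chord)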
We now provide several visualizations that lend qualitative insight into the performance of the method. Figure 10 compares instantaneous iso-surfaces of Q-criterion colored by velocity magnitude for the CFD solution, the autoencoder reconstruction, and the approximated solution. The figure shows that, for all considered cases, the autoencoder reconstruction retains all of the key vortical structures of the CFD solution, albeit with some noise in the iso-surfaces. We also see that there is no distinguishable difference between the autoencoder reconstruction and the approximated solution, which further supports our assertion that the principal component analysis and dynamics-learning procedures introduce very little additional error.
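For reference, the Q-criterion shown in Figure 10 can be computed from the velocity-gradient tensor as $Q = \frac{1}{2}(\|\Omega\|^2 - \|S\|^2)$, where $S$ and $\Omega$ are the symmetric and antisymmetric parts of $\nabla u$; iso-surfaces of $Q > 0$ mark rotation-dominated regions. A minimal numpy sketch (not the paper's visualization pipeline):

import numpy as np

def q_criterion(grad_u):
    # grad_u: array of shape (..., 3, 3) with grad_u[..., i, j] = du_i/dx_j.
    S = 0.5 * (grad_u + np.swapaxes(grad_u, -1, -2))      # strain-rate tensor
    Omega = 0.5 * (grad_u - np.swapaxes(grad_u, -1, -2))  # rotation tensor
    return 0.5 * (np.sum(Omega * Omega, axis=(-2, -1))
                  - np.sum(S * S, axis=(-2, -1)))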
6. Conclusions
This article presented a novel methodology for recovering missing CFD data given that the solution has been saved to disk at only a relatively small number of time instances.

The first stage of the methodology, hierarchical dimensionality reduction, comprises a novel two-step technique tailored for high-order discretizations. In the first step, we apply autoencoders for local compression; this step computes a low-dimensional, nonlinear embedding of the degrees of freedom within each element of the high-order discretization. In the numerical experiments, we employed a convolutional neural network for this purpose. Results showed that the autoencoder reduced the dimensionality of the element-local state from $N_{w,\text{el}} = 320$ to $N_{\hat{w},\text{el}} = 24$ without incurring significant errors (see Figures 9 and 10).
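As an illustration of this local-compression step, the sketch below maps 320 element-local degrees of freedom to a 24-dimensional encoding and back. The channel layout (five conserved variables at 64 solution points) and all layer sizes are assumptions made for illustration; the paper's actual convolutional architecture is specified in its experiments section.

import torch.nn as nn

class ElementAutoencoder(nn.Module):
    # Hypothetical layout: 5 variables (channels) x 64 solution points = 320 DOFs.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(5, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv1d(16, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Flatten(),                                                      # 8 * 16 = 128
            nn.Linear(128, 24))                                                # 24-dim encoding
        self.decoder = nn.Sequential(
            nn.Linear(24, 128), nn.ReLU(),
            nn.Unflatten(1, (8, 16)),
            nn.ConvTranspose1d(8, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose1d(16, 5, kernel_size=4, stride=2, padding=1))             # 32 -> 64

    def forward(self, w_el):        # w_el: (batch, 5, 64)
        return self.decoder(self.encoder(w_el))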
In the second dimensionality-reduction step, we apply principal component analysis to compress the global vector of encodings. Results showed that this second step reduced the number of global degrees of freedom from $N_{\hat{w},\text{el}} N_{\text{el}} = 24 \times 40\,584 \approx 10^6$ to only 500, constituting an overall compression ratio of
26 000:1 while retaining very high levels of accuracy.

The second stage of the methodology, dynamics learning, applied regression methods from machine learning to learn the discrete-time dynamics of the low-dimensional state. We considered a wide range of regression methods for this purpose, and found that support vector regression with a radial-basis-function kernel (SVRrbf) and the vectorial kernel orthogonal greedy algorithm (VKOGA) yielded the best performance (see Figure 6).
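As an illustration of this stage, such a regressor can be fit with off-the-shelf tools. The sketch below uses scikit-learn's SVR with an RBF kernel, wrapped per output dimension, as a stand-in for the paper's SVRrbf implementation; pairing inputs $x^n$ with targets $x^{n+1} - x^n$ assumes the discrete-time-velocity form noted earlier, and the hyperparameter values are placeholders.

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def fit_velocity_model(X):
    # X: array of shape (N_t + 1, N_x) holding the low-dimensional states x^0, ..., x^{N_t}.
    inputs, targets = X[:-1], X[1:] - X[:-1]    # training pairs (x^n, x^{n+1} - x^n)
    model = MultiOutputRegressor(SVR(kernel="rbf", epsilon=1e-3))
    return model.fit(inputs, targets)           # model.predict(x) approximates f_tilde(x)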
Although the sparse identification of nonlinear dynamics (SINDy) yielded low regression errors on an independent test set (see Figure 5), it did not produce an accurate solution approximation due to its unboundedness; indeed, the resulting trajectory was unstable (see Figure 6).

Ultimately, applying the proposed methodology with VKOGA and a low-dimensional state dimension of 500 satisfied the objective of this work, as it enabled the original CFD data to be reconstructed with extremely high levels of accuracy, yielding low errors in the state, lift, and drag, as well as solution fields that match well visually (see Figures 8, 9, and 10).

Future work involves introducing parameterization into the problem setting such that the reconstruction task can be performed across multiple CFD simulations characterized by different parameters, e.g., different boundary conditions, geometric shapes, and operating conditions.

Figure 9: Lift and drag forces on the surface of the cylinder across all time instances for the CFD solutions $w^n$, the autoencoder reconstructions $\psi_{\text{global}} \circ \phi_{\text{global}}(w^n)$, and the approximated solutions $\tilde{w}^n$, plotted against the time index $n$: (a) lift predictions; (b) lift prediction error; (c) drag predictions; (d) drag prediction error. The left column shows lift and drag predictions, while the right column presents the corresponding errors. Absolute error is used for lift rather than relative error to avoid numerical issues when lift values are close to zero.

Figure 10: Instantaneous iso-surfaces of Q-criterion colored by velocity magnitude at several time instances including $n = 1\,000$: (a) CFD solution $w^n$; (b) autoencoder reconstruction $\psi_{\text{global}} \circ \phi_{\text{global}}(w^n)$; (c) approximated solution $\tilde{w}^n$.

Acknowledgments
K. Carlberg’s and L. Peng’s research was sponsored by Sandia’s Advanced Simulation and Computing (ASC)Verification and Validation (V&V) Project/Task
References

[1] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, 46 (1992), pp. 175–185.
[2] J.-M. Beckers and M. Rixen, EOF calculations and data filling from incomplete oceanographic datasets, Journal of Atmospheric and Oceanic Technology, 20 (2003), pp. 1839–1856.
[3] W. Böhmer, J. T. Springenberg, J. Boedecker, M. Riedmiller, and K. Obermayer, Autonomous learning of state representations for control: An emerging field aims to autonomously learn state representations for reinforcement learning agents from their real-world sensor observations, KI-Künstliche Intelligenz, 29 (2015), pp. 353–362.
[4] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60 (2018), pp. 223–311.
[5] L. Breiman, Random forests, Machine Learning, 45 (2001), pp. 5–32.
[6] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, (2016), p. 201517384.
[7] T. Bui-Thanh, M. Damodaran, and K. Willcox, Aerodynamic data reconstruction and inverse design using proper orthogonal decomposition, AIAA Journal, 42 (2004), pp. 1505–1516.
[8] K. Carlberg, M. Barone, and H. Antil, Galerkin v. least-squares Petrov–Galerkin projection in nonlinear model reduction, Journal of Computational Physics, 330 (2017), pp. 693–734.
[9] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2 (2011).
[10] B. Cockburn and C.-W. Shu, The Runge–Kutta local projection P1-discontinuous-Galerkin finite element method for scalar conservation laws, ESAIM: Mathematical Modelling and Numerical Analysis, 25 (1991), pp. 337–361.
[11] R. Goroshin, M. F. Mathieu, and Y. LeCun, Learning to linearize under uncertainty, in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds., Curran Associates, Inc., 2015, pp. 1234–1242.
[12] H. Gunes, S. Sirisup, and G. E. Karniadakis, Gappy data: To Krig or not to Krig?, Journal of Computational Physics, 212 (2006), pp. 358–382.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, Springer, 2009.
[14] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] J. S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, vol. 54, Springer-Verlag, New York, 2008.
[16] G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313 (2006), pp. 504–507.
[17] P. Holmes, J. Lumley, and G. Berkooz, Turbulence, Coherent Structures, Dynamical Systems and Symmetry, Cambridge University Press, 1996.
[18] H. T. Huynh, A flux reconstruction approach to high-order schemes including discontinuous Galerkin methods, in 18th AIAA Computational Fluid Dynamics Conference, 2007, p. 4079.
[19] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, 2015.
[20] H. Kang, D. Lee, and D. Lee, A study on CFD data compression using hybrid supercompact wavelets, KSME International Journal, 17 (2003), pp. 1784–1792.
[21] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, Deep variational Bayes filters: Unsupervised learning of state space models from raw data, arXiv preprint arXiv:1605.06432, 2016.
[22] Y. Kawahara, Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis, in Advances in Neural Information Processing Systems, 2016, pp. 911–919.
[23] C. A. Kennedy, M. H. Carpenter, and R. M. Lewis, Low-storage, explicit Runge–Kutta schemes for the compressible Navier–Stokes equations, Applied Numerical Mathematics, 35 (1999), pp. 177–219.
[24] D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[25] D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114, 2013.
[26] D. A. Kopriva, A staggered-grid multidomain spectral method for the compressible Navier–Stokes equations, Journal of Computational Physics, 143 (1998), pp. 125–158.
[27] S. Kumarasamy and J. B. Barlow, Computation of unsteady flow over a half-cylinder close to a moving wall, Journal of Wind Engineering and Industrial Aerodynamics, 69 (1997), pp. 239–248.
[28] T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat, State representation learning for control: An overview, Neural Networks, (2018).
[29] Y. Liu, M. Vinokur, and Z. Wang, Spectral difference method for unstructured grids I: Basic formulation, Journal of Computational Physics, 216 (2006), pp. 780–801.
[30] B. Lusch, J. N. Kutz, and S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics, arXiv preprint arXiv:1712.09707, 2017.
[31] J. Morton, F. D. Witherden, A. Jameson, and M. J. Kochenderfer, Deep dynamical modeling and control of unsteady fluid flows, 2018.
[32] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[33] Y. Nakamura, Vortex shedding from bluff bodies with splitter plates, Journal of Fluids and Structures, 10 (1996), pp. 147–158.
[34] S. E. Otto and C. W. Rowley, Linearly-recurrent autoencoder networks for learning dynamics, arXiv preprint arXiv:1712.01378, 2017.
[35] J. Park, F. Witherden, and P. Vincent, High-order implicit large-eddy simulations of flow over a NACA0021 aerofoil, AIAA Journal, (2017), pp. 2186–2197.
[36] P. Parnaudeau, J. Carlier, D. Heitz, and E. Lamballais, Experimental and numerical studies of the flow over a circular cylinder at Reynolds number 3900, Physics of Fluids, 20 (2008), p. 085101.
[37] S. C. Puligilla and B. Jayaraman, Deep multilayer convolution frameworks for data-driven learning of fluid flow dynamics, in 2018 Fluid Dynamics Conference, 2018, p. 3091.
[38] W. H. Reed and T. R. Hill, Triangular mesh methods for the neutron transport equation, Technical Report LA-UR-73-479, Los Alamos Scientific Laboratory, 1973.
[39] R. Sakai, D. Sasaki, and K. Nakahashi, Parallel implementation of large-scale CFD data compression toward aeroacoustic analysis, Computers & Fluids, 80 (2013), pp. 116–127.
[40] A. Santa Cruz, L. David, J. Pecheux, and A. Texier, Characterization by proper-orthogonal-decomposition of the passive controlled wake flow downstream of a half cylinder, Experiments in Fluids, 39 (2005), pp. 730–742.
[41] P. J. Schmid, Dynamic mode decomposition of numerical and experimental data, Journal of Fluid Mechanics, 656 (2010), pp. 5–28.
[42] A. J. Smola and B. Schölkopf, A tutorial on support vector regression, Statistics and Computing, 14 (2004), pp. 199–222.
[43] Y. Sun, Z. J. Wang, and Y. Liu, High-order multidomain spectral difference method for the Navier–Stokes equations on unstructured hexahedral grids, Communications in Computational Physics, 2 (2007), pp. 310–333.
[44] N. Takeishi, Y. Kawahara, and T. Yairi, Learning Koopman invariant subspaces for dynamic mode decomposition, in Advances in Neural Information Processing Systems, 2017, pp. 1130–1140.
[45] L. Theis, W. Shi, A. Cunningham, and F. Huszár, Lossy image compression with compressive autoencoders, in International Conference on Learning Representations, 2017.
[46] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B (Methodological), (1996), pp. 267–288.
[47] A. Trott, R. Moorhead, and J. McGinley, Wavelets applied to lossless compression and progressive transmission of floating point data in 3-D curvilinear grids, in Proceedings of the 7th Conference on Visualization '96, IEEE Computer Society Press, 1996, pp. 385–ff.
[48] D. Venturi and G. E. Karniadakis, Gappy data and reconstruction procedures for flow past a cylinder, Journal of Fluid Mechanics, 519 (2004), pp. 315–336.
[49] B. C. Vermeire, F. D. Witherden, and P. E. Vincent, On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools, Journal of Computational Physics, 334 (2017), pp. 497–521.
[50] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research, 11 (2010), pp. 3371–3408.
[51] P. Vincent, F. Witherden, B. Vermeire, J. S. Park, and A. Iyer, Towards green aviation with Python at petascale, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE Press, 2016, p. 1.
[52] P. E. Vincent and A. Jameson, Facilitating the adoption of unstructured high-order methods amongst a wider community of fluid dynamicists, Mathematical Modelling of Natural Phenomena, 6 (2011), pp. 97–140.
[53] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, Embed to control: A locally linear latent dynamics model for control from raw images, in Advances in Neural Information Processing Systems, 2015, pp. 2746–2754.
[54] S. Wiewel, M. Becher, and N. Thuerey, Latent-space physics: Towards learning the temporal evolution of fluid flow, arXiv preprint arXiv:1802.10123, 2018.
[55] K. Willcox, Unsteady flow sensing and estimation via the gappy proper orthogonal decomposition, Computers & Fluids, 35 (2006), pp. 208–226.
[56] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition, Journal of Nonlinear Science, 25 (2015), pp. 1307–1346.
[57] M. O. Williams, C. W. Rowley, and I. G. Kevrekidis, A kernel-based approach to data-driven Koopman spectral analysis, arXiv preprint arXiv:1411.2260, 2014.
[58] D. Wirtz and B. Haasdonk, A vectorial kernel orthogonal greedy algorithm, Dolomites Research Notes on Approximation, 6 (2013).
[59] D. Wirtz, N. Karajan, and B. Haasdonk, Surrogate modeling of multiscale models using kernel methods, International Journal for Numerical Methods in Engineering, 101 (2015), pp. 1–28.
[60] F. D. Witherden, B. C. Vermeire, and P. E. Vincent, Heterogeneous computing on mixed unstructured grids with PyFR, Computers & Fluids, 120 (2015), pp. 173–186.
[61]