[PDF] Pressio: Enabling projection-based model reduction for large-scale nonlinear dynamical systems

Abstract

This work introduces Pressio, an open-source project aimed at enabling leading-edge projection-based reduced order models (ROMs) for large-scale nonlinear dynamical systems in science and engineering. Pressio provides model-reduction methods that can reduce both the number of spatial and temporal degrees of freedom for any dynamical system expressible as a system of parameterized ordinary differential equations (ODEs). We leverage this simple, expressive mathematical framework as a pivotal design choice to enable a minimal application programming interface (API) that is natural to dynamical systems. The core component of Pressio is a C++11 header-only library that leverages generic programming to support applications with arbitrary data types and arbitrarily complex programming models. This is complemented with Python bindings to expose these C++ functionalities to Python users with negligible overhead and no user-required binding code. We discuss the distinguishing characteristics of Pressio relative to existing model-reduction libraries, outline its key design features, describe how the user interacts with it, and present two test cases---including one with over 20 million degrees of freedom---that highlight the performance results of Pressio and illustrate the breath of problems that can be addressed with it.

Full PDF

PPRESSIO: ENABLING PROJECTION-BASED MODEL REDUCTIONFOR LARGE-SCALE NONLINEAR DYNAMICAL SYSTEMS ∗ FRANCESCO RIZZI † , PATRICK J. BLONIGAN ‡ , AND

KEVIN T. CARLBERG ‡§ Abstract.

This work introduces

Pressio , an open-source project aimed at enabling leading-edge projection-based reduced order models (ROMs) for large-scale nonlinear dynamical systems inscience and engineering.

Pressio provides model-reduction methods that can reduce both the num-ber of spatial and temporal degrees of freedom for any dynamical system expressible as a system ofparameterized ordinary diﬀerential equations (ODEs). We leverage this simple, expressive mathe-matical framework as a pivotal design choice to enable a minimal application programming interface(API) that is natural to dynamical systems. The core component of

Pressio is a C++11 header-onlylibrary that leverages generic programming to support applications with arbitrary data types andarbitrarily complex programming models. This is complemented with Python bindings to exposethese C++ functionalities to Python users with negligible overhead and no user-required bindingcode. We discuss the distinguishing characteristics of

Pressio relative to existing model-reductionlibraries, outline its key design features, describe how the user interacts with it, and present two testcases—including one with over 20 million degrees of freedom—that highlight the performance resultsof

Pressio and illustrate the breath of problems that can be addressed with it.

Key words. projection-based model reduction, Galerkin, LSPG, POD, SVD, sample mesh,hyper-reduction, scientiﬁc computing, object-oriented programming, generic programming, policy-based design, design by introspection, template metaprogramming, HPC, Kokkos, Trilinos, Python.

AMS subject classiﬁcations.

1. Introduction.

Computational science and engineering has become a funda-mental driver of the technological advancement of modern society, underpinning appli-cations such as scientiﬁc discovery, engineering design, vehicle control, infrastructuremonitoring, and risk assessment of engineered systems. Computational science andengineering and high-performance computing have a symbiotic relationship: the con-stant improvement and availability of computing power drives the ambition to modelincreasingly complex systems with increasing levels of ﬁdelity, which, in turn, moti-vates the pursuit of more powerful computing platforms.The present “extreme-scale” computing era—often referred to as the dawn ofexascale computing—is one of unprecedented access to computational resources. Inprinciple, this enables scientists and engineers to tackle very complex problems thatare time critical or many query in nature. Time-critical problems are those con-strained by a limited wall-clock time, and include fast-turnaround design, rapid pathplanning, and model predictive control, which require simulations to execute on the or-der of hours, minutes, or milliseconds, respectively. In contrast, many-query problemsare characterized by a limited budget of computational core-hours. Many-query prob-lems pervade uncertainty quantiﬁcation (UQ), as such applications typically requirethousands of model evaluations to characterize adequately the eﬀect of uncertaintiesin parameters and operating conditions on the system response. In the case of bothtime-critical and many-query problems, if the system of interest is computationally ∗ Submitted to the editors February 6th, 2020. † NexGen Analytics, 30N Gould St. Ste 5912, Sheridan, WY, 82801, USA ([email protected], [email protected]). ‡ Department of Extreme-Scale Data Science and Analytics, Sandia National Laboratories, Liver-more, CA, 94550, USA ([email protected], [email protected]). § Departments of Applied Mathematics and Mechanical Engineering, University of Washington,Seattle, WA 98195, USA 1 a r X i v : . [ c s . M S ] A p r F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG expensive to query—as in the case of high-ﬁdelity models for which a single run canconsume days or weeks on a supercomputer—the constraint on wall-clock time orcore-hours can be impossible to satisfy. In such scenarios, analysts often turn to sur-rogate models, which replace the high-ﬁdelity model with a lower-cost, lower-ﬁdelitycounterpart that can be employed within the application to make satisfaction of thewall-clock-time or core-hour constraint feasible. In this work, we consider the high-ﬁdelity model of interest to be a nonlinear dynamical system expressible a system ofparameterized ordinary diﬀerential equations (ODEs).Broadly speaking, surrogate models can be classiﬁed under three categories,namely (a) data ﬁts , which construct an explicit mapping (e.g., using polynomi-als, Gaussian processes) from the system’s parameters (i.e., inputs) to the systemresponse of interest (i.e., outputs), (b) lower-ﬁdelity models , which simplify the high-ﬁdelity model (e.g., by coarsening the mesh, employing a lower ﬁnite-element order,or neglecting physics), and (c) projection-based reduced-order models (ROMs) , whichreduce the number of degrees of freedom in the high-ﬁdelity model though a pro-jection process. In general, data ﬁts and lower-ﬁdelity models are widely consideredmuch easier to implement than ROMs, as they generally do not require any modiﬁca-tions to the underlying simulation code; however, because ROMs apply a projectionprocess directly to the equations governing the high-ﬁdelity model, one can often makemuch stronger performance guarantees (e.g., of structure preservation, of accuracy viaadaptivity) and perform more accurate a posteriori error analysis (e.g., via a poste-riori error bounds or error models) with ROMs. Despite these beneﬁts, the practicalchallenges of implementing nonlinear model-reduction techniques in large-scale codesoften precludes their adoption in practice; this occurs because na¨ıve implementationsrequire modifying low-level operations and solvers for each simulation code of inter-est. This implementation strategy is simply not practical or sustainable in manymodern settings, because industrial simulation codes often evolve rapidly, institutionsmay employ dozens of simulation codes for diﬀerent analyses, and commercial codestypically do not expose the required low-level operators and solvers. This challengeposes perhaps the single largest barrier to widespread adoption of model reduction inindustrial applications.To this end, this paper presents an open-source project that aims to mitigatethe implementation burden of nonlinear model reduction in large-scale applicationswithout compromising performance. We organize the introduction by ﬁrst discussingexisting packages for ROMs, some of their limitations, and then describe in detailwhat we propose.

Overview of the current software landscape.

As previously mentioned, froma software standpoint, most approaches developed so far for ROMs have relied onintrusive implementations into each code of interest, see, e.g., the work in [26, 27, 17,21, 38, 39, 36, 20]. This is impractical for two main reasons: ﬁrst, each existing methodneeds to be re-implemented in any application adapting to various data structure andmodel; second, any time a new ROM algorithm is designed, it would require a separateimplementation within each individual code. To overcome this barrier, recent workssuch as modred [5], libROM , pyMOR [31], and pyROM [34] have sought toprovide frameworks for projection-based ROMs and related capabilities that can beused by external codes without requiring them to add model-reduction-speciﬁc code. libROM is a C++ library mainly focused on providing functionalities to compute https://github.com/LLNL/libROMROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING libROM and must be done by the user, we do not furtherdiscuss this library. modred [5] is a Python library containing implementations of Proper OrthogonalDecomposition (POD) [23], balanced POD (BPOD) [42, 35], Petrov–Galerkin projec-tion (limited only to linear systems), and Dynamic Mode Decomposition (DMD) [37].To use modred , the user needs to provide (a) a vector object supporting additionand scalar multiplication; (b) a function to compute inner products; and (c) a vectorhandle class with a get and a set method. The authors of modred state in [5] thatthe overhead is small, and if performance becomes a problem, the user is advised towrite required kernels in a compiled language (e.g., C/C++ and Fortran) and wrapthem into Python modules for increased speed. pyMOR [31] is a Python library under active development, providing reduced-basis (RB) capabilities and aims to be user friendly and easily integrable with third-party partial-diﬀerential-equations (PDE) solvers. It is similar in concept to modred ,but it provides a more complete and robust framework. Its design is based on theobservation that all high-dimensional operations in RB methods can be expressed interms of a set of elementary linear algebra operations. pyMOR can be seen “as acollection of algorithms operating on VectorArray , Operator , and

Discretization objects”, where the

Discretization object stores the structure of a discrete problemin terms of its operators. pyMOR oﬀers two main ways to be used by an application.One requires the application to output data to disk (or other persistent storage)from which pyMOR reads/loads what is needed. The alternative (and preferred)way requires the application to be re-compiled as a Python extension module, thusallowing pyMOR to access directly the application’s data structures. pyMOR hasvaluable capabilities, e.g., its own basic discretization toolkit, the ability to run thefull oﬄine phase. pyROM [34] is a fairly recent Python framework supporting only Galerkin pro-jection with a basis obtained via POD or DMD. The work explicitly cites pyMOR [31],but it does not clearly articulate how it distinguishes itself from pyMOR . The articlestates that the main operating protocol is one where pyROM reads data from user-supplied ﬁles; this approach is clearly not feasible for large-scale applications, whereI/O is extremely expensive. Further, the work does not articulate how pyROM caninterface with application codes written in languages other than Python.

Limitations of current approaches.

We now describe our viewpoint on someweaknesses of the frameworks introduced above.

Limited to continuum systems: modred , pyMOR and pyROM assume an ap-plication to be expressed as a system of PDEs, and formulate their ROM methodsbased on the discrete system stemming from the spatial discretization of the governingPDEs. This is applicable to continuum models, but not to naturally discrete systemssuch as particle models. Limited to spatial reduction: modred , pyMOR and pyROM support modelreduction methods that reduce only the number of spatial degrees of freedoms. Theydo not support space–time techniques that enable a reduction of the number of tem-poral degrees of freedom [40, 43, 4, 16], which is particularly important for long-time-integration problems. F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Limited to linear subspaces: modred , pyMOR and pyROM focus only on linear subspaces, without discussing potential support for nonlinear manifolds. This raisesquestions as to the extensibility of these libraries to promising recent model-reductiontechniques that operate on nonlinear manifolds [19, 28]. Python as the core development language: modred , pyMOR and pyROM em-ploy Python as the core development language. This seems counterintuitive, as—forperformance reasons—most Python packages such as CPython, NumPy and SciPyare natively written in C, and bindings are provided to expose them to Python. Thisapproach is also being increasingly adopted by large-scale numerical libraries, e.g.,PETSc[2, 3], Trilinos . Furthermore, the authors of modred and pyMOR them-selves explicitly advise users (regardless of the application’s language) to write com-putational kernels in a compiled language, and create wrappers to use within Python.We believe, and discuss in more detail later, that a more advantageous approachwould be to use a compiled language as the core, and to create bindings to exposethese functionalities in a interpreted, high-level programming language. Need for application-speciﬁc Python bindings:

Related to the item above, modred , pyMOR and pyROM require application-speciﬁc Python bindings for any applica-tion written in a compiled language like C/C++/Fortran or any other non-Pythonlanguage. All three libraries require bindings to wrap the application’s data struc-tures. In addition to this, pyMOR requires bindings to wrap the PDE-discretizationoperators. We believe this approach can be problematic for several reasons. First,the development of these Python bindings may be diﬃcult or even unfeasible. This isespecially true for legacy codes and more modern high performance computing (HPC)-oriented packages that use complex C++ templates and structures, like OCCA[30] orKokkos [14] . This is critical because, for performance reasons, most HPC applicationsand libraries targeting large-scale simulations are written in C, C++, or Fortran usingvarious types of data structures. Examples include PETSc, Trilinos, Chombo, ScaLA-PACK, ALGLIB, Blaze, OpenFOAM, OCCA, UPC++, mlpack, NAMD, Charm++,LAMMPS, Espresso++, Gromacs, AMReX. While pyMOR developers oﬀer help tointegrate external PDE solvers, this can remain an obstacle for productivity due tothe turnaround time of this process. The authors of pyMOR [31] aim to overcomethis issue by shipping directly with the library suitable pre-made Python bindingsfor FEniCS, deal.II, Dune, and NGSolve. However, these pre-made bindings imposemaintainability constraints to pyMOR developers, as they need to keep them up-to-date with the advancement of the corresponding libraries. Second, beside the technicalchallenge, the burden of writing bindings in a language diﬀerent than the application’snative one could pose a productivity barrier to HPC developers. Given the choice,we believe developers would typically favor having access to libraries written in thelanguage native to their application. Limited applicability to modern C++ with static polymorphism:

Modern C++applications and libraries leverage compile-time (or static) polymorphism. A criticalexample of its use is for performance portability purposes, see e.g., OCCA, Kokkosand SYCL. These programming models are eﬀectively built as C++ extensions, andaimed at alleviate the user from needing to know architectures-speciﬁc details towrite performant code. They expose a general API allowing users to write code onlyonce, and internally exploit compile-time polymorphism to decide on the proper datalayout, and instantiate the suitable backend depending on the target threaded system, https://github.com/trilinos https://github.com/kokkosROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING Programming model limitations: modred , pyMOR and pyROM omit any dis-cussion and demonstration of their interoperability with applications exploiting emerg-ing programming models, e.g., fully distributed task-based models and hybrid MPI+X,where X represents shared memory parallelism via threads, vectorization, tasking orparallel loop constructs. We believe that supporting interoperability with these pro-gramming models is a critical feature, because these have arisen as the leading com-petitors to address challenges due to hybrid architectures, increased parallelism andmemory hierarchies, and will become increasingly more important. What we propose.

In this paper, we present

Pressio , a computational frame-work aimed at enabling projection-based model reduction for large-scale nonlinear dy-namical systems. We highlight here its main features and how we believe it addressesthe aforementioned limitations of existing frameworks for model reduction. Applicable to a general ODE systems:

Pressio provides ROM capabilities thatare applicable to any system expressible as a parameterized system of ordinary dif-ferential equations (ODEs) as(1.1) d x dt = f ( x , t ; µ ) , x (0; µ ) = x ( µ ) , where x : [0 , T ] × D → R N denotes the state, x : D → R N denotes the initialstate, µ ∈ D ⊆ R n µ denotes the system parameters, f : R N × [0 , T ] × D → R N with f : ( ξ , τ, ν ) f ( ξ , τ, ν ) denotes the velocity that may be linear or nonlinearin its ﬁrst argument, and t ∈ [0 , T ] denotes time with T >

Minimal API:

An application needs to satisfy a minimal application program-ming interface (API). This consists of exposing, for a given state x , time t , andparameters µ , the velocity vector f ( x , t ; µ ) and the action of the Jacobian matrix ∂ f ( x , t ; µ ) /∂ ξ on a dense matrix. Methods:

Pressio supports model-reduction methods applicable to both steadyand unsteady problems. For the latter, in particular, it provides algorithms to re-duce both the spatial and temporal degrees of freedom. A detailed discussion of thesupported methods is provided in § Trial manifold abstraction:

We exploit software design and abstractions to gener-alize the concept of the trial manifold onto which the full-order model (FOM) system isprojected, yielding support for both linear trial manifolds (i.e., linear trial subspaces)and nonlinear trial manifolds.

Suitable for complex nonlinear problems:

Pressio provides ROM capabilitiesthat are meant for complex nonlinear problems, including the least-squares Petrov–Galerkin (LSPG) projection technique [10, 9]. This technique exhibits advantagesover the popular Galerkin projection because it (a) guarantees optimality of the time-discrete ROM, (b) yields smaller discrete-time a posteriori error bounds, (c) andoften yields improved accuracy. These properties have been demonstrated for un-steady CFD applications [11, 9]. Furthermore, the LSPG method has been applied https://github.com/Pressio F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG successfully to highly nonlinear problems including transonic and hypersonic CFDcases [41, 6].

C++ core:

Pressio is built on a C++(11) core, following a model consistentwith the one used by CPython, NumPy, SciPy, as well as other HPC libraries likeTrilinos and PETSc. This approach enables us to have direct integration with nativeC/C++ applications, and to build multiple front ends of various languages all relyingon a single, compiled, optimized, and robust backend. One could argue that Fortrancould have been another choice, which would bring us into a historical debate betweenFortran and C/C++. We opted for C++ because (a) a substantial subset of the large-scale codes across the United States national laboratories is implemented in C++,and (b) C++ seems better suited than Fortran to design complex data structures andabstractions.

No bindings for C++ applications:

Any C++ application can use

Pressio simplyas a regular library, thus avoiding any bindings. This is diﬀerent than modred , pyMOR and pyROM , which require any C++ application to develop individualPython bindings. Eliminating the need for bindings beneﬁts productivity which,in turn, maximizes the impact and appeal of Pressio not only across the nationallaboratories, but also on any other scientiﬁc code written in C++. We remark that thisdoes not mean that any C++ application “magically” works with

Pressio : as statedpreviously, an application must satisfy a minimal but speciﬁc API, which is discussedin § within the application itself using the same programminglanguage. Hence, in any scenario, users and application developers operate withintheir domain of expertise, without needing to resort to a diﬀerent language to interfaceto Pressio . Compile-time beneﬁts and easy deployment:

Pressio is a header-only library writ-ten in C++(11), allowing us to fully exploit metaprogramming and perform extensivechecks at compile time . Being header only, it does not need to be compiled, so theuser can simply point to the source code, yielding a straightforward deployment.

Seamless interoperability with arbitrary programming models:

Pressio is designedwithout making any assumption on the programming model or data types used byan application. Hence, it seamlessly supports arbitrary data types and arbitrarilycomplex programming models, e.g., MPI+X with X being a graphics processing unit(GPU), OpenMP or local tasking.

No binding code for Python applications:

We leverage the modern, C++11 library pybind11 to create Python bindings to Pressio . This enables any user having anapplication fully written in Python to use the C++ backend of

Pressio to run ROMs.No user-deﬁned bindings are needed (see § once , which were then immediately usable by any Pythonapplication. Second, this exempts Python users from having to write any binding tointerface with

Pressio . What are the drawbacks of Pressio?

First, being written in C++, it isnot directly usable in Fortran applications. However, as mentioned above, this canbe addressed by writing Fortran bindings, as it has been done for other libraries, https://github.com/pybind/pybind11ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING (which uses Babel to write its Fortran interface), and ForTrilinos (whichis based on SWIG ).Second, the use of metaprogramming might constitute a barrier to entry for de-velopers who are not familiar with this technique. However, considering that genericprogramming is becoming increasingly more important and widespread, it is a usefulskill to learn. Hence, working on Pressio can be advantageous for developers and, atthe same time, beneﬁcial to the actual code development.Third, by relying on a minimal, ODE-oriented API, when a new model-reductiontechnique is developed, one needs to ensure (if possible) that its formulation is compat-ible with the same API. This might pose some limitations, e.g., for certain structure-preserving methods that leverage the underlying second-order or Hamiltonian struc-ture [15] or the underlying ﬁnite-element [18], ﬁnite-volume [13], or discontinuousGalerkin [44] spatial discretization. In the worst case scenario, this can be addressedby augmenting the API for

Pressio for new developments; indeed, we have alreadyprovided the required extension to enable conservative LSPG [12] for ﬁnite-volumemodels.

Paper organization.

The paper is organized as follows. In §

2, we describe theformulation of the ROMs currently supported in

Pressio , as well as those planned forthe next release. In §

3, we discuss the implementation, the application programminginterface ( § § § §

4, and conclusions in §

2. Reduced-Order Models Formulation.

This section presents the mathe-matical formulation of the ROMs currently supported in

Pressio , as well as somedetails about those scheduled for integration in upcoming releases.We consider the original computational model—which we refer to as the full-ordermodel (FOM)—to be a dynamical system expressible as a system of parameterizedODEs of the form (1.1), which can be expressed equivalently as(2.1) r ( ˙ x , x , t ; µ ) = , x (0; µ ) = x ( µ ) , where r : R N × R N × [0 , T ] × D → R N denotes the (time-continuous) residual functiondeﬁned as r := ˙ x − f ( x , t ; µ ). Projection-based model-reduction methods seek an ap-proximate solution ˜ x ( ≈ x ) of the form(2.2) ˜ x ( t ; µ ) = x ref ( µ ) + g (ˆ x ( t ; µ )) , where ˜ x : [0 , T ] × D → x ref ( µ ) + M ⊆ R N and M := { g (ˆ ξ ) | ˆ ξ ∈ R p } ⊆ R N denotesthe (linear or nonlinear) trial manifold of R N . Here, x ref : D → R N denotes aparameterized reference state and g : ˆ ξ g (ˆ ξ ) with g : R p → R N and p ≤ N denotesthe parameterization function mapping low-dimensional generalized coordinates ˆ x tothe high-dimensional state approximation ˜ x . Henceforth, we will refer to this mappingas the “decoder” function, consistently with terminology employed in [28], where thisfunction associated with a convolutional decoder. The approximate velocity can beobtained via chain rule yielding(2.3) ˙˜ x ( t ; µ ) = J (ˆ x ( t ; µ )) ˙ˆ x ( t ; µ ) , https://github.com/hypre-space/hypre https://computing.llnl.gov/projects/babel-high-performance-language-interoperability https://github.com/trilinos/ForTrilinos F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG where J : ˆ ξ d g d ˆ ξ (ˆ ξ ) with J : R p → R N × p denoting the Jacobian of the decoder. Linear trial manifold.

When g is a linear mapping, one recovers the classicalaﬃne trial subspace. In this case, the decoder can be expressed as g : ˆ ξ Φ ˆ ξ forsome trial basis matrix Φ ∈ R N × p? , where R n × m? denotes the set of full-column-rank n × m matrices. In this case, the trial manifold is aﬃne with M = Ran( Φ ) and theJacobian a constant matrix J (ˆ x ( t ; µ )) = J = Φ , leading to the following expressionof the approximate state and velocity(2.4) ˜ x ( t ; µ ) = x ( µ ) + Φ ˆ x ( t ; µ ) and ˙˜ x ( t ; µ ) = Φ ˙ˆ x ( t ; µ ) . Manifold Galerkin projection as proposed in[28] can be derived by minimizing the time-continuous residual over the trial manifold.The resulting model can be obtained by substituting the approximate state (2.2) andcorresponding velocity (2.3) into Eq. (2.1), and minimizing the (weighted) ‘ -norm ofthe resulting residual, which yields the ODE system(2.5) ˙ˆ x ( t ; µ ) = arg min ˆ v ∈ R p k Ar ( J (ˆ x ( t ; µ ))ˆ v , x ref ( µ ) + g (ˆ x ( t ; µ )) , t ; µ ) k , which can be written equivalently as(2.6) ˙ˆ x ( t ; µ ) = ( AJ (ˆ x ( t ; µ )) + Af ( x ref ( µ ) + g (ˆ x ( t ; µ )) , t ; µ ) , where the superscript + denotes the Moore–Penrose pseudoinverse, ˆ x (0; µ ) = ˆ x ( µ )is the reduced initial condition (e.g., computed by performing orthogonal projectionof the x ( µ ) onto the trial manifold [28]), and A ∈ R z × N with z ≤ N is a weightingmatrix that enables the deﬁnition of a weighted (semi)norm. The ODE system (2.6)can be integrated in time using any time stepping scheme.If we select the N × N identity matrix, i.e. A = I , the projection is typicallyinsuﬃcient to yield computational savings when the velocity is nonlinear in its ﬁrstargument, as the N -dimensional nonlinear operators must be repeatedly computed,thus precluding an N -independent online operation count. To overcome this compu-tational bottleneck, one can choose A to have a small number of nonzero columns.When the residual Jacobian is sparse, this sparsity pattern leads to “hyper-reduction”,which yields an N -independent online operation count by approximating the nonlinearoperators by computing a small ( N -independent) subset of their elements. Examplesinclude collocation, i.e., A = P with P comprising selected rows of the identity ma-trix, and the GNAT method [10, 11], i.e., A = ( P Φ r ) + P with Φ r ∈ R N × z? comprisinga basis matrix for the residual.Manifold Galerkin is currently supported in Pressio for both linear and nonlineartrial manifolds. For the linear case, the user needs to provide the constant basis matrix Φ . For the nonlinear case, the user is responsible for implementing the decodermapping g , which Pressio uses without knowledge of its implementation details. Ina future release, we plan to add support for a set of default choices for nonlinearmappings, e.g., convolutional autoencoders as in [28]. For the weighting matrix, wecurrently support A = I , the collocation-based hyper-reduction A = P , and anarbitrary weighting matrix speciﬁed by the user, while the GNAT-based weightingmatrix A = ( P Φ r ) + P is under development. In contrastto Galerkin projection, LSPG projection corresponds to minimizing the (weighted) ‘ -norm of the time-discrete residual over the trial manifold. Hence, the starting point for ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING implicit linear multistep methods fortwo primary reasons: ﬁrst, Galerkin and LSPG are equivalent for explicit schemes [9];second, at the time of this writing, implicit linear multistep schemes are the mainimplicit schemes supported in

Pressio . Other families of time stepping schemes,e.g., diagonally implicit Runge–Kutta, will be supported in future developments; [9]describes LSPG for Runge–Kutta schemes in detail.A linear k -step method applied to numerically solve the problem (1.1) leads tosolving a sequence of systems of algebraic equations(2.7) r n ( x n ; µ ) = , n = 1 , . . . , N t , where N t denotes the total number of instances, x n denotes the state at the n th timeinstance, and r n : R N × D → R N denotes the time-discrete residual. At the n -th timeinstance, we can then write: α x n − ∆ tβ f ( x n , t n ; µ ) k X j =1 α j x n − j ∆ t k X j =1 β j f ( x n − j , t n − j ; µ ) = 0 , (2.8)where ∆ t ∈ R + denotes the time step, x k denotes the numerical approximation to x ( k ∆ t ; µ ), the coeﬃcients α j and β j , j = 0 , . . . , k with P kj =0 α j = 0 deﬁne a partic-ular multistep scheme, and β = 0 is required for implicit schemes. For simplicity, weassume a uniform time step ∆ t and a ﬁxed number of steps k for each time instance,but extensions to nonuniform time grids and a non-constant value of k are straight-forward. For example, setting k = 1 and α = β = 1, α = − β = 0, yields thebackward Euler method.LSPG projection is derived by substituting the approximate state (2.2) in thetime-discrete residual (2.7) and minimizing its (weighted) ‘ -norm, which yields asequence of residual-minimization problems(2.9) ˆ x n ( µ ) = arg min ˆ ξ ∈ R p (cid:13)(cid:13)(cid:13) Ar n (cid:16) x ref ( µ ) + g (ˆ ξ ); µ ) (cid:17)(cid:13)(cid:13)(cid:13) , n = 1 , . . . , N t with initial condition ˆ x (0; µ ) = ˆ x ( µ ). As in § A ∈ R z × N with z ≤ N denotes aweighting matrix that can enable hyper-reduction. LSPG formulation for stationary problems.

Manifold LSPG can be applied tosteady-state problems by computing x ( µ ) as the solution to Eq. (1.1) with ˙ x = ,which yields(2.10) f ( x ; µ ) = , where the time-independent velocity is equivalent to the residual of the stationary-problem. Substituting the approximate state (2.2) into Eq. (2.10) and minimizing the ‘ -norm of the residual yields(2.11) ˆ x ( µ ) = arg min ˆ ξ ∈ R p (cid:13)(cid:13)(cid:13) Af (cid:16) x ref ( µ ) + g (ˆ ξ ); µ ) (cid:17)(cid:13)(cid:13)(cid:13) , which has the same nonlinear-least-squares form obtained for a time-dependent prob-lem shown in Eq. (2.9).0 F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Constrained LSPG formulation.

If the minimization problems (2.9) and (2.11)are equipped with constraints, one obtains unsteady and steady constrained LSPGproblems, respectively. An example within this category of models is the conservativeLSPG formulation in [13], which ensures the reduced-order model is conservativeover subdomains of the problem in the case of ﬁnite-volume discretizations. Thismethod relies on a nonlinear equality constraint to enforce conservation. The readeris referred to [13] for additional details on conservative LSPG projection, includingfeasibility conditions of the associated optimization problems, and error bounds.

Pressio supports manifold LSPG and conservative LSPG for both steady andunsteady problems. As with manifold Galerkin projection, it supports weighting ma-trices A = I , the collocation-based hyper-reduction approach A = P , or an arbitraryweighting matrix speciﬁed by the user. Native support for the GNAT-based weightingmatrix is under development. In § § spatial dimensionality of a dynamical system; this yields a ROM thatstill needs to be resolved in time via numerical integration. In some cases, the reducedsystem can be integrated with a time step larger than the one used for the full-ordermodel, see [1]. However, this still imposes a limitation to the achievable computationalsavings. To address the time dimension directly, one can employ space–time model-reduction methods that simultaneously reduce both the number spatial and tempo-ral degrees of freedom of the ODE system. To address this need, we are currentlyimplementing in Pressio the space–time least-squares Petrov–Galerkin (ST-LSPG)projection method [16] and the windowed least-squares (WLS) method [33]. Becausethese techniques will be released in the future, we brieﬂy describe here the main ideabehind them, but omit the full mathematical details. For a complete formulation ofST-LSPG and WLS, the reader is directed to the aforementioned references.ST-LSPG relies on a low-dimensional space–time trial manifold, which—in thecase of a linear space–time trial manifold—can be obtained by computing tensordecompositions of snapshot matrices constructed from spatio-temporal state data.The method then computes discrete-optimal approximations in this space–time trialsubspace by minimizing the (weighted) ‘ -norm of the space–time residual, whichcorresponds to the concatenation of the time-discrete residual over all time steps.The windowed least-squares (WLS) [33] method is a model-reduction approachcomprising a generalization of Galerkin, LSPG, and ST-LSPG. WLS operates bysequentially minimizing the dynamical system residual within a low-dimensional trialmanifold over time windows. WLS can be formulated for two types of trial manifolds: spatial trial manifolds that reduce the spatial dimension of the full-order model, and space–time trial manifolds that reduce the spatio-temporal dimensions of the full-order model.

3. Pressio.

The core component of

Pressio is a C++(11) header-only librarythat leverages generic programming (i.e., templates) to enable support for nativeC++ applications with arbitrary data types. Metaprogramming is used for typechecking and introspection, to ensure that types are compatible with the design andrequirements. Being header-only,

Pressio does not need to be separately compiled,packaged, or installed; the user must only point to the source code. The developmentis based on the following objectives: (a) support for general computational models(data structures, problem types), (b) heterogeneous hardware (many-core, multi-core,accelerators), (c) performance, and (d) extensibility. The key advantage of having asingle framework is that when a new method is developed, it can be immediately

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING

Pressio . Pressio is modularized into separatepackages as follows: • mpl : metaprogramming functionalities • utils : common functionalities, e.g., I/O helpers, static constants, etc. • containers : data strutures • ops : linear algebra operations • qr : QR factorization functionalities • svd : singular value decomposition (SVD) functionalities • solvers : linear and nonlinear solvers • ode : explicit and implict time steppers and integrators • rom : reduced-order modeling methodsThe order used above is informative of the dependency structure. For example, everypackage depends on mpl , but qr depends only on mpl, utils, containers . The rom package resides at the bottom of the hierarchy and depends on the others. Eachpackage contains corresponding unit and regression tests. Thedesign of

Pressio / rom is based on the observation that all methods listed in § f , and the action of the Jacobian matrix ∂ f /∂ ξ , on an operand, B ,which is a tall, skinny matrix (e.g., the decoder Jacobian d g /d ˆ ξ ). Querying the actionof the Jacobian on an operand rather than the Jacobian itself is essential, because inmany applications the Jacobian itself cannot easily be explicitly assembled or stored.We envision two scenarios: one in which an application natively supports theabove functionalities, and one in which it does not. In the ﬁrst scenario, minimalwork (e.g., rearranging some functions in the native code) is needed to adapt theapplication to use Pressio / rom . We believe this best case scenario to be a reasonableexpectation from applications using expressive syntax, well-deﬁned abstractions andencapsulation. These software engineering practices guide developers to design codesusing proper abstractions directly expressing the target mathematical concept. Forsystems of the form (2.1), this means implementing the velocity f as a self-containedfree function or as a class method encapsulating the implementation details. This kindof abstraction and design would satisfy the best practices for software engineering andalso naturally ﬁts our expressive API model. To further emphasize this point, we re-mark that the same expressive model is being used/expected by other well-establishedlibraries, e.g., Sundials[22], SciPy.ode. If an application needs to be interfaced withthese libraries, the developers would need to meet a similar, expressive, intuitive APIdesigned for dynamical systems.When an application does not readily expose the required functionalities, the useror developer needs to expose them. This can be done, e.g., by writing an adapter classinside the application. Even for this scenario, we argue that the eﬀort needed is easilymanageable, and even useful, for the following reasons: (a) the adapter class needs tobe developed using the same language of the application; (b) to write the adapter, adeveloper/user would still operate within their domain of knowledge; and (c) once thisadapter class is complete, the application would gain (almost) seamless compatibilitywith other libraries targeting dynamical systems since most of them are based on thesame API model.Figure 1 shows a schematic of the interaction between Pressio / rom and a genericapplication. The application owns the information deﬁning the problem, i.e., theparameters µ , mesh, inputs, etc.. Pressio is used as an external library, and the in-2

F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG mpl utils containers qr svd solvers oderom P r e ss i o int main() Adapter (if needed)

Application Code (owns µ , mesh, etc.) A pp li c a t i o nS i d e ˜ x , t, B f , ∂ f ∂ ξ B ˜ x , t, B f , ∂ f ∂ ξ B Fig. 1 . Schematic of the interaction between

Pressio and the application, highlighting elementsbelonging to

Pressio and those belonging to the application. The solid, black line ( −◦ ) indicates thatthe main function is where all the various components are used. The red arrow is used to indicateinformation/data computed by and exiting Pressio , while blue is used to denote information/datacomputed by the application and entering

Pressio . For a steady problem, the schematic would besimilar without the time dependence. teraction can be seen simply as an exchange of information: for a given approximatestate ˜ x , time t , and operand B , Pressio / rom queries the application for the asso-ciated velocity f (˜ x , t ; µ ) and Jacobian action ∂ f ∂ ξ (˜ x , t ; µ ) B . In the schematic, a redarrow indicates the information/data exiting Pressio and entering the application,while a blue arrow denotes the reverse. If the adapter is not needed, then

Pressio in-teracts directly with the core application classes. When the problem is steady, timedependency simply drops.Listing 1 shows an example of a valid C++ adapter class satisfying the APIfor Galerkin and LSPG for unsteady problems. The listing shows four type aliasingdeclarations and four methods. If any of them are missing, or the syntax shown is notmet exactly, a compile-time error is thrown. This static checking allows us to ensurethe correctness of the API. For performance reasons, the non-void methods are onlycalled once during the setup stage to copy construct the objects. More details on theoverall software design are presented in the supplementary material.All the methods are const -qualiﬁed to ensure const-correctness.

Pressio doesnot cast away the const qualiﬁcation of the adapter/application object to preserve itsvisibility. The user can choose to implement the adapter class’ methods such thatno modiﬁcation happens for the data members, or explicitly use the mutable

C++keyword. The mutable keyword allows one to qualify data members that can bemodiﬁed within those const -qualiﬁed methods. This is typically used for cases, likethis one, where one may have to mutate the state for an implementation detail, whilekeeping the visible state of an object the same.The adapter class’ methods accept arguments by reference to avoid (potentially)expensive copies of large objects. If an application relies on data structures with“view semantics”, see e.g., Kokkos or Tpetra in Trilinos, the user can simply omitthe references in the arguments of the adapter class methods. View semantics meansthat an assignment operator performs a shallow-copy operation, i.e., only the memoryreference and dimensions are copied. View semantics thus behave like the C++ shared

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING Listing 1

Skeleton C++ adapter class satisfying the unsteady API for Galerkin and LSPG ROMs. class SampleAdapterClass { // ... public : /* C ++11 type aliasing declarations that Pressio detects */ /* this is equivalent to doing : typedef ... scalar_type */ using scalar_type = /* application ’s scalar type */ ; using state_type = /* state type */ ; using velocity_type = /* velocity type */ ; using dense_matrix_type = /* dense matrix type */ ; // ... // compute velocity , f(x ,t ;...) , for a given state , x(t) void velocity ( const state_type & x , const scalar_type & t , velocity_type & f) const ; // given current state x(t ): // 1. compute the spatial Jacobian , df / dx // 2. compute A= df / dx *B , B is typically a skinny dense matrix void applyJacobian ( const state_type & x , const dense_matrix_type & B , const scalar_type & t , dense_matrix_type & A) const ; // overload called once to construct an initial object velocity_type velocity ( const state_type & x , const scalar_type & t) const ; // overload called once to construct an initial object dense_matrix_type applyJacobian ( const state_type & x , const dense_matrix_type & B , const scalar_type & t) const ; pointer semantics. While it would still work to accept by reference objects of typeswith view semantics, it would be more appropriate to pass them by copy to ensureproper reference counting. Support for ar-bitrary data types is enabled via generic programming and type introspection. Genericprogramming (i.e., C++ templates) implies that, implementation-wise,

Pressio cannaturally accomodate arbitrary types. Quoting from [32]: “Generic programming cen-ters around the idea of abstracting from concrete, eﬃcient algorithms to obtain genericalgorithms that can be combined with diﬀerent data representations to produce a widevariety of useful software”. Broadly speaking, we can say that

Pressio contains algo-rithms and operations written abstractly as high-level steps to operate on a generictype. This, however, is a necessary but not suﬃcient condition to enable arbitrarytypes. It is clear, in fact, that some class or functions templates performing speciﬁcoperations are applicable only to types satisfying certain conditions. This is wheretype introspection comes into play.4

F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Type introspection allows the examination of a type to analyze its properties atcompile-time using metaprogramming techniques. This is the foundation of designby introspection, a programming paradigm that relies on performing introspectionon components and making decisions during compilation. For the sake of argument,think of a function template applicable only to types that can be default constructed.Introspection can be used to examine the template argument to detect if it is defaultconstructible, and diﬀerentiate code according to whether this property is satisﬁed ornot. Another example is the ability to selectively enable/disable functions or classesbased on whether the corresponding template arguments satisfy some requirements.The above approach is not exclusive to achieve support for arbitrary types. Onepossible alternative is based on inheritance and dynamic polymorphism. In this case,the library would contain speciﬁc base (or abstract) classes to represent well-deﬁnedinterfaces for target functionalities. These abstract classes are used via dynamicpolymorphism inside the algorithms. Users would be able to employ their data typesby inheriting from the base classes, thus creating new types that are fully compatiblewith the library algorithms. This is a valid approach, but has some disadvantages:(a) templates and type introspection is an approach more suitable than inheritanceto obtain type safety; (b) dynamic polymorphism has an overhead (due to a virtualtable lookup) for each function invocation; (c) type introspection fully exploits thepower of templates, which is a key feature of C++; (d) design by introspection willbecome even more natural to adopt and use when C++ concepts become part of thestandard.

Pressio performs introspection on the types instantiated by the user and, there-fore, can decide at compile-time to select specialized implementations of type-speciﬁcalgebra operations. To this end, we have implemented in

Pressio a suite of pre-deﬁned implementations of basic algebra operations for several widely-used libraries,e.g., Trilinos, Kokkos, Eigen, Armadillo. Additional implementations for other pack-ages, like PETSc and hypre, are planned for upcoming releases. Whenever possible,

Pressio exploits the implementations native to the target libraries to perform alloperations. This approach allows us to fully exploit the native performance of theselibraries beneﬁting from the substantial amount of work done to achieve optimalperformance in these libraries. If an external application uses a custom set of datastructures and operations, or if types-speciﬁc operations are not already supported in

Pressio , the user needs to provide a class implementing speciﬁc operations needed tocarry out all of the necessary ROM computations, e.g., dot products, matrix-vectorand matrix-matrix products. One advantage of this approach is that, if needed, theuser can “plug-and-play” and test various implementations of the low-level algebraoperations.While this approach might resemble the one in modred , pyMOR and pyROM ,we highlight the following diﬀerences. modred , pyMOR and pyROM requireC/C++/Fortran users to write custom Python—which is not the native applica-tion’s language—bindings to wrap their application’s data structures. In addition, pyMOR needs bindings to wrap the application’s PDE operators. Pressio , on thecontrary, does not need bindings for the application’s data structures, since these justbecome template arguments. Also,

Pressio only requires (in certain cases) a singleclass containing a small set of basic algebra operations. We stress that this class withalgebra operations requires little eﬀort from the user, because (a) it is written in theapplication’s native language; and (b) it simply involves forwarding calls to opera-tions and methods already implemented by the application’s native data structures.Finally,

Pressio does not need any information about the application’s spatial oper-

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING

Pressio are onlybased on the assumption that a target system is represented as a system of ODEs.

The notion of hyper-reduction (or sample mesh) was introduced in §

2, and consists of approximating thenonlinear operators (residual and, if needed, the Jacobian) of the full order modelby sampling only a subset of elements. Equipping a ROM with this technique yieldslarge computational savings, see e.g., [11], and is thus a critical component. A detaileddescription and comparison of various hyper-reduction techniques can be found in [11].In

Pressio , one can explore the eﬀects of hyper-reduction in two ways. Theﬁrst, which we refer to as “algebraic hyper-reduction”, requires the user to simplyprovide the desired weighting matrix, A , which Pressio applies to the FOM operatorsto extract the target subset of elements. The immediate beneﬁt is a much smallernonlinear minimization problem to solve, see Eqs. (2.5), (2.9) and (2.11). While easyto implement, the main drawback is that the FOM still needs to compute the fulloperators, whose cost typically dominates the cost to solve the minimization problem.This implies that one cannot expect to obtain substantial savings. Hence, the algebraichyper-reduction allows one to just mimic the eﬀect of a real implementation.The second (preferred) way requires an application to have the capability of eval-uating the nonlinear operators only at target locations of the mesh, i.e., a “samplemesh”. This scenario considerably reduces the cost of evaluating the FOM’s oper-ators, thus accentuating the beneﬁts of hyper-reduction, see e.g., [6]. This secondapproach, however, is strictly dependent on the application’s numerical method andmesh, and thus needs to be implemented by the user directly within the application.We remark, however, that while this task can be intrusive, it is paradoxically easier toimplement for complex applications, e.g., those based on an unstructured mesh. Onthe one hand, the complexity of these meshes is a disadvantage because one has tospecify the full connectivity between the elements. On the other hand, the complexityof an unstructured mesh is an advantage because it allows more freedom in how theelements are assembled, which clearly beneﬁts the construction of the sample mesh.This will be further discussed in § To increase the impact and widen the range of potential usersof

Pressio , we have developed pressio4py , a library of Python bindings to Pressio . pressio4py is implemented using pybind11 [25], a lightweight open-source librarythat leverages C++11 and metaprogramming to facilitate the interoperability betweenC++ and Python. The objective of pressio4py is twofold. First, to allow any Python application based on NumPy to use the capabilities in

Pressio without anyadditional glue code. The only requirement is that the Python application meets anAPI similar to the one shown in listing 1. Second, to enable a Python applicationto obtain a performance comparable to the one obtained with a C++ applicationnatively interfacing with

Pressio .Developing pressio4py has required minimal eﬀort, mostly in the form of add-onfunctionalities needing minor modiﬁcations to the core C++

Pressio code. This isa result of two main factors, namely the quality and ease of use of pybind11 , andthe modularity of

Pressio , which allowed us to develop extensible and maintainablebindings relatively quickly and easily. More speciﬁcally, the work involved two mainstages. One aimed at enabling

Pressio to support pybind11::array t . The latteris a class template in pybind11 to bind a NumPy array from the C++ side. Using https://github.com/Pressio/pressio4py F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Listing 2

Sample Python adapter class satisfying the API for pressio4py . class Adapter : def __init__ ( self , * args ): def velocity ( self , x , t ): return self .f def applyJacobian ( self , x , B , t ): return self . Jac . dot (B) pybind11::array t allows us to optimize interoperability with Python since no copyneeds to occur when a NumPy array from the Python side is passed to a C++ functionaccepting a pybind11::array t . This also allows us to bypass the Python interpreter,and perform all operations by working directly on the underlying data, thus yieldinga negligible overhead. A key property of pybind11::array t is that it has viewsemantics and, therefore, special attention is needed depending on whether a deepor shallow copy is needed. To achieve this, we leverage type introspection allowing Pressio to support pybind11::array t in a straightforward and natural way. Thesecond stage of the work involved developing the actual pybind11 -based binding codeto expose the

Pressio functionalities to Python. This stage beneﬁted from the ease ofuse and power of pybind11 , which is written to naturally target C++ and, in mostsituations, leading to simple binding code. This is what we observed for

Pressio . Forexample, the binding code to expose the basic LSPG functionalities is less than 100lines.Listing 2 shows an example of a Python class satisfying the API for pressio4py ,revealing clear similarities with the C++ listing 1. When instantiating a ROM prob-lem, the user can decide to provide an object with custom algebra operations (e.g., tospecify how to perform matrix-vector products) for pressio4py to use where needed,or let pressio4py rely on default implementation of these operations. In the latterscenario, pressio4py operates by making direct calls to NumPy and SciPy function-alities to act directly on pybind11::array t data. This is enabled by importing theNumPy or SciPy modules directly inside the C++ code as pybind11::objects . Forexample, if we need to compute a on-node matrix-vector product, we obtain optimalperformance by calling the Blas function dgemv to operate directly on the underlyingmatrix and vector data. One scenario where this is not entirely possible is when auser wants to run a ROM with hyper-reduction. In this case, the user is requiredto provide custom functionalities, namely the computation of the time-discrete op-erators. Nothing else changes as far as matrix and vector manipulations since these

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING pressio4py .

4. Results.

This section aims at demonstrating the use of

Pressio in two diﬀer-ent scenarios: in § § pressio4py . This section presents representative re-sults obtained using SPARC (Sandia Parallel Aerodynamics and Reentry Code).SPARC is a Computational Fluid Dynamics (CFD) code implemented in C++ forsolving aerodynamics and aerothermodynamics problems. It solves the compressibleNavier–Stokes and Reynolds-averaged Navier–Stokes (RANS) equations on structuredand unstructured grids using a cell-centered ﬁnite-volume discretization scheme, andthe transient heat and associated equations for non-decomposing and decomposingablators on unstructured grids using a Galerkin ﬁnite-element method [24]. SPARCrelies on the distributed numerical linear algebra data structures in Tpetra which, inturn, leverages the Kokkos programming model to achieve performance portabilityacross next-generation computing platforms. The reader is referred to [24] for detailson the governing equations and code design of SPARC.The version of SPARC we used for our test did not natively expose the API neededby

Pressio shown in 1. However, its modular design made the implementation of anadapter class quite straightforward. In fact, the “velocity” and the evaluation of theJacobian functionalities were already present within SPARC—used by the native codewithin each nonlinear solver iteration—despite being buried within a few abstractionlayers. Therefore, we just needed to create an adapter class bypassing these layersand accessing these functionalities directly. Having access to the spatial Jacobian,computing the applyJacobian required by

Pressio (see 1) was straightforward be-cause it simply involved calling a method of the native Jacobian’s data structure tocompute its action on the operand B . Overall, we can say that implementing theadapter class was quite easy, requiring no modiﬁcations to the core SPARC code, butjust the addition of a few methods including those to initialize the discretization, readin basis vectors, and post-process the ROM results. This experience is an example ofwhat was discussed in § Pressio due to its modular and extensible design.As a test case, we choose a CFD simulation of the Blottner sphere, a commonlyused validation case for hypersonic CFD solvers [7]. It models a hypersonic ﬂowimpacting on a semi-sphere similar to the round tips commonly found on hypersonicaircraft and spacecraft. A schematic of the problem is shown in Figure 2 (a). Themain feature is the bow shockwave just upstream of the sphere, which can be diﬃcultto capture accurately due to the shock strength and a numerical phenomena know asthe carbuncle problem [29]. Non-dimensional quantities characterizing the problemare the Mach number, M = 5 .

0, and the Reynolds Number, Re = 1 , , , ,

304 cells shown in Figure 2 (b). The mesh is designed so that the cells arewell aligned with the bow shock and the boundary layer on the sphere surface, andit extends upstream enough to capture the bow shock but not so far so as to controlthe cell count. Since the ﬂow is modeled as laminar, there are 5 conserved quantities,yielding the total FOM degrees of freedom to be N = 20 , , M = 5 . F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG s e m i - s ph e r e s h o c k d o m a i n ρ ∞ V ∞ T ∞ (a) (b) (c) Fig. 2 . Panel (a): schematic of Blottner sphere, where ρ ∞ , V ∞ and T ∞ refer to the free-stream density, velocity and temperature, respectively. The computational domain is displayed witha dashed line. The bow shock to the left of the semi-sphere is where the hypersonic airﬂow almostsuddenly decelerates to subsonic velocity, with density and temperature increasing near the spheresurface. Note that the schematic is shown in two dimensions for clarity, but the actual simulationis fully three-dimensional. Panel (b): angled front view of the full 3D mesh used for the full-ordermodel of the Blottner sphere. Panel (c): angled front view of one of the sample meshes used to runthe ROMs with hyper-reduction, which is composed of , randomly selected cells in which theresidual is sampled, along with neighboring cells and neighbors of neighbors, yielding , cells intotal, equivalent to ∼ of the full mesh. The pseudo time stepping is performed using the (implicit) backward Euler schemewith ﬁxed time step ∆ t = 2 . × − seconds, yielding 64 time instances. At eachtime instance, the nonlinear system resulting from the implicit time discretization issolved using a Newton–Raphson solver until the ‘ -norm of the residual is reducedby three orders of magnitude [24]; the initial guess is taken to be the solution at theprevious time instance. During the pseudo-transient phase, the bow shock upstreamof the sphere moves and deforms until it reaches a steady state.We consider the LSPG method described in § µ —which in this case correspond to boundary conditions and physical constants—and assess the ability of the ROM to reproduce the FOM results. A predictive ROMstudy is outside the scope of the present work, and we refer the reader to [6] for apaper showing ROMs developed in Pressio yielding good performance in a predictivesetting for a hypersonic-ﬂow problem. To compute the basis matrix Φ , we takesnapshots of the full-order model state at all 64 time instances, subtract the initialcondition x ( µ ) from each snapshot, and compute the corresponding POD modes.We consider four cases, namely p = 8, 16, 32, and 64 modes. For the ROM, weuse the same time discretization method and time step size as the FOM. The LSPGminimization problem in Eq. (2.9) employs x ref = x and is solved using the Gauss–Newton method solver implemented in Pressio that assembles and solves the normalequations, terminating when the ‘ -norm of the normal-equations relative residual isreduced by three orders of magnitude; the initial guess is taken to be the solution atthe previous time instance. We consider two diﬀerent choices of the weighting matrix A for the problem in Eq. (2.9):

1. Diagonal matrix without hyper-reduction: we choose A = D ∈ R N × N , where D ∈ R N × N is a diagonal matrix whose elements correspond to the sizes of the controlvolumes in the ﬁnite-volume discretization divided by the time step ∆ t . We found that ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING A greatly improves the convergence rate of Gauss–Newton iterations.This case is useful for demonstration purposes but is expected not to yield substantialcomputational savings because no hyper-reduction is employed. For brevity, belowwe refer to this case with the abbreviation sm to indicate that this uses 100% ofthe full mesh.

2. Diagonal matrix with collocation hyper-reduction: we choose A = P D ∈ R z × N , where D ∈ R N × N is the same as the case above, and the sample matrix P ∈ { , } z × N corresponds to z ( (cid:28) N ) randomly selected rows of the identity matrix.Note that, as suggested in [11], this random sampling is done such that we obtaincells at each boundary. To enable this in SPARC we use a sample mesh. The maindiﬃculty of using a sample mesh stems from the need to perform computations onits odd topology comprised of disjoint sets of cells. However, this proved relativelyeasy in SPARC because of its support for unstructured meshes—speciﬁcally those inthe EXODUS II ﬁle format [24]. Data structures for unstructured meshes are, infact, well-suited to handle a sample mesh (as anticipated in § § , sm , sm and sm , respectively. A visualization ofthe sample mesh containing 1% of the full mesh is shown in Figure 2 (c).All of the results presented below for the Blottner sphere are run on a Sandiacluster containing a total of 1 ,

848 nodes with Inﬁniband interconnect, each nodehaving two 8-core Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz and 65GB of memory.To run a FOM simulation, we use 512 MPI processes, distributed over 32 nodes. Thisconﬁguration follows SPARC’s requirements and allows us to balance the memoryfootprint of the application. The hyper-reduced ROMs run on a single node, sincethe sample mesh is small enough. The mesh for the ROM without hyper-reductionis distributed using the same number of nodes used for the FOM. We choose Kokkosfor the ROM data structures and computations. This is advantageous for two mainreasons: ﬁrst, it yields seamless interoperability with the Tpetra-based data structuresof SPARC; second, it allows us, in the future, to tackle more complex SPARC problemsby leveraging on-node threading models or directly on the GPU without modifying thecode. In this work, Kokkos is built with a serial backend. A detailed study comparingvarious threading models for running the ROM is left to future work.

Accuracy:

Figure 3 shows representative visualizations of the ﬂow, color-codedby the Mach number, and the sphere surface, color-coded by the local heat ﬂux.The plots show that the ROM faithfully captures the bow shock and high heatingaround the ﬂow stagnation point, yielding diﬀerences with the FOM that are nearlyindistinguishable. To quantify these diﬀerences, we report in Table 1 the relativeerrors of the ROM with respect to the FOM measured in the discrete ‘ - and ‘ ∞ -norms for the state vector on the entire computational domain, and vectors containingthe wall heat ﬂux and wall shear stress at all surface cell faces. For each sample meshexcept sm , the error decreases monotonically with the reduced dimension p (i.e.,the number of basis vectors). We emphasize that as the reduced dimension increases,monotonic decrease in the errors reported here is not guaranteed, even when full meshsampling is used. Overall, we can say that for all conﬁgurations explored, the ROM https://gsjaardema.github.io/seacas/html/index.html F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG (a) (b) (c)

Fig. 3 . Three-dimensional visualizations of the Blottner sphere problem at the ﬁnal time step(i.e., steady state) displaying the ﬂow color-coded by the Mach number ( M ) and the sphere surfacecolor-coded by the local heat ﬂux. Panel (a) shows the results from a full-order model simulation.Panels (b,c) show those obtained from the LSPG ROM with 8 POD modes and % sample mesh:in (b) we show the sample mesh only, and in (c) we plot the ROM-based solution reconstructed overthe full mesh. accurately reproduces the FOM results. Since this is a reproductive ROM, when weuse the full set of POD modes, p = 64, one would expect to recover the FOM statecommitting an error that is close to machine zero. For this test case, this does notoccur because, as anticipated before, the convergence criterion for both the FOM andROM nonlinear solvers is not the absolute norm of the residual reaching machinezero, rather a reduction of its relative norm. Hence, this tolerance poses a limitation.We remark that a convergence criterion based on the relative residual norm reductionis typical for large-scale CFD problems, since obtaining machine zero is generallyunfeasible. Performance:

Figure 4 summarizes performance results in terms of the computa-tional cost reduction factor of the LSPG ROM with respect to the FOM, computed asthe ratio between the FOM core-hours and ROM core-hours. The results are groupedby the sample mesh size. The computational savings for the hyper-reduced ROMsare substantial. The largest speed up ( > X ) is obtained with the 1% sample meshand p = 8, while the smallest ( ∼ X ) is obtained with the 10% sample mesh and p = 64. It is also interesting to note that the ROM without hyper-reduction yieldsa gain when using p = 8. As expected, if the reduced system is large enough andno hyper-reduction is used, the cost of the ROM is larger than the FOM, as seen forsm and p ∈ , , In this section, we discuss the per-formance and overhead incurred when using pressio4py rather than

Pressio directlyfrom C++. We consider a reproductive scenario with an inviscid Burgers problem

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING p state error heat-ﬂux error wall-shear error ‘ -norm ‘ ∞ -norm ‘ -norm ‘ ∞ -norm ‘ -norm ‘ ∞ -norm s m

08 5.013e-04 5.621e-03 2.912e-03 3.581e-03 1.644e-03 3.586e-0316 8.354e-05 9.039e-04 1.838e-04 4.528e-04 1.796e-04 6.041e-0432 2.807e-05 1.850e-04 2.193e-04 3.115e-04 1.517e-04 3.369e-0464 1.617e-04 2.412e-03 7.174e-04 1.086e-03 4.218e-04 8.570e-04 s m

08 4.396e-04 5.478e-03 1.594e-03 3.242e-03 1.055e-03 3.478e-0316 9.335e-05 9.974e-04 7.033e-04 1.105e-03 4.197e-04 1.040e-0332 1.114e-05 1.098e-04 9.334e-05 1.529e-04 4.903e-05 1.273e-0464 4.603e-06 5.229e-05 2.660e-05 4.312e-05 1.555e-05 4.300e-05 s m

08 4.404e-04 5.210e-03 1.754e-03 2.824e-03 9.739e-04 2.741e-0316 1.356e-04 1.016e-03 1.196e-03 1.844e-03 7.411e-04 1.786e-0332 1.000e-05 1.076e-04 9.340e-05 1.461e-04 4.829e-05 1.177e-0464 1.564e-06 2.426e-05 8.238e-06 1.429e-05 4.649e-06 1.325e-05 s m

08 5.318e-04 4.589e-03 2.066e-03 3.460e-03 1.184e-03 3.178e-0316 9.858e-05 9.816e-04 7.987e-04 1.228e-03 4.781e-04 1.159e-0332 7.368e-06 9.632e-05 6.368e-05 1.025e-04 3.205e-05 7.851e-0564 9.215e-07 1.520e-05 2.509e-06 4.087e-06 1.840e-06 4.707e-06 s m

08 5.073e-04 4.563e-03 2.048e-03 2.865e-03 1.162e-03 2.546e-0316 8.075e-05 1.033e-03 6.144e-04 9.334e-04 3.705e-04 9.445e-0432 3.753e-06 1.036e-04 1.257e-05 2.557e-05 6.534e-06 2.187e-0564 9.849e-07 7.747e-06 4.059e-06 6.564e-06 3.121e-06 8.260e-06

Table 1

Relative norms of the errors for target observables between the ROM prediction and the full-order model results of the Blottner sphere problem. The metrics are computed at steady state. written in the conservative form ∂u ( x, t ) ∂t + 12 ∂ ( u ( x, t )) ∂x = α exp( βx ) ,u (0 , t ) = γ, ∀ t > , (4.1) u ( x,

0) = 1 , ∀ x ∈ [0 , , where u is the state variable, x represent the spatial variable, t is time, and α , β , γ are real-valued parameters chosen as α = β = 0 .

02 and γ = 5. We use a ﬁnite-volumespatial discretization with N cell of uniform width and Godunov upwinding. Thesemi-discrete version of the problem above takes the form of Eq. (1.1). We explore full-order models of varying dimension N = 2 k , with k ∈ { , , , , , } . Below, wepurposefully limit our attention to performance and overhead, and omit a discussionof the accuracy of the ROM which, for the problem above, can be found in, e.g., [28].We have developed two implementations of the full-order problem in (4.1), onein Python that meets the API shown in listing 2, and one in C++ using Eigen data structures that meets the API shown in listing 1. The Python implementationuses NumPy and SciPy, and is optimized with Numba decorators to translate targetloops—in the velocity and Jacobian calculation—to optimized machine code at run- http://eigen.tuxfamily.org F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Fig. 4 . Performance results for the Blottner sphere test case showing the core-hours reductionfactor of the LSPG ROM with respect to the full order model. Each data point is obtained bycomputing the geometric mean of a set of three runs. The hyper-reduced ROM cases (identiﬁed bylabels sm , sm , sm and sm ) are run on a single node, while the ROM without hyper-reduction(sm ) uses 32 nodes to distribute the mesh. time. We use OpenBLAS 0 . . built with GCC 9 . . p ∈ { , , , } . In all cases, we use A = I (the beneﬁts of hyper-reduction areoutside the scope of this section), use the initial condition u ( x,

0) = 1 as referencestate, and employ a linear trial subspace for projection. To compute the basis forGalerkin, we run the full-order model over the interval t ∈ [0 , . u ( x,

0) = 1 from each snapshot and computing the POD. To compute the basis vectorsfor LSPG, we use the same procedure except that a backward Euler time scheme isused over a time interval t ∈ [0 , . p , and we use a time step ∆ t = 5 × − , ensuring stability for all simulations.For LSPG, we solve the nonlinear least-squares problem at each time step using aGauss–Newton solver implemented in Pressio that assembles and solves the normalequations. To solve the normal equations, we use the getrf direct LU-based solverfrom LAPACK. In Python, one can directly call getrf via the low-level LAPACK scipy.linalg.lapack . In Eigen, built with LAPACK support, getrf is the implementation underlying the LU decomposition of a dense matrix.For LSPG, an important detail is the data structure type used for the spatialJacobian arising from the semi-discrete formulation of the system in (4.1). In general,spatial Jacobian operators for problems involving local information (e.g., ﬁnite diﬀer-ences and ﬁnite volumes) are sparse by deﬁnition and, thus, typically represented assparse matrices due to the smaller memory footprint. For the purposes of this analysis,one caveat is that using a sparse Jacobian implies using the native implementations inSciPy and Eigen, since we cannot rely on a common backend. For a dense Jacobian,instead, as anticipated before, we can rely on the same BLAS backend. To this end,we developed two versions of the spatial Jacobian for problem (4.1), one relying ona sparse and one on a dense matrix representation. This allows us to quantify theoverhead of pressio4py with and without biases due to diﬀerent implementations.For all dense operators, we use column-major ordering, yielding seamless compatibil-ity with the native Fortran API of BLAS/LAPACK, while for the sparse Jacobian weuse the Compressed Sparse Row (CSR) format.Figure 5 shows the results obtained for Galerkin (a), and LSPG when using asparse (b) and dense (c) matrix for the application’s spatial Jacobian. Each plotdisplays, in a semi-log scale, the time per iteration (in milliseconds) as a function ofthe mesh size, obtained with the native C++ in Pressio and pressio4py for selectedvalues of the ROM size, p . Each data point is the geometric mean of a sampleof 10 replica runs, and error bars are omitted because the statistical variability isnegligible. All runs were completed as single-threaded processes on a dual socketmachine, containing two 18-core Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz and125GB of memory.The plots show that Galerkin has the smallest time cost per iteration, while LSPGbased on a dense Jacobian has the largest time per iteration. Figure 5 (a) showsthat for Galerkin, up to about N = 4096, the C++ version perform slightly betterthan the Python one. However, as the mesh size increases, the two codes exhibitnearly identical performance. As expected, panel (b) shows that for LSPG with asparse spatial Jacobian, the diﬀerences between C++ and Python are more evident,revealing the bias due to the diﬀerent implementation. Panel (c) shows that whenLSPG is run with a dense spatial Jacobian, the diﬀerences between the two codes arenearly indistinguishable. This analysis demonstrates that the overhead incurred dueto the Python layer of pressio4py is eﬀectively negligible.

5. Conclusions.

This article presented

Pressio , an open-source project aimedat enabling leading-edge projection-based reduced order models (ROMs) for large-scale nonlinear dynamical systems. We discussed the algorithms currently supportedand those planned for upcoming releases, including the support for both linear andnonlinear trial manifolds, and the applicability to reduce both spatial and temporaldegrees of freedom for any system expressible as a parameterized set of ODEs.We described the minimal API required by

Pressio , which is a natural choicefor dynamical systems. To give an glimpse of the software design choices behind

Pressio , we highlighted the use and beneﬁts of generic programming. We discussedhow the core C++ implementation of

Pressio is complemented with Python bindingsto expose these functionalities to Python users with negligible overhead and no user-required binding code.We presented our experience with a production-level aerodynamic code, highlight-ing the minimal additions to the native code needed to construct the sample mesh and4

F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Mesh Size, N T i m e p e r i t e r a t i o n ( m s ) pressio-C++pressio4pyp=256p=128p=64p=32 (a) Mesh Size, N T i m e p e r i t e r a t i o n ( m s ) pressio-C++pressio4pyp=256p=128p=64p=32 (b) Mesh Size, N T i m e p e r i t e r a t i o n ( m s ) pressio-C++pressio4pyp=256p=128p=64p=32 (c) Fig. 5 . Figure (a): time per iteration (milliseconds) as a function of the mesh size obtained forGalerkin ROM applied to problem (4.1) using the native C++ in

Pressio and pressio4py for fourdiﬀerent ROM sizes. Figure (b) shows the corresponding results obtained for LSPG using a sparsematrix for the spatial Jacobian in the Burgers implementation, while (c) shows those obtained usinga dense Jacobian matrix. apply the LSPG ROM. We showed the substantial computational performance gainsobtainable using a sample mesh, and discussed how this should be evaluated in thecontext of a tradeoﬀ between performance and accuracy. We then showed an exam-ple using the Python bindings library, pressio4py , which has a negligible overheadcompared to using the core C++

Pressio .We designed and implemented

Pressio to be usable for both large-scale simula-tions, where performance is critical, and prototyping problems, where the focus is onassessing the accuracy of a new ROM technique. We are working on expanding the setof demonstration cases to explore advantages and disadvantages of ROMs for varioustypes of problems, further improvements to performance, investigating various samplemesh techniques, and providing a default, automated functionality to run nonlinearROMs based on convolutional autoencoders.

Acknowledgments.

The authors thank Matthew Zahr for extremely helpfuland insightful conversations that led to the inception of this project. The authorsalso thank Eric Parish (currently working on the WLS implementation), MatthewDavid Smith, and Mark Hoemmen for helpful suggestions and feedback. This paperdescribes objective technical results and analysis. Any subjective views or opinionsthat might be expressed in the paper do not necessarily represent the views of the U.S.Department of Energy or the United States Government. Supported by the Labora-tory Directed Research and Development program at Sandia National Laboratories, amultimission laboratory managed and operated by National Technology and Engineer-ing Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International,Inc., for the U.S. Department of Energy’s National Nuclear Security Administration

ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING

REFERENCES[1]

C. Bach, L. Song, T. Erhart, and F. Duddeck , Stability conditions for the explicit inte-gration of projection based nonlinear reduced-order and hyper reduced structural mechan-ics ﬁnite element models , arXiv e-prints, (2018), arXiv:1806.11404, p. arXiv:1806.11404,https://arxiv.org/abs/1806.11404.[2]

S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin,A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley,D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith,S. Zampini, H. Zhang, and H. Zhang , PETSc users manual

S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith , Eﬃcient management of parallel-ism in object oriented numerical software libraries , in Modern Software Tools in ScientiﬁcComputing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkh¨auser Press, 1997,pp. 163–202.[4]

M. Baumann, P. Benner, and J. Heiland , Space-time Galerkin POD with applicationin optimal control of semi-linear parabolic partial diﬀerential equations , arXiv preprintarXiv:1611.04050, (2016).[5]

B. A. Belson, J. H. Tu, and C. W. Rowley , Algorithm 945: Modred — A ParallelizedModel Reduction Library , ACM Trans. Math. Softw., 40 (2014), pp. 30:1–30:23, https://doi.org/10.1145/2616912.[6]

P. Blonigan, K. Carlberg, F. Rizzi, M. Howard, and J. Fike , Model reduction for hy-personic aerodynamics via conservative LSPG projection and hyper-reduction , in AIAASCITECH Forum, vol. 1, Orlando, FL, 2020, AIAA.[7]

F. G. Blottner , Accurate Navier-Stokes results for the hypersonic ﬂow over a sphericalnosetip , Journal of Spacecraft and Rockets, 27 (1990), pp. 113–122, https://doi.org/10.2514/3.26115.[8]

M. Brand , Incremental Singular Value Decomposition of Uncertain Data with Missing Values ,in Computer Vision — ECCV 2002, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen,eds., Berlin, Heidelberg, 2002, Springer Berlin Heidelberg, pp. 707–720.[9]

K. Carlberg, M. Barone, and H. Antil , Galerkin v. least-squares PetrovGalerkin projectionin nonlinear model reduction , Journal of Computational Physics, 330 (2017), pp. 693 – 734,https://doi.org/10.1016/j.jcp.2016.10.033.[10]

K. Carlberg, C. Bou-Mosleh, and C. Farhat , Eﬃcient non-linear model reduction viaa least-squares PetrovGalerkin projection and compressive tensor approximations , Inter-national Journal for Numerical Methods in Engineering, 86 (2011), pp. 155–181, https://doi.org/10.1002/nme.3050.[11]

K. Carlberg, C. Farhat, J. Cortial, and D. Amsallem , The GNAT method for nonlinearmodel reduction: Eﬀective implementation and application to computational ﬂuid dynam-ics and turbulent ﬂows , Journal of Computational Physics, 242 (2013), pp. 623 – 647,https://doi.org/10.1016/j.jcp.2013.02.028.[12]

K. Carlberg, R. Tuminaro, and P. Boggs , Preserving Lagrangian structure in nonlinearmodel reduction with application to structural dynamics , SIAM Journal on Scientiﬁc Com-puting, 37 (2015), pp. B153–B184.[13]

K. T. Carlberg, Y. Choi, and S. Sargsyan , Conservative model reduction for ﬁnite-volumemodels , Journal of Computational Physics, 371 (2018), p. 280314.[14]

H. Carter Edwards, C. R. Trott, and D. Sunderland , Kokkos , J. Parallel Distrib. Com-put., 74 (2014), pp. 3202–3216, https://doi.org/10.1016/j.jpdc.2014.07.003.[15]

S. Chaturantabut, C. Beattie, and S. Gugercin , Structure-preserving model reduction fornonlinear port-hamiltonian systems , SIAM Journal on Scientiﬁc Computing, 38 (2016),pp. B837–B865.[16]

Y. Choi and K. Carlberg , Space–Time Least-Squares Petrov–Galerkin Projection for Non-linear Model Reduction , SIAM Journal on Scientiﬁc Computing, 41 (2019), pp. A26–A58,https://doi.org/10.1137/17M1120531.[17]

Daversin, C., Veys, S., Trophime, C., and Prud´homme, C. , A reduced basis framework:Application to large scale non-linear multi-physics problems , ESAIM: Proc., 43 (2013),pp. 225–254, https://doi.org/10.1051/proc/201343015.[18]

C. Farhat, T. Chapman, and P. Avery , Structure-preserving, stability, and accuracy prop-erties of the energy-conserving sampling and weighting method for the hyper reduction of F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG nonlinear ﬁnite element dynamic models , International Journal for Numerical Methods inEngineering, 102 (2015), pp. 1077–1110.[19]

C. Gu and J. Roychowdhury , Model reduction via projection onto nonlinear manifolds, withapplications to analog circuits and biochemical systems , in 2008 IEEE/ACM InternationalConference on Computer-Aided Design, IEEE, 2008, pp. 85–92.[20]

M. Hess and G. Rozza , ITHACA-SEM - In real Time Highly Advanced ComputationalApplications with Spectral Element Methods - Reduced Order Models for Nektar++ .https://github.com/mathLab/ITHACA-DG, 2019.[21]

J. S. Hesthaven, G. Rozza, and B. Stamm , Certiﬁed Reduced Basis Methods for ParametrizedPartial Diﬀerential Equations , SpringerBriefs in Mathematics, Springer International Pub-lishing, 2015, https://gitlab.com/RBniCS/RBniCS.[22]

A. C. Hindmarsh, P. N. Brown, K. E. Grant, S. L. Lee, R. Serban, D. E. Shumaker,and C. S. Woodward , SUNDIALS: Suite of nonlinear and diﬀerential/algebraic equationsolvers , ACM Transactions on Mathematical Software (TOMS), 31 (2005), pp. 363–396.[23]

P. Holmes, J. Lumley, and G. Berkooz , Turbulence, Coherent Structures, Dynamical Sys-tems and Symmetry , Cambridge University Press, 1996.[24]

M. Howard, A. Bradley, S. W. Bova, J. Overfelt, R. Wagnild, D. Dinzl, M. Hoemmen,and A. Klinvex , Towards Performance Portability in a Compressible CFD Code , in 23rdAIAA Computational Fluid Dynamics Conference, vol. 1, Denver, CO, 2017, AIAA, https://doi.org/10.2514/6.2017-4407.[25]

W. Jakob, J. Rhinelander, and D. Moldovan , pybind11 – Seamless operability betweenC++11 and Python , 2017. https://github.com/pybind/pybind11.[26] B. S. Kirk, J. W. Peterson, R. H. Stogner, and G. F. Carey , libMesh : A C++ Library forParallel Adaptive Mesh Reﬁnement/Coarsening Simulations , Engineering with Computers,22 (2006), pp. 237–254, https://doi.org/10.1007/s00366-006-0049-3.[27] D. J. Knezevic and J. W. Peterson , A high-performance parallel implementation of thecertiﬁed reduced basis method , Computer Methods in Applied Mechanics and Engineering,200 (2011), pp. 1455–1466, https://doi.org/10.1016/j.cma.2010.12.026.[28]

K. Lee and K. T. Carlberg , Model reduction of dynamical systems on nonlinear mani-folds using deep convolutional autoencoders , Journal of Computational Physics, 404 (2020),p. 108973.[29]

R. W. MacCormack , Carbuncle Computational Fluid Dynamics Problem for Blunt-BodyFlows , Journal of Aerospace Information Systems, 10 (2013), pp. 229–239, https://doi.org/10.2514/1.53684.[30]

D. S. Medina, A. St-Cyr, and T. Warburton , OCCA: A uniﬁed approach to multi-threadinglanguages , arXiv preprint arXiv:1403.0968, (2014).[31]

R. Milk, S. Rave, and F. Schindler , pyMOR - Generic Algorithms and Interfaces for ModelOrder Reduction , SIAM Journal on Scientiﬁc Computing, 38 (2016), pp. S194–S216, https://doi.org/10.1137/15M1026614.[32] D. R. Musser and A. A. Stepanov , Generic programming , in Proceedings of the InternationalSymposium ISSAC’88 on Symbolic and Algebraic Computation, ISAAC ’88, London, UK,UK, 1989, Springer-Verlag, pp. 13–25, http://dl.acm.org/citation.cfm?id=646361.690581.[33]

E. J. Parish and K. T. Carlberg , Windowed least-squares model reduction for dynamicalsystems , arXiv e-prints, (2019), arXiv:1910.11388, p. arXiv:1910.11388, https://arxiv.org/abs/1910.11388.[34]

V. Puzyrev, M. Ghommem, and S. Meka , pyROM: A computational framework for reducedorder modeling , Journal of Computational Science, 30 (2019), pp. 157 – 173, https://doi.org/10.1016/j.jocs.2018.12.004.[35] C. Rowley , Model reduction for ﬂuids, using balanced proper orthogonal decomposition , Int.J. on Bifurcation and Chaos, 15 (2005), pp. 997–1013.[36]

G. Rozza , ITHACA-DG - In real Time Highly Advanced Computational Applicationsfor Discontinuous Galerkin - ROMs for HopeFOAM . https://github.com/mathLab/ITHACA-DG, 2019.[37]

P. J. Schmid , Dynamic mode decomposition of numerical and experimental data , Journal ofﬂuid mechanics, 656 (2010), pp. 5–28.[38]

G. Stabile, S. Hijazi, A. Mola, S. Lorenzi, and G. Rozza , POD-Galerkin reduced ordermethods for CFD using Finite Volume Discretisation: vortex shedding around a circularcylinder , Communications in Applied and Industrial Mathematics, 8 ((2017)), pp. 210–236,https://doi.org/10.1515/caim-2017-0011, https://mathlab.sissa.it/ITHACA-FV.[39]

G. Stabile and G. Rozza , Finite volume POD-Galerkin stabilised reduced order methodsfor the parametrised incompressible Navier-Stokes equations , Computers & Fluids, (2018),https://doi.org/10.1016/j.compﬂuid.2018.01.035, https://mathlab.sissa.it/ITHACA-FV.ROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING [40] K. Urban and A. T. Patera , A new error bound for reduced basis approximation of parabolicpartial diﬀerential equations , Comptes Rendus Mathematique, 350 (2012), pp. 203–207.[41]

K. M. Washabaugh , Fast Fidelity for Better Design: A Scalable Model Order ReductionFramework for Steady Aerodynamic Design Applications , PhD dissertation, Stanford Uni-versity, Department of Aeronautics and Astronautics, Aug. 2016.[42]

K. Willcox and J. Peraire , Balanced model reduction via the proper orthogonal decomposi-tion , AIAA Journal, 40 (2002), pp. 2323–2330.[43]

M. Yano , A space-time Petrov–Galerkin certiﬁed reduced basis method: Application to theboussinesq equations , SIAM Journal on Scientiﬁc Computing, 36 (2014), pp. A232–A266.[44]

M. Yano , Discontinuous Galerkin reduced basis empirical quadrature procedure for model re-duction of parametrized nonlinear conservation laws , Advances in Computational Mathe-matics, (2019), pp. 1–34.

UPPLEMENTARY MATERIALS: PRESSIO: ENABLINGPROJECTION-BASED MODEL REDUCTION FOR LARGE-SCALENONLINEAR DYNAMICAL SYSTEMS ∗ FRANCESCO RIZZI † , PATRICK J. BLONIGAN ‡ , AND

KEVIN T. CARLBERG ‡§ SM1. Software design overview.

In this section, we provide an overview ofkey software engineering aspects of

Pressio aiming at (a) describing how we abstractmathematical concepts into classes; (b) providing a coarse-grained overview of thevarious components and their connections; and (c) explaining the design choices be-hind the API required by

Pressio . To convey the main ideas clearly and concisely,we describe at a high level representative classes needed for a Galerkin ROM withan explicit time integration, deliberately omitting overly ﬁne-grained details aboutclasses, methods and syntax, which we believe would be more distracting than useful.We refer the reader interested in the full implementation details to the source code of

Pressio and its documentation directly.Figure SM1 shows a (simpliﬁed) UML (Uniﬁed Modeling Language) class diagramof the main classes needed for a Galerkin ROM using an explicit time stepping scheme.For the sake of readability, wherever possible, we denote arguments and variables withthe same notation used in § DefaultProblem class.This class template represents the default Galerkin problem, namely Eq. 2.6 in the pa-per with A = I . It needs as template arguments a (tag) type stepperTag , and typesfor the ROM state, adapter class and decoder. stepperTag is used for tag dispatch-ing, and can be chosen from a set of options, e.g., ode::explicitmethods::Euler or ode::explicitmethods::BDF2 . The variadic nature of the class facilitates the usersince the template arguments can be passed in arbitrary order. We use metapro-gramming to expand the pack and detect the needed types according to speciﬁcconditions. For example, a type is detected as a valid adapter class type onlyif it satisﬁes the API discussed in listing 1 in the paper. Other conditions areused to detect the state and decoder types. If Pressio does not ﬁnd admissibletemplate arguments, a compile-time error is thrown. An object instantiated fromthe class template

DefaultProblem is responsible for constructing a policy of type

DefaultVelocityPolicy , a stepper object of type

ExplicitStepper , and a recon-structor object of type

FomStateReconstructor . The policy deﬁnes how the right-hand-side (velocity) of the Galerkin system is computed, the stepper deﬁnes how toadvance the system over a single time step, and the reconstructor holds the informa-tion and data (i.e., the decoder) to map a given ROM state vector to a FOM modelstate. Once the Galerkin problem is instantiated, the contained stepper object canbe used with one of the integrators available inside

Pressio / ode .The UML diagram highlights that the application only interacts with the policy ∗ Submitted to the editors February 6th, 2020. † NexGen Analytics, 30N Gould St. Ste 5912, Sheridan, WY, 82801, USA ([email protected], [email protected]). ‡ Department of Extreme-Scale Data Science and Analytics, Sandia National Laboratories, Liver-more, CA, 94550, USA ([email protected], [email protected]). § Departments of Applied Mathematics and Mechanical Engineering, University of Washington,Seattle, WA 98195, USA SM1 a r X i v : . [ c s . M S ] A p r M2 F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

PressioAPPLICATIONoderom 11111 1

ExplicitStepper // scalarT, velocityPolicyT are extracted from the pack ...args...– policyObj : velocityPolicyT+ doStep(ˆ x : stateT, t : scalarT, dt : scalarT, step : stepT) : voidstepperTag, stateT, ...args Decoder ...+ applyMapping(ˆ x : romStateT, ˜ x : fomStateT) : void { // compute ˜ x = g (ˆ x ) } ...args FomStateReconstructor – x ref : fomStateT– decoderObj : decoderT+ operator()(ˆ x : romStateT, ˜ x : fomStateT) : void { // call decoderObj.applyMapping(ˆ x , ˜ x ) and do ˜ x + = x ref } fomStateT, decoderT galerkin::explicit::DefaultVelocityPolicy using fomStateT = typename adapterT::state type;using fomVelocityT = typename adapterT::velocity type;– ˜ x : fomStateT– f : fomVelocityT– fomReconstructor : FomStateReconstructor < fomStateT, decoderT > + operator()(ˆ x : romStateT, ˆ f : romStateT, fomObj : adapterT, t : scalarT ) : void { // query fomObj to compute f and then compute ˆ f } adapterT, decoderT galerkin::DefaultProblem // adapterT, decoderT are extracted from the pack ...argsusing policyT = galerkin::explicit::DefaultVelocityPolicy < adapterT, decoderT > ;using stepperT = ode::ExplicitStepper < stepperTag, romStateT, policyT > ;...– stepperObj : stepperT+ getStepper() : stepperT stepperTag, romStateT, ...args Adapter // typedefs: scalar type, state type, velocity type, dense matrix type...– problemObj : Application// for an explicit scheme, we only need velocity methods+ velocity(˜ x : state type, t : scalar type) : velocity type+ velocity(˜ x : state type, t : scalar type, f : velocity type) : void Application ......

Fig. SM1 . UML class diagram of the main classes deﬁning a Galerkin ROM using an explicittime stepping scheme. For the sake of readability, we adopt a few simpliﬁcations: (a) whereverpossible we use the same notation used in § <> to represent templates, and theC++ keyword using to represent type aliases in classes, respectively. These minimal simpliﬁcationsallow us to make the diagram substantially more readable and to highlight the interdependencies ina clearer way. UPPLEMENTARY MATERIALS: PROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING

SM3class. The policy object, in fact, calls the velocity method inside the adapter, whichin turn holds all the information about the FOM problem. Minimizing the number ofdirect connections between the application and

Pressio yields a much leaner structureless prone to errors. The policy-based design seems a suitable choice allowing us tocreate new problems by only operating at a low level while keeping the interactionbetween the various components and the workﬂow intact. This is in fact how wesupport variations of the basic Galerkin: diﬀerent policy (and problem) classes canbe implemented encapsulating diﬀerent ways to compute the right-hand-side shownin the formulations in Eq. 2.6.

SM1.1. Example main ﬁle for Galerkin ROM.

The UML diagram descrip-tion above enabled us to reason about the global structure and interdependenciesbetween the classes. This section presents a sample main ﬁle and a description of howto create and run a Galerkin ROM. Listing 1 shows the skeleton of a C++ main ﬁleto instantiate and run a default Galerkin ROM with forward Euler. We highlight thefollowing parts. On line 10, the user creates an object of the adapter class (which,underneath, is expected to create a full problem of the target application). A decoderobject is created on lines 15–17. In this case, the listing shows a linear decoder, whichis deﬁned by its Jacobian. On line 20, a reference state for the full order model isdeﬁned, and on line 23 the ROM state is deﬁned. The actual Galerkin problem isinstantiated on lines 26–30. The reader can cross-reference this class and its templatearguments with the UML diagram in Figure SM1. Finally, after the problem is cre-ated, the actual time integration is run on line 34. For simplicity, we show here howto integrate for a target number of steps, but other integration options are available.

Pressio , in fact, is implemented such that the concept of a stepper (whose responsi-bility is to only perform a single step) is separated from how the stepper is used toadvance in time. This allows us to easily support diﬀerent integrators (e.g. support-ing variable time steps, running until a ﬁxed time is reached, or until a user-providedcondition is met).

SM1.2. Software Design for LSPG.

Figure SM2 shows a simpliﬁed UML di-agram of the main classes composing an unsteady LSPG problem. The

LSPGProblem class is responsible for constructing an object of the policy class

LSPGResidualPolicy ,one of the policy class

LSPGJacobianPolicy , one of the stepper class

ImplicitStepper ,and one of the class

FomStateReconstructor . In this case, due to the implicit timeintegration, we need two policies, one for the time-discrete residual and one for thetime discrete Jacobian. The stepper deﬁnes how to advance the system over a singletime step, and the reconstructor holds the information and data (i.e. the decoder) tomap a given ROM state vector to a full-order model state. Once the LSPG problemis created, the associated stepper object is used to advance the ROM solution in time.To this end, given the implicit nature of the stepper, a solver object is needed asan argument to the doStep to solve the nonlinear system at a given time step. ForLSPG, the solver can be instantiated from one of the least-squares solver classes insidethe

Pressio / solvers currently supporting various types of Gauss-Newton nonlinearsolvers, e.g. using the normal equations or QR factorization with an optional linesearch. Alternatively, the user can provide a custom solver object. At compile-time, Pressio checks that the solver type is admissible by verifying that it satisﬁes a speciﬁcAPI, e.g. having a public method solve accepting speciﬁc arguments.The workﬂow is thus as follows: the time integrator (instantiated in the main, notshown here) chosen by the user calls stepperObj.doStep , which triggers the solvervia solver.solve , which in turn triggers a callback to stepperObj.residual andM4

F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

Listing 1

Sample C++ main to create and run (explicit) Galerkin using

Pressio . int main ( int argc , char * argv []){ using adapterT = /* adapter class type */ ; using scalarT = typename adapterT :: scalar_t ; using fomStateT = typename adapterT :: state_t ; using decoderJacT = typename adapterT :: dmatrix_t ; using romStateT = /* ROM state type */ ; using namespace pressio ; // create app or adapter object adapterT fomObj ( argc , argv , /* other args needed */ ); const std :: size_t romSize = /* set the Rom size */ ; // create the decoder , e.g. linear decoder decoderJacT phi = /* load / compute the decoder ’s Jacobian */ ; using decoderT = rom :: LinearDecoder < decoderJacT >; decoderT decoderObj ( phi ); // set the FOM reference state fomStateT xRef = /* load / compute the fom reference state */ ; // define ROM state romStateT romState ( romSize ); // create the Galerkin problem using stepperTag = ode :: explicitmethods :: Euler ; using rom :: galerkin :: DefaultProblem ; using galerkinT = DefaultProblem < stepperTag , romStateT , adapterT , decoderT >; galerkinT galerkinProb ( fomObj , xRef , decoderObj , romState ); // advance in time (t0 , dt , Nsteps must be defined somewhere ) // or one can use a different integrator ode :: integrateNSteps ( galerkinProb , romState , t0 , dt , Nsteps ); // map the final ROM state to the corresponding fom auto xFomFinal = galerkinProblem . fomReconstructor ()( romState ); return 0; } stepperObj.jacobian to compute the time-discrete quantities. The residual and jacobian methods in the ImplicitStepper do not hold any implementation details,because they simply call the compute method of the corresponding policy objects.Similarly to the Galerkin problem discussed before, the classes for LSPG are designedsuch that only the policies hold references to and interact with the application. Thispolicy-based design allows us to create various LSPG problems by only changing thepolicies, without altering the overall workﬂow.

UPPLEMENTARY MATERIALS: PROJECTION-BASED MODEL REDUCTION VIA GENERIC PROGRAMMING

SM5

PressioAPPLICATIONoderom 1 1111 11 11

ImplicitStepper //residualT, jacobianT, and policies’ types are detected from pack ... args– residualPolicy : residualPolicyT– jacobianPolicy : jacobianPolicyT...+ doStep(ˆ x : stateT, t : scalarT, dt : scalarT, step : stepT, solver : solverT) : void+ residual(ˆ x : stateT, r n : residualT) : void+ jacobian(ˆ x : stateT, ∂ r n /∂ ˆ x : jacobianT) : void stepperTag, stateT, ...args Decoder ...+ applyMapping(ˆ x : romStateT, ˜ x : fomStateT) : void...args FomStateReconstructor – x ref : fomStateT– decoderObj : decoderT+ operator()(ˆ x : romStateT, ˜ x : fomStateT) : voidfomStateT, decoderT LSPGResidualPolicy ...+ compute(ˆ x : romStateT, r n : residualT, fomObj : adapterT, xPrev : statesT, t : scalarT, dt : scalarT) : void { // call velocity method on fomObj and compute time-discrete residual r n } adapterT, decoderT, ...args LSPGJacobianPolicy ...+ compute(ˆ x : romStateT, ∂ r n /∂ ˆ x : jacobianT, fomObj : adapterT, t : scalarT, dt : scalarT) : void { // call applyJacobian method on fomObj and compute time-discrete jacobian ∂ r n /∂ ˆ x } adapterT, decoderT, ...args LSPGProblem // adapterT, decoderT: extracted from the pack ...argsusing residualPolicyT = LSPGResidualPolicy < adapterT, decoderT > ;using jacobianPolicyT = LSPGJacobianPolicy < adapterT, decoderT > ;using stepperT = ode::ImplicitStepper < stepperTag, romStateT, residualPolicyT, jacobianPolicyT > ;...– stepperObj : stepperT+ getStepper() : stepperT stepperTag, romStateT, ...args Adapter //scalar t is a public type alias for the scalar type//state t, velocity t are public aliases for full-order state and velocity//dense matrix t is a public alias for full-order dense matrix...– problemObj : Application+ velocity(˜ x : state t, t : scalar t, f : velocity t) : void+ applyJacobian(˜ x : state t, B : dense matrix t, t : scalar t, A : dense matrix t) : void+ velocity(˜ x : state t, t : scalar t) : velocity t+ applyJacobian(˜ x : state t, B : dense matrix t, t : scalar t) : dense matrix t... Application

Fig. SM2 . UML class diagram of the basic structure of an LSPG problem. We use the samenotation and simpliﬁcations as in Figure SM1. M6 F. RIZZI, P. J. BLONIGAN, AND K. T. CARLBERG

SM2. Repositories and Versions.

The project and its components are acces-sible at: • Project url: https://github.com/Pressio • C++ source code: https://github.com/Pressio/pressio • pressio4py source code: https://github.com/Pressio/pressio4py SM2.1. Code version used for § The version of

Pressio used to generatethe results in § tag to mark the commit in our development. The procedure is as follows:1. Clone the core C++ Pressio repository: git clone https://github.com/Pressio/pressio.git

2. Checkout the version used by doing: git checkout siscBlottnerSphere

SM2.2. Code versions used for § The version of

Pressio and pressio4py usedto generate the results in § Pressio repository: git clone https://github.com/Pressio/pressio.gitgit checkout siscBurgers1d

2. Clone pressio4py : git clone https://github.com/Pressio/pressio4py.gitgit checkout siscBurgers1d

3. Clone the code for Burgers1d:3. Clone the code for Burgers1d: