Fully Parallel Mesh I/O using PETSc DMPlex with an Application to Waveform Modeling
Vaclav Hapla, Matthew G. Knepley, Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Andreas Fichtner
Submitted to the editors on April 18, 2020. Affiliations: ETH Zurich, Switzerland (Hapla, Afanasiev, Boehm, van Driel, Krischer, Fichtner); University at Buffalo, NY, USA (Knepley).

Abstract.
Large-scale PDE simulations using high-order finite-element methods on unstructured meshes are an indispensable tool in science and engineering. The widely used open-source PETSc library offers an efficient representation of generic unstructured meshes within its DMPlex module. This paper details our recent implementation of parallel mesh reading and topological interpolation (computation of intermediate codimensions from a cell-vertex mesh) into DMPlex. We apply these developments to seismic wave propagation scenarios on Mars as a sample application. The principal motivation is to overcome single-node memory limits and reach mesh sizes which were impossible before. Moreover, we demonstrate that the scalability of I/O and topological interpolation goes beyond 12'000 cores, and memory-imposed limits on the maximum mesh size vanish.
Key words. unstructured mesh, directed acyclic graph, partitioning, topological interpolation, parallel I/O, PETSc, DMPlex, seismic waveform modeling, Spectral Element Method
AMS subject classifications.
1. Introduction.
Finite-element methods (FEM) [37, 24, 51] are widely used in science and engineering to simulate complex physical systems. Many applications require FEM discretization with high polynomial order on large unstructured meshes, requiring distributed memory computer architectures [1, 14, 46, 34, 2]. Here, two complications may arise:

(1) The distributed memory mesh representation should explicitly include mesh entities of all codimensions (vertices, edges, faces, cells). However, mesh datafiles typically store only vertices and cells to avoid redundant storage and disk operations, and to conform to widely used formats. Edges and faces can be computed at runtime using a process which will be called here topological interpolation.

(2) Meshing tools typically create a single datafile, but loading the whole mesh onto one processor and distributing it onto all remaining processors becomes a bottleneck for a sufficiently large mesh. Instead, we need to load different portions of the mesh file directly onto target processors and maintain a distributed mesh representation right from the beginning. Additional load balancing and redistribution can be done using a parallel partitioner.

PETSc DMPlex (section 2) is a flexible mesh implementation which addresses (1) by design, i.e., it can represent any number of codimensions and implements topological interpolation. Moreover, it is completely agnostic to the mesh shape, dimensionality and cell types. However, only with the developments presented in this paper has (2) also been addressed. This in turn brought another challenge: to implement the topological interpolation from (1) in parallel.

In this paper, we do not deal with hierarchical mesh representations such as octree data structures [20, 47]. They are advantageous for certain data access patterns, especially in the context of Adaptive Mesh Refinement (AMR) [20, 12], at the expense of generality; DMPlex supports, for instance, arbitrary mesh partitions and extraction of arbitrary subsets of cells (or facets) as submeshes: features which are typically missing from hierarchical meshing frameworks [26]. Note that PETSc offers the DMForest wrapper of p4est [13, 25, 46], and conversion between DMPlex and DMForest [26].

We demonstrate the efficiency and potential of our new DMPlex functionality on a real-world application in the context of full waveform modeling. The spectral-element method on unstructured conforming hexahedral meshes has become the de-facto standard for global-scale simulations of seismic waves [1, 18, 19, 44]. We apply it to simulate full 3D high-frequency wave propagation on Mars, based on data from the NASA InSight mission [6]. This consists in solving a coupled system of the elastic and acoustic wave equations. To accurately model these data in the desired frequency band, large scale simulations are required. In the presented simulation, more than 100 million 4-th order hexahedral mesh elements were used.

Before the developments presented in this manuscript, with mesh reading and topological interpolation being serial, mesh sizes were limited by the single-node memory. On our testing platform, the Piz Daint supercomputer at the Swiss National Supercomputing Centre, the mesh size limit was approximately 16 million elements, rendering such simulations impossible.

The manuscript is organized as follows.
First, we describe abstractions for mesh data management of unstructured meshes in high-order finite-element discretizations using PETSc DMPlex. Then we explain our new strategies for the parallel simulation startup (mesh reading, topological interpolation and redistribution) on distributed memory HPC architectures. We continue with a brief introduction of the spectral-element method and its implementation which uses DMPlex. Next, waveform modeling benchmarks demonstrate scalable performance of the parallel startup for up to 256 million hexahedral mesh elements, running on up to 1024 Cray XC50 nodes of Piz Daint. Finally, the mentioned Mars seismic wave propagation simulation is presented as a practical application.
2. DMPlex.
PETSc [3, 4, 5, 10] is a well-known library for numerical methods, used by the scientific and engineering computing communities. It provides parallel data management, structured and unstructured meshes, linear and nonlinear algebraic solvers and preconditioners, time integrators, optimization algorithms and others. Many of these methods (such as geometric multigrid and domain decomposition solvers) can take advantage of the geometric/topological setting of a discretized problem, i.e., mesh information.

DMPlex is a PETSc module for generic unstructured mesh storage and operations. It decouples user applications from the implementation details of common mesh-related utility tasks, such as file I/O and mesh partitioning. It represents the mesh topology in a flexible way, providing topological connectivity of mesh entities at all codimensions (vertices, edges, faces and cells), crucial for high-order finite-element method (FEM) simulations, and provides a wide range of common mesh management functionalities.

PETSc's interface for serving mesh data to numerical algorithms is the DM object. PETSc has several DM implementations. The native implementations of structured grids (DMDA), staggered grids (DMStag), and unstructured meshes (DMPlex) have the most complete coverage of the DM API, and are developed most actively. Besides these, PETSc has several DM implementations that wrap external libraries, such as DMMOAB for MOAB [48] and DMFOREST for p4est [13]. Here we will focus on DMPlex, which proved to be most relevant for the discussed waveform modeling applications.
DMPlex encapsulates the topology of unstructured grids and provides a wide range of operations on it.
2.1. Mesh representation. DMPlex uses an abstract representation of the unstructured mesh topology, where the connectivity of topological entities (vertices, edges, faces, cells) is stored as a directed acyclic graph (DAG) [38, 32], also referred to as a Hasse diagram in topology. Let us call this representation a plex and refer to the topological mesh entities as points. The plex is constructed of clearly defined layers (strata) that enable access to mesh entities by their topological codimension without explicit reference to the overall topological dimension d of the mesh, see Figure 1(b). Note also that d can be reconstructed in linear time from an unlabeled plex using Depth First Search. Let us denote a set of points at the same stratum as Stratum(h), where h is an integer height, 0 <= h <= d.

All points in the plex share a single consecutive numbering, emphasizing that each point is treated equally no matter its shape or dimension, and allowing DMPlex to store the graph connectivity in a single array where dimensional layers are defined as consecutively numbered subranges. The directional connectivity of the plex is defined by the covering relation called cone, denoted here as C(p), yielding a set of all points directly connected to p in the next codimension. The transitive closure of the cone operation shall be denoted by C+(p). Both C(p) and C+(p) are illustrated in Figure 1(d). The dual operation called support, denoted S(p), and its transitive closure S+(p) are shown in Figure 1(e).

In addition to the abstract topology data, PETSc provides two utility objects to describe the parallel data layout: a PetscSection object maps the graph-based topology information to discretized solution data through an offset mapping, and a star forest (PetscSF, see subsection 3.2.1) object holds a one-sided description of shared data in parallel. These data layout mappings allow DMPlex to manage distributed solution data by automating the preallocation of distributed vector and matrix data structures and performing halo data exchanges. Moreover, by storing grid topology alongside discretized solution data, DMPlex is able to provide the mappings required for sophisticated preconditioning algorithms, such as geometric multigrid methods [11, 17] and multiblock, or "fieldsplit", preconditioning for multiphysics problems [9].

Fig. 1: DMPlex mesh representation and basic relations. (a) original 2D mesh; (b) plex representation of (a), having 3 strata (h = 2: vertices, h = 1: edges, h = 0: faces); (c) coloring of Stratum(h), h = 0, 1, 2; (d) cone C(0) and its transitive closure C+(0); (e) support S(5) and its transitive closure S+(5). A 1D mesh would have 2 strata: h = 0 edges, h = 1 vertices. A 3D mesh would have 4 strata: h = 0 cells, h = 1 faces, h = 2 edges, h = 3 vertices. Note we can also refer to the h = 1 entities commonly, in a dimension-independent way, as facets, and h = 0 as cells. The entities at h = d, where d is the topological dimension, are always vertices.
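To make these relations concrete, the following sketch (not taken from the paper's code; it assumes an already created and populated DMPlex object dm and a valid point p) queries the cone, support and transitive closure of a point through PETSc's public API:

#include <petscdmplex.h>

/* Illustrative sketch: print C(p), S(p) and the transitive closure C+(p)
   of a plex point p; dm is assumed to be an existing DMPlex object. */
static PetscErrorCode InspectPoint(DM dm, PetscInt p)
{
  const PetscInt *cone, *support;
  PetscInt        coneSize, supportSize, clSize, *closure = NULL, i;
  PetscErrorCode  ierr;

  PetscFunctionBeginUser;
  ierr = DMPlexGetConeSize(dm, p, &coneSize);CHKERRQ(ierr);
  ierr = DMPlexGetCone(dm, p, &cone);CHKERRQ(ierr);                 /* C(p) */
  ierr = DMPlexGetSupportSize(dm, p, &supportSize);CHKERRQ(ierr);
  ierr = DMPlexGetSupport(dm, p, &support);CHKERRQ(ierr);           /* S(p) */
  for (i = 0; i < coneSize; ++i) {
    ierr = PetscPrintf(PETSC_COMM_SELF, "C(%D,%D) = %D\n", p, i, cone[i]);CHKERRQ(ierr);
  }
  for (i = 0; i < supportSize; ++i) {
    ierr = PetscPrintf(PETSC_COMM_SELF, "S(%D)[%D] = %D\n", p, i, support[i]);CHKERRQ(ierr);
  }
  /* Transitive closure C+(p): returned as (point, orientation) pairs. */
  ierr = DMPlexGetTransitiveClosure(dm, p, PETSC_TRUE, &clSize, &closure);CHKERRQ(ierr);
  for (i = 0; i < clSize; ++i) {
    ierr = PetscPrintf(PETSC_COMM_SELF, "closure point %D, orientation %D\n",
                       closure[2*i], closure[2*i+1]);CHKERRQ(ierr);
  }
  ierr = DMPlexRestoreTransitiveClosure(dm, p, PETSC_TRUE, &clSize, &closure);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}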
2.2. Topological interpolation. For the high-order methods we are interested in, we need mesh entities of all codimensions (vertices, edges, faces, cells) to be present in the in-memory mesh representation. However, usual mesh generation algorithms or mesh file readers result in a mesh with cells and vertices only, while edges and faces need to be inferred at runtime. This is accomplished by the so-called topological interpolation.

Topological interpolation is the process of constructing intermediate levels of the ranked poset describing a mesh, given information at bracketing levels. For example, if we receive triangles and their covering vertices, as in Figure 2, interpolation will construct edges. The first algorithm for interpolation on the Hasse diagram was published in [38], but this version is only appropriate for simplices, ignores orientation of the mesh points, and did not give a complexity bound.

Fig. 2: Sequential topological interpolation: original mesh and interpolated mesh in classical and plex representation. (a) original mesh; (b) mesh (a) interpolated; (c) plex representation of (a); (d) plex representation of (b).

The interpolation procedure selects a given point stratum as cells, for which it will construct facets. It iterates over the cells, whose cones are an oriented set of vertices. The two essential operations are to extract an oriented facet from lower-dimensional points, and then attach it to a cell with the correct orientation. Orientation of mesh points is detailed in subsection 2.3. In order to enumerate the facets for a given cell type, we have DMPlexGetRawFaces_Internal() which returns an oriented list of vertices for each facet of the input cell.

An initial iteration over cells constructs all facets, and enters them into a hash table, where the hash key is the sorted list of vertices in each face. We need one pass for each type of face. Once the hash table is constructed, we know the number of new facets to be inserted, and can allocate a new plex. The new plex is identical to the old, except that it has a new face stratum, and the cone sizes of cells need to be calculated anew. For example, hexahedral cells have 8 vertices, but 6 facets, so the cone size would change from 8 to 6.

A second iteration over cells inserts the facets. We clear the hash table and repeat the face extraction above. If a face is missing from the table, we insert it into the table, record its cone, and also insert it into the cell cone with default orientation. If instead it is present in the table, we insert it into the cell cone with orientation computed from comparing the face cone with that returned from DMPlexGetRawFaces_Internal(). Again, we need one pass for each face type.
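The following self-contained sketch illustrates the first pass of this procedure on a two-triangle mesh; it is not PETSc's internal implementation. For brevity it deduplicates candidate facets by sorting rather than by the hash table described above, and the cell-vertex numbering is only assumed to match Figure 2(a).

#include <stdio.h>
#include <stdlib.h>

/* Facet of a triangle = edge, identified by its sorted vertex pair. */
typedef struct { int v[2]; } Edge;

static int cmpEdge(const void *a, const void *b)
{
  const Edge *x = (const Edge *)a, *y = (const Edge *)b;
  if (x->v[0] != y->v[0]) return x->v[0] - y->v[0];
  return x->v[1] - y->v[1];
}

int main(void)
{
  /* Two triangles sharing an edge: cells 0,1 and vertices 2..5
     (assumed connectivity, for illustration only). */
  const int cells[2][3] = { {2, 3, 4}, {3, 5, 4} };
  Edge      cand[2 * 3];                      /* 3 candidate edges per cell   */
  int       ncand = 0, nunique = 0, c, f;

  for (c = 0; c < 2; ++c) {
    for (f = 0; f < 3; ++f) {                 /* oriented facet = (v_f, v_f+1) */
      int a = cells[c][f], b = cells[c][(f + 1) % 3];
      cand[ncand].v[0] = a < b ? a : b;       /* key = sorted vertex list      */
      cand[ncand].v[1] = a < b ? b : a;
      ++ncand;
    }
  }
  qsort(cand, (size_t)ncand, sizeof(Edge), cmpEdge);
  for (f = 0; f < ncand; ++f)
    if (f == 0 || cmpEdge(&cand[f], &cand[f - 1]) != 0) ++nunique;

  /* Prints: 6 candidate facets, 5 unique edges to insert */
  printf("%d candidate facets, %d unique edges to insert\n", ncand, nunique);
  return 0;
}

Knowing the number of unique facets up front is exactly what allows the new plex to be allocated before the second, inserting pass.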
The complexity to interpolate a given stratum is in O(N_C N_F N_T), where N_C is the number of cells, N_F is the maximum number of faces per cell, and N_T is the number of used face types. However, N_F and N_T are obviously constant, so O(N_C N_F N_T) = O(N_C). The O-complexity remains the same even if we sum over all strata because the size of (number of points in) stratum h = k + 1 is a certain multiple of the size of stratum h = k for any feasible height k. We can conclude that the complexity is linear with respect to the mesh size.

2.3. Cone orientation. Let us focus on two particular points in the interpolated mesh in Figure 2(b): cell 0, and edge 7 in C(0). Their cones are C(0) = {6, 7, 8} and C(7) = {3, 4}, respectively. We have so far spoken about which points form C(p) for a given p. However, the orientation of p is also important; for instance, to have a well-defined direction of an outer normal, or to assign field values correctly during simulation. The orientation of p is represented as the order of points in C(p). Hence, e.g., C(0) is rather a tuple, C(0) = (6, 7, 8), and it is this tuple that is stored. Let us from now on denote by C(p, c) the c-th point in this tuple, c = 0, 1, ..., n - 1, where n = size(C(p)).

For consistency, the orientation of points in the plex should be in line with the orientation of their supporting points. There is never a problem for the lowermost and uppermost stratum; cells have no supporting points, and vertices have no orientation. However, for the intermediate strata (i.e. edges and faces computed by interpolation), we can get a conflicting situation as depicted in Figure 3. Edge 7 points against the direction of cell 1, but if we flip it, it will point against the direction of cell 0. Thus we need a mechanism to allow for this.

In general, suppose point p in Stratum(0), its cone point q = C(p, c) in C(p), which lies in Stratum(1), and C(q) in Stratum(2). For example, in Figures 3(b) and 3(d), p = 1, c = 0, q = 7, C(q) = (3, 4). In order to interpret C(q) correctly, additional information about the proper interpretation order needs to be defined. This information must be attached to the edge (p, q) in the DAG because it varies for different choices of p even for the same q. Naturally, the order is not completely arbitrary; it can be described by (1) the starting point index S(p, c) >= 0, and (2) the direction D(p, c) in {-1, 1} (reverse/forward). These two can be represented by a single signed integer O(p, c): S(p, c) by its magnitude, and D(p, c) by its sign. Since the sign is undefined for 0, the negative values are shifted by -1. To summarize,

(2.1)   O(p,c) = \begin{cases} S(p,c), & D(p,c) = 1, \\ -S(p,c) - 1, & D(p,c) = -1, \end{cases}

and the other way around,

(2.2)   D(p,c) = \begin{cases} 1, & O(p,c) \ge 0, \\ -1, & O(p,c) < 0, \end{cases}

(2.3)   S(p,c) = \begin{cases} O(p,c), & O(p,c) \ge 0, \\ -O(p,c) - 1, & O(p,c) < 0. \end{cases}

Note that for C(p, c) being an edge, flipping the orientation of the edge implies changing the starting point, so O(p, c) in {0, -2} only. We can also define the whole tuple of orientations for point p,

(2.4)   O(p) = (O(p,0), \dots, O(p, n-1)),

where n = size(C(p)). This O(p) is attached to every plex point p in the same way as C(p). We note that O(p) is just a numbering for the elements of the dihedral group for a given face.
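A minimal sketch of the encoding and decoding defined by (2.1)-(2.3), written as plain C helpers (illustrative only, not PETSc source):

#include <assert.h>

static int encodeO(int S, int D) { return D == 1 ? S : -S - 1; }   /* (2.1) */
static int decodeD(int O)        { return O >= 0 ? 1 : -1; }       /* (2.2) */
static int decodeS(int O)        { return O >= 0 ? O : -O - 1; }   /* (2.3) */

int main(void)
{
  /* An edge has two cone points, so the only admissible values are
     O = 0 (forward, start 0) and O = -2 (reversed, start 1). */
  assert(encodeO(0,  1) ==  0);
  assert(encodeO(1, -1) == -2);
  assert(decodeD(-2) == -1 && decodeS(-2) == 1);
  return 0;
}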
3. Parallel simulation startup.
Let us call the startup phase all steps necessary to load the mesh from disk storage and prepare it for use in the simulation time loop. It consists of the following steps: (1) raw data loading, (2) plex construction, (3) topological interpolation, (4) distribution.

Before the developments of this paper, these steps were serial and only the last step, a one-to-all distribution using a serial partitioner such as METIS [28, 30], resulted in the distributed mesh. This approach inevitably led to an upper limit on the mesh size due to the memory constraints of a single node of a cluster. Therefore we developed a new, completely parallel startup phase where all four steps are done in parallel right from the beginning. This parallel startup phase is schematically depicted in Figure 4. We further show that even for meshes that fit into memory, the parallel startup can bring significant time and energy savings. Let us describe these stages in more detail in the following subsections.
3.1. Raw data loading. This stage forms the first part of our mesh reader implementation, and consists in reading distributed raw topology and geometry data by generic index set and vector readers; it is dominated by low-level I/O operations.
3.1.1. HDF5. Hierarchical Data Format 5 (HDF5) [22] is a file format and library, designed to store and organize large amounts of N-dimensional array data. It is currently supported by the HDF Group, a not-for-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.
Fig. 3: Mesh from Figures 2(b) and 2(d) with cone points order and orientation. We focus here on the edge 7 within the cones of faces 0 and 1. (a) edge 7 in cone of face 0, C(0,1) = 7; (b) edge 7 in cone of face 1, C(1,0) = 7; (c) the starting point of edge 7 within face 0 is 3: C(C(0,1), S(0,1)) = C(7,0) = 3, O(0,1) = 0; (d) the starting point of edge 7 within face 1 is 4: C(C(1,0), S(1,0)) = C(7,1) = 4, O(1,0) = -2.
Fig. 4: Schematic diagram of the parallel simulation startup. Grey stages are within PETSc scope. Raw data loading (parallel HDF5, MPI-IO, Lustre parallel filesystem) and distributed plex construction together load the naively distributed DMPlex via DMLoad(DM, PetscViewer) with DMType=DMPLEX, PetscViewerType=PETSCVIEWERHDF5 and PetscViewerFormat=PETSC_VIEWER_HDF5_XDMF. Topological interpolation then generates edges and faces from vertices and cells on the fly, in parallel. Optional repartitioning minimizes the partition interface using a parallel partitioner. Finally, the application (Salvus) runs the parallel FEM (SEM) simulation.
HDF5's file structure includes two major types of objects: (1) datasets, multidimensional arrays of a homogeneous type; (2) groups, container structures which can hold datasets and other groups. Every HDF5 file has a root group /, under which one can add additional groups and datasets. This results in a hierarchical, filesystem-like data format. Resources in an HDF5 file can be accessed using the POSIX-like syntax /group1/group2/dataset. Metadata is stored in the form of user-defined, named attributes attached to groups and datasets [22].

HDF5 transparently handles how all the objects map to the actual bytes in the file. HDF5 actually provides an abstracted filesystem-within-a-file that is portable to any system with the HDF5 library installed, regardless of the underlying storage type, filesystem, or endianness. It does automatic conversions between storage datatypes (dictated by the data file) and runtime memory datatypes (dictated by the application) [22].

HDF5 supports parallel shared-file I/O using MPI-IO [40] capabilities which in turn provide scalable access to the underlying parallel filesystem such as Lustre [42]. By default, HDF5 provides uniform access to all parts of the file for all processes of the communicator (passed to HDF5 using H5Pset_fapl_mpio()). However, since PETSc uses data parallelism, it would be very inefficient to load all data to all processes and then distribute them again. The most important functionality in this regard is the hyperslab, which can read or write a portion of a dataset. A hyperslab can be a logically contiguous collection of points in a dataspace, or a regular pattern of points or blocks in a dataspace. The hyperslab can select a separate chunk of the file for each process individually by means of a rank-dependent offset. The H5Sselect_hyperslab() function is used for this purpose [23].
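As an illustration, the following sketch shows how each rank could read its own contiguous block of rows from a shared dataset via a hyperslab selection; the file name mesh.h5, the dataset path /topology/cells and the row width 8 are hypothetical placeholders, and error checking is omitted:

#include <hdf5.h>
#include <mpi.h>

/* Read rows [rstart, rstart+rcount) of an [N][8] integer dataset into buf. */
int read_chunk(MPI_Comm comm, long long *buf, hsize_t rstart, hsize_t rcount)
{
  hid_t   fapl, file, dset, fspace, mspace;
  hsize_t start[2] = {rstart, 0}, count[2] = {rcount, 8};

  fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);         /* MPI-IO backed access */
  file = H5Fopen("mesh.h5", H5F_ACC_RDONLY, fapl);
  dset = H5Dopen2(file, "/topology/cells", H5P_DEFAULT);

  fspace = H5Dget_space(dset);                          /* file dataspace       */
  H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
  mspace = H5Screate_simple(2, count, NULL);            /* memory dataspace     */

  H5Dread(dset, H5T_NATIVE_LLONG, mspace, fspace, H5P_DEFAULT, buf);

  H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
  return 0;
}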
3.1.2. XDMF. XDMF (eXtensible Data Model and Format) [50] is a mesh data file format. It distinguishes the metadata (light data) and the values themselves (heavy data). Light data and heavy data are stored using XML and HDF5, respectively. There are two crucial datasets describing the mesh in a minimal sufficient way: (1) the topology dataset, storing the cell-vertex connectivity, and (2) the geometry dataset, storing the vertex coordinates. The heavy data can either be embedded directly in the XML file, or stored in HDF5 datasets which the XML file merely references by their paths (e.g. MyData.h5:/geometry/vertices). Both ways can be mixed within the same XDMF file, and are both supported by widely used visualization programs such as ParaView, VisIt and EnSight. Nevertheless, the former way is advisable only for small datasets. We always store all data in the HDF5 file and use the XDMF file only as a descriptor. We will refer to this as the HDF5/XDMF format.
3.1.3. PetscViewer and DMLoad(). PETSc contains a class, called PetscViewer, designated for all I/O of any PetscObject such as a vector, matrix, or linear solver. The source/destination is dictated by the type (PetscViewerType) such as ascii, binary, hdf5, socket and others. A more fine-grained control of how the object is read/viewed is accomplished with the viewer format (PetscViewerFormat); e.g. ascii_info and ascii_info_detail print plain text information about the object with a different level of verbosity.

Most relevant for this work is that DM supports both reading and writing with PetscViewer, using DMLoad(DM,PetscViewer) and DMView(DM,PetscViewer), respectively. Recently, we have implemented a new PetscViewerFormat hdf5_xdmf for the PetscViewerType hdf5 to enable parallel reading and writing of unstructured meshes represented in memory as a plex (DM with DMType plex) and stored in an HDF5 file with topology and geometry data compatible with XDMF (see subsections 3.1.1 and 3.1.2). HDF5/XDMF has become the first widely used mesh format supported by PETSc which is both readable and writable in parallel.
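A minimal sketch of this reading path (error handling reduced to CHKERRQ, the file name mesh.h5 assumed) could look as follows; the interpolation and redistribution calls correspond to the later stages described in the rest of this section:

#include <petscdmplex.h>
#include <petscviewerhdf5.h>

PetscErrorCode LoadMesh(MPI_Comm comm, DM *mesh)
{
  PetscViewer    viewer;
  DM             dm, dmInt, dmDist;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscViewerHDF5Open(comm, "mesh.h5", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = PetscViewerPushFormat(viewer, PETSC_VIEWER_HDF5_XDMF);CHKERRQ(ierr);
  ierr = DMCreate(comm, &dm);CHKERRQ(ierr);
  ierr = DMSetType(dm, DMPLEX);CHKERRQ(ierr);
  ierr = DMLoad(dm, viewer);CHKERRQ(ierr);     /* parallel raw read + naive plex construction */
  ierr = PetscViewerPopFormat(viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  ierr = DMPlexInterpolate(dm, &dmInt);CHKERRQ(ierr);   /* parallel topological interpolation */
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = DMPlexDistribute(dmInt, 0, NULL, &dmDist);CHKERRQ(ierr); /* optional repartitioning */
  if (dmDist) { ierr = DMDestroy(&dmInt);CHKERRQ(ierr); *mesh = dmDist; }
  else        { *mesh = dmInt; }
  PetscFunctionReturn(0);
}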
As we are interested here mainly in the simulation startup phase, we will now focus only on DMLoad(), namely its implementation for the combination of DMType, PetscViewerType and PetscViewerFormat described above. This implementation relies on lower-level readers for integer vectors (index sets, IS) and scalar vectors (Vec): ISLoad() and VecLoad(). They both read datasets from given paths within the HDF5 file (see subsection 3.1.1) using the H5Dread() routine but with different HDF5 datatypes (H5T_NATIVE_INT and H5T_NATIVE_DOUBLE, respectively). They make use of a hyperslab (see subsection 3.1.1) reflecting the given parallel layout (PetscLayout) to divide the global dataset into local chunks along the first dimension. The layout is either specified by the user, or calculated automatically so that the chunks' lengths differ by 1 at most. The second dimension of the dataset is interpreted as a block size, i.e., the resulting vector is divided into equally sized shorter blocks. Blocks can have various contextual meanings such as DOFs of the same element.
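For illustration, the automatic layout mentioned above can be computed as follows (a sketch of one possible implementation, not necessarily PETSc's exact code):

/* Split N dataset rows among P ranks so chunk lengths differ by at most 1;
   rank r then reads rows [offset, offset+n). */
static void chunk_bounds(long N, int P, int r, long *offset, long *n)
{
  long base = N / P, rem = N % P;
  *n      = base + (r < rem ? 1 : 0);
  *offset = r * base + (r < rem ? r : rem);
}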
Within DMLoad(), VecLoad() loads the geometry information into a Vec, whose blocks and entries represent vertices and their coordinates, respectively. The size of all blocks is the same and corresponds to the spatial dimension of the mesh, and global indices of the blocks form an implicit global vertex numbering.

ISLoad() then loads the cell-vertex topology information, which is represented in XDMF by the topology dataset, into an IS whose blocks represent cells and whose entries refer to the global vertex numbering above. The construction of a plex from these raw data also happens within DMLoad(), and is realized by a call to a communication-bounded operation described in detail in subsection 3.2.2. The resulting DMPlex instance is naively distributed, with vertices and cells only.

3.2. Distributed plex construction. Let us first describe a parallel star forest graph implementation, which allows gluing together serial plexes across different processes.
3.2.1. Star forest. The Star Forest (SF, in PETSc called PetscSF [4, 8]) is a forest of star graphs. The root of each star graph corresponds to something owned by a process, such as a mesh point or solution degree of freedom (DOF). The leaves of each star graph are the shared versions of that point or DOF on other processes, or what is often called ghost points. In particular, the SF is a map from local points p to remote points (r, q), where r is the remote rank and q is the local number for the point on process r. Since the SF only deals with local numberings, not global numberings, parallel mesh modification is much easier. Moreover, SF accepts data buffers with arbitrary MPI types for its communication routines.

SF supports a broadcast operation from roots to leaves, as well as a reduction from leaves to roots. In addition, it supports gather and scatter operations over the roots, inversion of the SF to get two-sided information, and a fetch-and-op atomic update. Plex can build a hierarchy of SF objects describing the overlap of partitions, point set, topology, data layout, and Jacobian information [33].
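The following sketch illustrates basic PetscSF usage: building an SF from a local-to-remote mapping and broadcasting root data to the ghost copies. It follows the PETSc 3.13-era API targeted by this paper (newer releases add an MPI_Op argument to the broadcast calls); the graph arrays are assumed to be prepared by the caller.

#include <petscsf.h>

PetscErrorCode GhostExchange(MPI_Comm comm, PetscInt nroots, PetscInt nleaves,
                             PetscInt *ilocal, PetscSFNode *iremote,
                             const PetscScalar *rootdata, PetscScalar *leafdata)
{
  PetscSF        sf;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = PetscSFCreate(comm, &sf);CHKERRQ(ierr);
  /* Each leaf ilocal[i] points to the remote root iremote[i] = (rank, index). */
  ierr = PetscSFSetGraph(sf, nroots, nleaves, ilocal, PETSC_COPY_VALUES,
                         iremote, PETSC_COPY_VALUES);CHKERRQ(ierr);
  /* Broadcast owned (root) values to all ghost (leaf) copies. */
  ierr = PetscSFBcastBegin(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = PetscSFBcastEnd(sf, MPIU_SCALAR, rootdata, leafdata);CHKERRQ(ierr);
  ierr = PetscSFDestroy(&sf);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}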
3.2.2. Parallel plex construction. Once the raw topology and geometry data is loaded as described in subsection 3.1.3, vertex coordinates are initially distributed independently of the cells, with process p getting some N_V(p) x d vertex coordinates. For example, if we have 20 vertices in 3D and 4 processes, each process might get 15 doubles, namely the coordinates for 5 contiguously numbered vertices, regardless of the cells it has been assigned.

In order to construct a parallel plex from these vertex and cell portions, we employ DMPlexCreateFromCellListParallel(). From the cell-vertex topology information, it determines how many unique local vertices a process owns using a hash table of the vertex numbers. The cone information for a local plex is constructed by translating the global vertex numbers to local vertex numbers, using the index of each local vertex in the hash table. Once we have the local cone information, we symmetrize it to get the support information (DMPlexSymmetrize()), and construct strata labels (DMPlexStratify()). Both these operations have linear complexity.

Given this hash table of local vertices and the global division, we construct an SF mapping the initial vertex division to the local division, sfVert. With this SF, we broadcast coordinates from the initial distribution roots to the local distribution leaves, determined by the mesh topology. In addition, we can use the initial SF to construct an SF describing the sharing of local vertices between processes, sfPoint. We first construct an array which holds (r, l) for each local vertex. Then we reduce this array to the roots using sfVert, taking the maximum over ranks. This information is then broadcast back to the leaves and used to build sfPoint.

The resulting distributed plex object has partition-wise complete topological information. Each partition is a serial plex object which includes all incident vertices of its elements. These serial plex objects are combined together by the parallel SF object sfPoint as shown in Figure 5.

Fig. 5: Parallel DMPlex. Grey dotted arrows denote sfPoint. (a) mesh from Figure 2(a) after distribution; (b) mesh from (a) after interpolation; (c) plex representation of (a); (d) plex representation of (b).
3.3. Parallel topological interpolation. Since the steps above lead to a distributed plex, we need a parallel version of the topological interpolation (subsection 2.2). This step appeared to be the most challenging one. It consists of the serial interpolation (in-memory computations) and a small amount of communication.

The first step consists in applying the sequential topological interpolation (subsection 2.2) and cone orientation (subsection 2.3) on each rank independently. Then we must alter the PetscSF structure which identifies mesh points owned by different processes, or leaf points. The SF structure is described in subsection 3.2.1. We mark all leaf points which are adjacent to another ghost point as candidates. These candidate points are then gathered to root point owners (using PetscSFBcast()). For each candidate, the root checks for each point in the cone that either it owns that point in the SF or it is a local point. If so, it claims ownership. These claims are again broadcast, allowing a new SF to be created incorporating the new edges and faces.

The cone orientation has been done on each rank independently, and hence it is only partition-wise correct. However, we have not yet handled the following assumption: if interface edges/faces owned by different ranks represent the same geometrical entity, i.e., they are connected by pointSF like edges [0]5 and [1]6 in Figure 6, they must have a conforming order of cone points ([r] means ownership by rank r). This requirement can be written more rigorously as the implication

(3.1)   p_0 \to p_1 \;\wedge\; C(p_0) = (q_{0,0}, \dots, q_{0,n-1}) \;\wedge\; C(p_1) = (q_{1,0}, \dots, q_{1,n-1}) \;\Longrightarrow\; q_{0,0} \to q_{1,0} \;\wedge\; \dots \;\wedge\; q_{0,n-1} \to q_{1,n-1},

using notation from subsection 2.3 and the relation -> meaning a connection via pointSF.

In Figures 6(a) and 6(c) this assumption is violated for the edges [0]5 and [1]6. They are flipped with respect to each other; more rigorously speaking, pointSF connects the edge and its incident vertices,

[0]5 -> [1]6, [0]2 -> [1]1, [0]3 -> [1]3,

but the order of cone points does not conform,

[0]2 = C([0]5, 0) -/-> C([1]6, 0) = [1]3,   [0]3 = C([0]5, 1) -/-> C([1]6, 1) = [1]1.

This would lead to an incorrect PDE solution if the used discretization method makes use of the edge.

In order to satisfy this requirement, an additional synchronization of the interface cones must be carried out. We start with a synchronization of the interface cone point numbering. Let us remind the reader that pointSF is a one-sided structure, so only the origins of the arrows can be found directly. Let us assume rank r_0, its edge/face [r_0]p_0, and that there is a pointSF arrow pointing from [r_0]p_0 to some [r_1]p_1. If we detect an arrow directed from the cone point C([r_0]p_0, c) -> [r_1]q_c, we set root([r_0]p_0, c) = (r_1, q_c); otherwise root([r_0]p_0, c) = (r_0, C([r_0]p_0, c)). This root([r_0]p_0, c) is sent to r_1 using PetscSFBcastBegin/End(), and stored at the destination rank as leaf([r_1]p_1, c). This is done for each rank, each point in Stratum(h), h > 0, and each c = 0, 1.

From the r_1 point of view, it has, for c = 0, 1, its point [r_1]p_1, its own root([r_1]p_1, c), and the received leaf([r_1]p_1, c). If root([r_1]p_1, c) = leaf([r_1]p_1, c) does not hold for both c = 0, 1, we must rotate and/or flip the cone so that this condition gets satisfied. In that case we must also update O(s) for all s in S([r_1]p_1) accordingly to compensate for the change of cone order.

We can see that the orientation synchronization heavily relies on the correct pointSF. This is why it must be processed first.

Fig. 6: Parallel DMPlex with cone points order and orientation. (a) mesh from Figure 5(b) with non-conforming cone orientation; (b) mesh from Figure 5(b) with conforming cone orientation; (c) plex representation of (a), O([1]0, 0) = 0; (d) plex representation of (b), O([1]0, 0) = -2.

3.4. Redistribution.
We now have a correct distributed DMPlex instance representing all codimensions. However, the distribution is naive; it is load balanced with respect to the size of partitions, but the partition shape is not optimized. It can optionally be further improved using a parallel partitioner to minimize the partition interfaces and hence reduce the halo communication in the subsequent application computation. PETSc offers interfaces to ParMETIS [29, 30, 31] and PT-Scotch [15]. For this paper, we always used ParMETIS. We will not discuss this stage further because it is not critically needed to overcome the single-node memory limit (our main motivation), its cost-effectiveness depends on the application, and it was implemented in PETSc beforehand.
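A sketch of this optional step (assuming an interpolated, naively distributed plex dm) with ParMETIS selected explicitly:

#include <petscdmplex.h>

PetscErrorCode Repartition(DM dm, DM *dmParallel)
{
  PetscPartitioner part;
  PetscErrorCode   ierr;

  PetscFunctionBeginUser;
  ierr = DMPlexGetPartitioner(dm, &part);CHKERRQ(ierr);
  ierr = PetscPartitionerSetType(part, PETSCPARTITIONERPARMETIS);CHKERRQ(ierr);
  /* 0 = no overlap; *dmParallel is NULL if no redistribution takes place. */
  ierr = DMPlexDistribute(dm, 0, NULL, dmParallel);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}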
4. Seismic wave propagation modeling.
As a representative use case and benchmarking tool for the new parallel simulation startup phase described above in section 3, we use Salvus [1], a spectral-element waveform modeling framework that employs DMPlex for its mesh management (cf. Figure 4).
5. Performance results.
This section presents scalability tests of the new parallel simulation startup phase (section 3) used within seismic wave propagation modeling (section 4).
All benchmarks were run at Piz Daint, the flagship system of the Swiss National Supercomputing Centre (CSCS). Piz Daint consists of 5704 12-core Cray XC50 nodes with GPU accelerators, and 1813 36-core Cray XC40 nodes without accelerators. All benchmarks presented in this paper ran on the XC50 nodes. Each of them is equipped with one 12-core Intel Xeon E5-2690 v3 (Haswell) processor, one NVIDIA Tesla P100 16GB GPGPU accelerator and 64 GB RAM. Piz Daint has 8.8 PB shared scratch storage with a peak performance of 112 GiB/s. It is implemented using the Cray Sonexion 3000 [16] scale-out storage system equipped with the Lustre parallel file system [42], 40 object storage targets (OSTs) and 2 metadata servers (MDSs).
We always used the single-shared-file approach, i.e., every process reads its disjoint chunk from the common file. There are many good reasons for this choice, such as the reduction of metadata accesses, but the main reason is the flexibility in the number of processes using the same file, unlike the file-per-process approach. As for the Lustre file system settings, we used a stripe count of 40 (the maximum on Piz Daint) and a stripe size of 16 MB. Regarding HDF5/MPI-IO, we always used non-collective reading. We also tested collective reading with various numbers of aggregators (MPI-IO hint cb_nodes) but never saw any significant benefit. The raw file reading always took less than 2 seconds; we cannot exclude that some scenarios with much bigger files and/or node counts could require more deliberate settings, but such scenarios are irrelevant within the context of this paper.
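For completeness, a sketch of how MPI-IO hints such as cb_nodes can be passed to HDF5 when experimenting with collective reading (the hint value 8 is only an example; as stated above, the default non-collective path was used for the results in this paper):

#include <hdf5.h>
#include <mpi.h>

hid_t make_fapl(MPI_Comm comm)
{
  MPI_Info info;
  hid_t    fapl;

  MPI_Info_create(&info);
  MPI_Info_set(info, "cb_nodes", "8");        /* number of MPI-IO aggregators */
  fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);          /* hints travel with the fapl  */
  MPI_Info_free(&info);
  return fapl;
}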
Our performance benchmark consists in elastic wave propagation in homogeneous isotropic media from a point source in a cubic geometry. The cube is discretized into equally sized hexahedral cells, handled as an unstructured mesh. Each cell hosts a 4-th order spectral element with 125 spatial DOFs. Since a 3-D vector equation was solved, this resulted in 1125 field variables per element, together representing acceleration, velocity, and displacement. Figure 7 illustrates the solution of the benchmark problem at two different timesteps. Table 1 summarizes the dimensions of the stored topology and geometry datasets and the resulting file sizes for different numbers of elements in the x-direction (NEX). Our sequence of NEX was chosen so that the total number of elements (NE) of each successive mesh is approximately doubled, starting at 8 million.

We present the performance of the new parallel startup phase in several graphs. The graphs in Figure 8 present strong scalability for the mesh size of 16 million elements, and serve mainly for comparison of the new parallel startup with the original serial startup. 16 million is an upper bound on the mesh size for the serial startup imposed by the memory of a single Piz Daint Cray XC50 node. Overcoming this limit is for us the most important achievement of the startup phase parallelization. However, the performance improvement is obviously very significant as well. At 1024 nodes, the serial startup takes an amount of wall time equivalent to 202'886 timesteps in the subsequent time loop, whereas the parallel startup takes only 1827 timesteps, which means a 111x speedup. The number of timesteps in production simulations varies, but generally 30'000 or more are required.

The graphs in Figure 9 are similar to Figure 8 but show only the parallel startup scalability for various mesh sizes, so that the displayed time can be limited to 70 seconds. These graphs gather all stages in a single graph, and different mesh sizes are presented separately. By contrast, the graphs in Figure 10 show strong and weak scalability for each stage separately, gathering all mesh sizes in a single graph. We can see that the topological interpolation scales almost perfectly and becomes insignificant for high numbers of nodes even for very large meshes. The other stages do not scale that well; however, their absolute wall times are rather small for the mesh sizes and node counts of interest. The scalability of the significant redistribution stage breaks at about 256 nodes, and the mesh size no longer dictates the wall time. ParMETIS was used with default settings; there might be some space for slight improvement by tuning its parameters, but in general it is well known that current graph partitioners do not scale beyond 10'000 cores. We can hardly do anything about it apart from perhaps testing alternative approaches such as space-filling curves. There might be some space for improvement left for the distributed plex construction; nevertheless, from the stagnation between 256 and 1024 nodes for the largest mesh we conclude that such optimization is probably not worth the effort, at least for now.
6. Application: seismicity on Mars.
In late 2018, the NASA InSight mission [6] placed a highly sensitive seismometer [39] on Mars' surface and recorded the first seismic signals ever observed on Mars [21]. The observation of seismic waves is a crucial source of information to investigate the interior structure and composition of Mars. However, as the data show significant differences to seismic data from both Moon and Earth, numerical simulations of seismic wave propagation on Mars that account for topography as well as 3D scattering due to lateral variations of the material parameters are key to assist the interpretation of the observational data.

Full waveform simulations are essential to constrain the planet's structure using data in the frequency band recorded by the probe. The seismic response to marsquakes or asteroid impacts is governed by a coupled system of the elastic and acoustic wave equations, which model seismic waves propagating through Mars' mantle and the liquid core, respectively. This can be simulated efficiently using the Spectral Element Method (SEM).
Fig. 7: Cube benchmark with a moment-tensor type source located at the center of the domain. (a) t = 0.1 s; (b) t = 0.15 s. An isotropic elastic material model without attenuation was applied. The mesh size was 256M elements, and 512 Piz Daint nodes (6144 CPUs, 512 GPUs) were used for the computation. The normalized magnitude of the velocity vector at time t is visualized. We can see the P-wave propagating from the source in (a), while (b) shows the S-wave and reflections of the P-wave from the cube boundary.

NEX | topology (int64): rows = NE = NEX^3 | columns | geometry (double): rows | columns | file size (GB)
200 |   8'000'000 | 8 |   8'120'601 | 3 |  0.66
252 |  16'003'008 | 8 |  16'194'277 | 3 |  1.32
318 |  32'157'432 | 8 |  32'461'759 | 3 |  2.64
400 |  64'000'000 | 8 |  64'481'201 | 3 |  5.26
504 | 128'024'064 | 8 | 128'787'625 | 3 | 10.51
635 | 256'047'875 | 8 | 257'259'456 | 3 | 21.01

Table 1: Cube benchmark datafiles: number of elements in x-direction (NEX); total number of elements (NE); topology (connectivity) and geometry (vertices) dataset sizes; file sizes. Both topology (integer numbers) and geometry (real numbers) are stored with 64-bit precision.

For the computation of the seismic response of Mars, we rely on Salvus' implementation of the SEM (section 4). Salvus' internal mesher uses custom algorithms to generate fully unstructured conforming 3D hexahedral meshes [49], efficiently representing topography and the extreme crustal thickness variations of Mars.
Fig. 8: Cube benchmark. Strong scalability, serial/parallel startup, 16 million mesh elements. (a) serial startup; (b) parallel startup. The serial and parallel simulation startup phases are compared with each other and with 1000 steps of the Salvus time loop in terms of wall time. Approximate wall time for any different number of timesteps can be obtained using simple proportionality. The number of timesteps in production simulations varies, but generally 30'000 or more are required. The particular startup stages are described in section 3. X-axis: number of Piz Daint nodes, each equipped with 12 cores and 1 GPU per node. Y-axis: wall time in seconds. Labels above bars: the number of timesteps that take the same wall time as the startup phase. Order of colors in the bars is the same as in the legend.

The number of elements needed to resolve a given frequency grows with the third power of that frequency which, together with the time dimension, results in the computational complexity of a simulation scaling with frequency to the power of 4. When using 4-th order spectral elements, which is common for planetary-scale wave propagation, more than 6 grid points per shortest wavelength are needed to accurately resolve seismic waves [19]. As the quakes are small in magnitude and the noise level increases at low frequencies, large-scale simulations are required to reach the parameter regime of the observations, and the required number of spectral elements can easily reach hundreds of millions.

Such mesh sizes necessitate the parallel simulation startup presented in section 3. Prior to these developments, the largest possible mesh size was limited by the available memory of a single Piz Daint node to approximately 16 million elements. Moreover, loading a mesh of such a size took more than four minutes. With the parallel startup in hand, these limitations vanish.

Figure 11(d) shows a snapshot of the surface displacement resulting from a simulation of a hypothetical quake on Mars. Here, the discretized coupled elastic wave equation has approximately 124 million 4-th order spectral elements, and we compute 100'455 timesteps representing a simulated time of 30 minutes. Each element has 125 spatial DOFs, each hosting 9 dynamic field components (vector displacement, velocity, and acceleration). These parameters lead to an unprecedented resolved period of 3.2 s. Using again all 12 cores per Piz Daint node as well as the attached Tesla P100 GPU, this simulation took approximately 2.4 hours of wall time on 256 Piz Daint nodes. From this total wall time, raw data loading took 0.9 s, distributed plex construction 4.5 s, topological interpolation 3.1 s and redistribution 2.3 s, i.e., the whole startup phase took less than 11 s.
Fig. 9: Cube benchmark. Strong scalability, parallel startup, various mesh sizes. (a) 8 million elements; (b) 16 million elements; (c) 32 million elements; (d) 64 million elements; (e) 128 million elements; (f) 256 million elements. The parallel simulation startup phase is compared with 1000 steps of the Salvus time loop in terms of wall time. Approximate wall time for any number of timesteps can be obtained using simple proportionality. The number of timesteps in production simulations varies, but generally 30'000 or more are required. The particular stages are described in section 3. X-axis: number of Piz Daint nodes, each with 12 cores and 1 GPU. Y-axis: wall time in seconds. Labels above bars: the number of timesteps that take the same wall time as the startup phase. Missing bars: out-of-memory failure during the time loop caused by the memory limit of the GPUs. Order of colors in the bars is the same as in the legend. Note: (a) is the same as Figure 8(b) but with a re-scaled time axis.
Fig. 10: Cube benchmark. Strong and weak scalability per stage. (a) raw data loading; (b) distributed plex construction; (c) topological interpolation; (d) redistribution; (e) total startup; (f) 1000 Salvus timesteps. X-axis: number of Piz Daint nodes (log2 scale), each with 12 cores and 1 GPU. Y-axis: wall time in seconds (log2 scale). Solid lines show the strong scalability for different mesh sizes. Dashed line shows the weak scalability for the mesh size of approximately 40'000 elements per core, which is the upper bound for the Salvus time loop imposed by limited GPU memory. Order of line styles is the same in the plots and in the legends.
Fig. 11: Anisotropic mesh refinements to accurately model the structure of Mars on a conforming hexahedral mesh. (a) cubed sphere with fluid core (blue) and solid mantle (red); (b) modeling 3D topography on the surface; (c) crustal thickness variations (red); (d) snapshot of the seismic wavefield on the surface (result of the simulation). The Mars texture map is based on NASA elevation and imagery data.
7. Conclusions.
We presented algorithmic strategies for handling unstructured meshes for high-order finite-element simulations on complex domains. In particular, we demonstrated new capabilities for parallel mesh reading and topological interpolation in PETSc DMPlex, which enable a fully parallel workflow starting from the initial data file reading. This is beneficial not only for direct users of DMPlex but also for users of software libraries and packages employing DMPlex, such as Firedrake [45], Salvus [1], or HPDDM [27].

This work in a sense follows up on [35] and addresses the main task stated in their Future Work: "Most crucially perhaps is the development of a fully parallel mesh input reader in PETSc in order to overcome the remaining sequential bottleneck during model initialisation."
Moreover, that paper mentions the "HDF5-based XDMF output format"; within this work, it has become an input format as well. Hence, HDF5/XDMF has become the first widely used mesh format supported by PETSc which is both readable and writable in parallel.

The implementation is agnostic to the type of finite elements in the mesh and completely decoupled from the governing equations, and is thus applicable in many scientific disciplines. In particular, our solution overcomes bottlenecks in numerical modeling of seismic wave propagation on Mars and shows excellent parallel scalability in large-scale simulations on more than 12'000 cores.
Acknowledgments.
All presented PETSc developments have been made publicly available in PETSc since its release 3.13.

We gratefully acknowledge support from the Swiss National Supercomputing Centre (CSCS) under projects s922 and s961; the Platform for Advanced Scientific Computing (PASC) under the project "Salvus"; the Swiss National Science Foundation (SNF) BRIDGE fellowship program under the grant agreement No. 175322; the European Research Council (ERC) from the EU's Horizon 2020 programme under grant agreement No. 714069; and the ETH Zurich Postdoctoral Fellowship Program, which in turn received funding from the EU's Seventh Framework Programme under the grant agreement No. 608881.
REFERENCES

[1] M. Afanasiev, C. Boehm, M. van Driel, L. Krischer, M. Rietmann, D. A. May, M. G. Knepley, and A. Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, 216 (2019), pp. 1675-1692, https://doi.org/10.1093/gji/ggy469.
[2] R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, J. Dahm, D. Medina, and S. Zampini, MFEM: a modular finite element methods library, 2019, https://arxiv.org/abs/1911.09220.
[3] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc web page.
[4] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc users manual.
[5] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, Efficient management of parallelism in object oriented numerical software libraries, in Modern Software Tools in Scientific Computing, E. Arge, A. M. Bruaset, and H. P. Langtangen, eds., Birkhäuser Press, 1997, pp. 163-202, https://doi.org/10.1007/978-1-4612-1986-6_8.
[6]
W. B. Banerdt, S. E. Smrekar, D. Banfield, D. Giardini, M. Golombek, C. L. John-son, P. Lognonn´e, A. Spiga, T. Spohn, C. Perrin, S. C. St¨ahler, D. Antonangeli,S. Asmar, C. Beghein, N. Bowles, E. Bozdag, P. Chi, U. Christensen, J. Clinton,G. S. Collins, I. Daubar, V. Dehant, M. Drilleau, M. Fillingim, W. Folkner, R. F.Garcia, J. Garvin, J. Grant, M. Grott, J. Grygorczuk, T. Hudson, J. C. E. Irving,G. Kargl, T. Kawamura, S. Kedar, S. King, B. Knapmeyer-Endrun, M. Knapmeyer,M. Lemmon, R. Lorenz, J. N. Maki, L. Margerin, S. M. McLennan, C. Michaut,D. Mimoun, A. Mittelholz, A. Mocquet, P. Morgan, N. T. Mueller, N. Murdoch,S. Nagihara, C. Newman, F. Nimmo, M. Panning, W. T. Pike, A.-C. Plesa, S. Ro-driguez, J. A. Rodriguez-Manfredi, C. T. Russell, N. Schmerr, M. Siegler, S. Stan-ley, E. Stutzmann, N. Teanby, J. Tromp, M. van Driel, N. Warner, R. Weber, andM. Wieczorek , Initial results from the InSight mission on Mars , Nat. Geosci., 13 (2020),pp. 183–189, https://doi.org/10.1038/s41561-020-0544-y.[7]
N. Barral, M. G. Knepley, M. Lange, M. D. Piggott, and G. J. Gorman , Anisotropicmesh adaptation in firedrake with petsc dmplex , 2016, https://arxiv.org/abs/1610.09874.[8]
J. Brown , Star forests as a parallel communication model , 2011, https://jedbrown.org/files/StarForest.pdf (accessed 2020-03-17).[9]
J. Brown, M. G. Knepley, D. A. May, L. C. McInnes, and B. F. Smith , Composable linearsolvers for multiphysics , in Proceeedings of the 11th International Symposium on Paralleland Distributed Computing (ISPDC 2012), IEEE Computer Society, 2012, pp. 55–62,https://doi.org/10.1109/ISPDC.2012.16.[10]
J. Brown, M. G. Knepley, and B. F. Smith , Run-time extensibility and librarization ofsimulation software , Computing in Science Engineering, 17 (2015), pp. 38–45, https://doi.org/10.1109/MCSE.2014.95.[11]
P. R. Brune, M. G. Knepley, and L. R. Scott , Unstructured geometric multigrid in two andthree dimensions on complex and graded meshes , SIAM Journal on Scientific Computing, 35(2013), pp. A173–A191, https://doi.org/10.1137/110827077, https://arxiv.org/abs/1104.0261.[12]
C. Burstedde, O. Ghattas, M. Gurnis, T. Isaac, G. Stadler, T. Warburton, andL. Wilcox , Extreme-scale AMR , in SC ’10: Proceedings of the 2010 ACM/IEEE Inter-national Conference for High Performance Computing, Networking, Storage and Analysis,2010, pp. 1–12, https://doi.org/10.1109/SC.2010.25.[13]
C. Burstedde, L. C. Wilcox, and O. Ghattas, p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees, SIAM Journal on Scientific Computing, 33 (2011), pp. 1103-1133, https://doi.org/10.1137/100791634.
[14] H. Bériot, A. Prinn, and G. Gabard, Efficient implementation of high-order finite elements for Helmholtz problems, International Journal for Numerical Methods in Engineering, 106 (2016), pp. 213-240, https://doi.org/10.1002/nme.5172.
[15]
C. Chevalier and F. Pellegrini , PT-Scotch: A tool for efficient parallel graph ordering ,Parallel Computing, 34 (2008), pp. 318–331, https://doi.org/10.1016/j.parco.2007.12.001.[16]
Cray Inc. , Cray Sonexion 3000 Storage System
P. E. Farrell, L. Mitchell, and F. Wechsung , An augmented Lagrangian preconditionerfor the 3D stationary incompressible Navier–Stokes equations at high Reynolds number ,SIAM Journal on Scientific Computing, 41 (2019), pp. A3073–A3096, https://doi.org/10.1137/18M1219370.[18]
A. Ferroni, P. Antonietti, I. Mazzieri, and A. Quarteroni , Dispersion-dissipation analy-sis of 3-D continuous and discontinuous spectral element methods for the elastodynamicsequation , Geophys. J. Int., 211 (2017), pp. 1554–1574, https://doi.org/10.1093/gji/ggx384.[19]
A. Fichtner , Full Seismic Waveform Modelling and Inversion , Advances in Geophysical andEnvironmental Mechanics and Mathematics, Springer Berlin Heidelberg, 2011, https://doi.org/10.1007/978-3-642-15807-0.[20]
J. E. Flaherty, R. M. Loy, M. S. Shephard, B. K. Szymanski, J. D. Teresco, and L. H.Ziantz , Adaptive local refinement with octree load balancing for the parallel solution ofthree-dimensional conservation laws , Journal of Parallel and Distributed Computing, 47(1997), pp. 139–152, https://doi.org/10.1006/jpdc.1997.1412.[21]
D. Giardini, P. Lognonné, W. B. Banerdt, W. T. Pike, U. Christensen, S. Ceylan, J. F. Clinton, M. van Driel, S. C. Stähler, M. Böse, R. F. Garcia, A. Khan, M. Panning, C. Perrin, D. Banfield, E. Beucler, C. Charalambous, F. Euchner, A. Horleston, A. Jacob, T. Kawamura, S. Kedar, G. Mainsant, J.-R. Scholz,
S. E. Smrekar, A. Spiga, C. Agard, D. Antonangeli, S. Barkaoui, E. Barrett,P. Combes, V. Conejero, I. Daubar, M. Drilleau, C. Ferrier, T. Gabsi, T. Gud-kova, K. Hurst, F. Karakostas, S. King, M. Knapmeyer, B. Knapmeyer-Endrun,R. Llorca-Cejudo, A. Lucas, L. Luno, L. Margerin, J. B. McClean, D. Mimoun,N. Murdoch, F. Nimmo, M. Nonon, C. Pardo, A. Rivoldini, J. A. R. Manfredi,H. Samuel, M. Schimmel, A. E. Stott, E. Stutzmann, N. Teanby, T. Warren, R. C.Weber, M. Wieczorek, and C. Yana , The seismicity of Mars , Nat. Geosci., 13 (2020),pp. 205–212, https://doi.org/10.1038/s41561-020-0539-8.[22]
HDF5 Group , HDF5 , https://portal.hdfgroup.org/display/HDF5/HDF5 (accessed 2020-03-20).[23]
HDF5 Group , Parallel HDF5 , https://portal.hdfgroup.org/display/HDF5/Parallel+HDF5(accessed 2020-03-20).[24]
T. Hughes , The Finite Element Method: Linear Static and Dynamic Finite Element Analysis ,vol. 78, Dover Publications, 2000.[25]
T. Isaac, C. Burstedde, L. C. Wilcox, and O. Ghattas , Recursive algorithms for distributedforests of octrees , SIAM Journal on Scientific Computing, 37 (2015), pp. C497–C531, https://doi.org/10.1137/140970963.[26]
T. Isaac and M. G. Knepley , Support for non-conformal meshes in PETSc’s DMPlex inter-face , 2015, https://arxiv.org/abs/1508.02470.[27]
P. Jolivet, F. Nataf, et al. , HPDDM – high-performance unified framework for domaindecomposition methods , https://github.com/hpddm/hpddm (accessed 2020-04-18).[28]
G. Karypis et al. , METIS - serial graph partitioning and fill-reducing matrix ordering , http://glaros.dtc.umn.edu/gkhome/metis/metis/overview (accessed 2020-03-17).[29]
G. Karypis et al. , ParMETIS - parallel graph partitioning and fill-reducing matrix ordering ,http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview (accessed 2020-03-17).[30]
G. Karypis and V. Kumar , A fast and high quality multilevel scheme for partitioning irregulargraphs , SIAM Journal on Scientific Computing, 20 (1998), pp. 359–392, https://doi.org/10.1137/S1064827595287997.[31]
G. Karypis and V. Kumar , A parallel algorithm for multilevel graph partitioning and sparsematrix ordering , Journal of Parallel and Distributed Computing, 48 (1998), pp. 71–95,https://doi.org/10.1006/jpdc.1997.1403.[32]
M. G. Knepley and D. A. Karpeev , Mesh algorithms for PDE with Sieve I: Mesh distribution ,Scientific Programming, 17 (2009), pp. 215–230, https://doi.org/10.3233/SPR-2009-0249.[33]
M. G. Knepley, M. Lange, and G. J. Gorman , Unstructured overlapping mesh distributionin parallel , 2017, https://arxiv.org/abs/1506.06194.[34]
T. Kärnä, S. C. Kramer, L. Mitchell, D. A. Ham, M. D. Piggott, and A. M. Baptista, Thetis coastal ocean model: discontinuous Galerkin discretization for the three-dimensional hydrostatic equations, Geoscientific Model Development, 11 (2018), pp. 4359-4382, https://doi.org/10.5194/gmd-11-4359-2018.
[35]
M. Lange, M. G. Knepley, and G. J. Gorman , Flexible, scalable mesh and data manage-ment using PETSc DMPlex , in Proceedings of the 3rd International Conference on ExascaleApplications and Software, EASC ’15, Edinburgh, Scotland, UK, 2015, University of Edin-burgh, pp. 71–76, http://dl.acm.org/citation.cfm?id=2820083.2820097 (accessed 2020-03-30).[36]
M. Lange, L. Mitchell, M. G. Knepley, and G. J. Gorman , Efficient mesh managementin Firedrake using PETSc DMPlex , SIAM Journal on Scientific Computing, 38 (2016),pp. S143–S155, https://doi.org/10.1137/15M1026092.[37]
M. G. Larson and F. Bengzon , The finite element method: theory, implementation, andapplications , no. 10 in Texts in computational science and engineering, Springer, 2013,https://doi.org/10.1007/978-3-642-33287-6.[38]
A. Logg , Efficient representation of computational meshes , International Journal of Compu-tational Science and Engineering, 4 (2009), pp. 283–295, https://doi.org/10.1504/IJCSE.2009.029164.[39]
P. H. Lognonn´e, W. B. Banerdt, D. Giardini, W. T. Pike, U. Christensen, P. Laudet,S. de Raucourt, P. Zweifel, S. Calcutt, M. Bierwirth, K. J. Hurst, F. Ijpelaan,J. W. Umland, R. Llorca-Cejudo, S. A. Larson, R. F. Garcia, S. Kedar,B. Knapmeyer-Endrun, D. Mimoun, A. Mocquet, M. P. Panning, R. C. Weber,A. Sylvestre-Baron, G. Pont, N. Verdier, L. Kerjean, L. J. Facto, V. Gharaka-nian, J. E. Feldman, T. L. Hoffman, D. B. Klein, K. Klein, N. P. Onufer, J. Paredes-Garcia, M. P. Petkov, J. R. Willis, S. E. Smrekar, M. Drilleau, T. Gabsi, T. Nebut,O. Robert, S. Tillier, C. Moreau, M. Parise, G. Aveni, S. Ben Charef, Y. Bennour,T. Camus, P. A. Dandonneau, C. Desfoux, B. Lecomte, O. Pot, P. Revuz, D. Mance,
ULLY PARALLEL MESH I/O USING PETSC DMPLEX J. TenPierick, N. E. Bowles, C. Charalambous, A. K. Delahunty, J. Hurley, R. Ir-shad, H. Liu, A. G. Mukherjee, I. M. Standley, A. E. Stott, J. Temple, T. Warren,M. Eberhardt, A. Kramer, W. K¨uhne, E.-P. Miettinen, M. Monecke, C. Aicardi,M. Andr´e, J. Baroukh, A. Borrien, A. Bouisset, P. Boutte, K. Brethom´e, C. Brys-baert, T. Carlier, M. Deleuze, J. M. Desmarres, D. Dilhan, C. Doucet, D. Faye,N. Faye-Refalo, R. Gonzalez, C. Imbert, C. Larigauderie, E. Locatelli, L. Luno, J.-R. Meyer, F. Mialhe, J. M. Mouret, M. Nonon, Y. Pahn, A. Paillet, P. Pasquier,G. Perez, R. Perez, L. Perrin, B. Pouilloux, A. Rosak, I. Savin de Larclause,J. Sicre, M. Sodki, N. Toulemont, B. Vella, C. Yana, F. Alibay, O. M. Avalos,M. A. Balzer, P. Bhandari, E. Blanco, B. D. Bone, J. C. Bousman, P. Bruneau,F. J. Calef, R. J. Calvet, S. A. D’Agostino, G. de los Santos, R. G. Deen, R. W.Denise, J. Ervin, N. W. Ferraro, H. E. Gengl, F. Grinblat, D. Hernandez, M. Het-zel, M. E. Johnson, L. Khachikyan, J. Y. Lin, S. M. Madzunkov, S. L. Marshall,I. G. Mikellides, E. A. Miller, W. Raff, J. E. Singer, C. M. Sunday, J. F. Vil-lalvazo, M. C. Wallace, D. Banfield, J. A. Rodriguez-Manfredi, C. T. Russell,A. Trebi-Ollennu, J. N. Maki, ´E. Beucler, M. B¨ose, C. Bonjour, J. L. Berenguer,S. Ceylan, J. F. Clinton, V. Conejero, I. J. Daubar, V. Dehant, P. Delage, F. Eu-chner, I. Est`eve, L. Fayon, L. Ferraioli, C. L. Johnson, J. Gagnepain-Beyneix,M. Golombek, A. Khan, T. Kawamura, B. Kenda, P. Labrot, N. Murdoch, C. Pardo,C. Perrin, L. Pou, A. Sauron, D. Savoie, S. C. St¨ahler, ´E. Stutzmann, N. A. Teanby,J. Tromp, M. van Driel, M. A. Wieczorek, R. Widmer-Schnidrig, and J. Wookey , SEIS: Insight’s seismic experiment for internal structure of Mars , Space Sci. Rev., 215(2019), p. 12, https://doi.org/10.1007/s11214-018-0574-6.[40]
Message Passing Interface Forum , MPI: A Message-Passing Interface Standard. Version3.1
T. Nissen-Meyer, A. Fournier, and F. A. Dahlen , A two-dimensional spectral-elementmethod for computing spherical-earth seismograms - II. Waves in solid-fluid media. , Geo-phys. J. Int., 174 (2008), pp. 873–888.[42]
Oracle and Intel Corporation , Lustre Software Release 2.x - Operations Manual , 2017,http://doc.lustre.org/lustre manual.pdf (accessed 2020-03-20).[43]
A. T. Patera , A spectral element method for fluid dynamics: Laminar flow in a channelexpansion , Journal of Computational Physics, 54 (1984), pp. 468–488.[44]
D. Peter, D. Komatitsch, Y. Luo, R. Martin, N. Le Goff, E. Casarotti, P. Le Loher,F. Magnoni, Q. Liu, C. Blitz, T. Nissen-Meyer, P. Basini, and J. Tromp , Forward andadjoint simulations of seismic wave propagation on fully unstructured hexahedral meshes ,Geophys. J. Int., 186 (2011), pp. 721–739.[45]
F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. Mcrae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly , Firedrake: Automating the finiteelement method by composing abstractions , ACM Trans. Math. Softw., 43 (2016), https://doi.org/10.1145/2998441.[46]
J. Rudi, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. W. J. Staar, Y. Ineichen,C. Bekas, A. Curioni, and O. Ghattas , An extreme-scale implicit solver for complexPDEs: highly heterogeneous flow in earth’s mantle , in Proceedings of the InternationalConference for High Performance Computing, Networking, Storage and Analysis, SC ’15,Association for Computing Machinery, 2015, pp. 1–12, https://doi.org/10.1145/2807591.2807675.[47]
R. Schneiders , Octree-based hexahedral mesh generation , International Journal of Com-putational Geometry & Applications, 10 (2000), pp. 383–398, https://doi.org/10.1142/S021819590000022X.[48]
T. J. Tautges, C. Ernst, C. Stimpson, R. J. Meyers, and K. Merkley , MOAB: a mesh-oriented database , Tech. Report SAND2004-1592, Sandia National Laboratories, 2004,https://doi.org/10.2172/970174.[49]
M. van Driel, C. Boehm, L. Krischer, and M. Afanasiev , Accelerating numerical wave prop-agation using wavefield adapted meshes. Part I: forward and adjoint modelling , Geophysi-cal Journal International, 221 (2020), pp. 1580–1590, https://doi.org/10.1093/gji/ggaa058.[50]
XDMF – eXtensible Data Model and Format
O. Zienkiewicz, R. Taylor, and J. Zhu , eds.,