The PetscSF Scalable Communication Layer
Junchao Zhang∗, Jed Brown†, Satish Balay∗, Jacob Faibussowitsch‡, Matthew Knepley§, Oana Marin∗, Richard Tran Mills∗, Todd Munson∗, Barry F. Smith¶, Stefano Zampini‖

∗Argonne National Laboratory, Lemont, IL 60439 USA; †University of Colorado Boulder, Boulder, CO 80302 USA; ‡University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA; §University at Buffalo, Buffalo, NY 14260 USA; ¶Argonne Associate of Global Empire, LLC, Argonne National Laboratory, Lemont, IL 60439 USA; ‖King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Abstract—PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific Computation (PETSc), is being used to gradually replace the direct MPI calls in the PETSc library. PetscSF provides a simple application programming interface (API) for managing common communication patterns in scientific computations by using a star-forest graph representation. PetscSF supports several implementations whose selection is based on the characteristics of the application or the target architecture. An efficient and portable model for network and intra-node communication is essential for implementing large-scale applications. The Message Passing Interface, which has been the de facto standard for distributed memory systems, has developed into a large, complex API that does not yet provide high performance on the emerging heterogeneous CPU-GPU-based exascale systems. In this paper, we discuss the design of PetscSF, how it can overcome some difficulties of working with MPI on GPUs, and we demonstrate its performance, scalability, and novel features.
Index Terms—Communication, GPU, extreme-scale, MPI, PETSc
1 Introduction

Distributed memory computation is practical and scalable; all high-end supercomputers today are distributed memory systems. Orchestrating the communication between processes with separate memory spaces is an essential part of programming such systems. Programmers need interprocess communication to coordinate the processes' work, distribute data, determine data dependencies, and balance workloads. The Message Passing Interface (MPI) is the de facto standard for communication on distributed memory systems. Parallel applications and libraries in scientific computing are predominantly programmed in MPI. However, writing communication code directly with MPI primitives, especially in applications with irregular communication patterns, is difficult, time-consuming, and not the primary interest of application developers. Also, MPI has not yet fully adjusted to heterogeneous CPU-GPU-based systems, where avoiding expensive CPU-GPU synchronizations is crucial to achieving high performance.

Higher-level communication libraries tailored for specific families of applications can significantly reduce the programming burden relative to the direct use of MPI. Requirements for such libraries include an easy-to-understand communication model, scalability, and efficient implementations, while supporting a wide range of communication scenarios. When programming directly with MPI primitives, one faces a vast number of options, for example, MPI two-sided vs. one-sided communication, individual sends and receives, neighborhood operations, persistent or non-persistent interfaces, and more.

In [1] we discuss the plans and progress in adapting the Portable, Extensible Toolkit for Scientific Computation and Toolkit for Advanced Optimization [2] (PETSc) to CPU-GPU systems. This paper focuses specifically on the plans for managing the network and intra-node communication within PETSc. PETSc has traditionally used MPI calls directly in the source code.
PetscSF is the communication component we are implementing to gradually remove these direct MPI calls. Like all PETSc components, PetscSF is designed to be used both within the PETSc libraries and also by PETSc users. It is not a drop-in replacement for MPI. Though we focus our discussion on GPUs in this paper, PetscSF also supports CPU-based systems with high performance. Most of our experimental work has been done on the OLCF IBM/NVIDIA Summit system, which serves as a surrogate for the future exascale systems; however, as discussed in [1], the PetscSF design and work is focused on the emerging exascale systems.

PetscSF can perform communication on any MPI datatype and supports a rich set of operations. Recently, we consolidated PETSc's vector scatters and PetscSF by implementing the scatters using PetscSF. We now have a single communication component in PETSc, thus making code maintenance easier. We added various optimizations to PetscSF, provided multiple implementations, and, most importantly, added GPU support. Notably, we added support for NVIDIA NVSHMEM [3] to provide an MPI alternative for communication on CUDA devices.

The paper is organized as follows. Section 2 discusses related work on abstracting communication on distributed memory systems. Section 3 introduces PetscSF. In Section
4, we describe several examples of usage of PetscSF within PETSc itself. In Section 5, we detail the PetscSF implementations and optimizations. Section 6 gives some experimental results and shows PetscSF's performance and scalability. Section 7 concludes the paper with a summary and a look to future work.
2 Related Work
Because communication is such an integral part of distributed-memory programming, many regard the communication model as the programming model. We compare alternative distributed memory programming models against the predominantly used MPI.

In its original (and most commonly used) form, MPI uses a peer-to-peer communication model, where both the sender and the receiver are explicitly involved in the communication. Programmers directly using MPI must manage these relationships, in addition to allocating staging buffers, determining which data to send, and finally packing and/or unpacking data as needed. These tasks can be difficult when information about which data is shared with which processes is not explicitly known. Significant effort has been made to overcome this drawback, with mixed results.

The High Performance Fortran (HPF) [4] project allowed users to write data distribution directives for their arrays and then planned for the compilers to determine the needed communication. HPF failed because compilers, even today, are not powerful enough to do this with indirectly indexed arrays.

Several Partitioned Global Address Space (PGAS) languages were developed, such as UPC [5], Co-array Fortran [6], Chapel [7], and OpenSHMEM [8]. They provide users an illusion of shared memory. Users can dereference global pointers, access global arrays, or put/get remote data without the remote side's explicit participation. Motivated by these ideas, MPI added one-sided communication in MPI-2.0 and further enhanced it in MPI-3.0. However, the PGAS model has had limited success. Codes using such models are error-prone, since shared-memory programming easily leads to data races and deadlocks. These are difficult to detect, debug, and fix.
Without great care, PGAS applications are prone to low performance, since programmers can easily write fine-grained communication code, which severely hurts performance on distributed memory systems. Because of this, writing a correct and efficient simulation that requires communication using PGAS languages is not fundamentally easier than programming in MPI.

While MPI has excellent support for writing libraries and many MPI-based libraries target specific domains, few communication libraries are built using MPI. We surmise the reason for this is that the data to be communicated is usually embedded in the user's data structure, and there is no agreed-upon interface to describe the user's data connectivity graph. Zoltan [9] is one of the few such libraries that is built on MPI. Zoltan is a collection of data management services for parallel, unstructured, adaptive, and dynamic applications. Its unstructured communication service and data migration tools are close to those of PetscSF. Users need to provide pack and unpack callback functions to Zoltan. Not only does PetscSF pack and unpack automatically, it is also capable of performing reductions when unpacking. PetscSF supports composing communication graphs; see Section 3.3.

Fig. 1: Examples of stars, the union of which forms a star forest. Root vertices are identified with circles and leaves with rectangles. Note that roots with no leaves are allowed, as well as leaves with no root.

The gs library [10] used by Nek5000 [11] gathers and scatters vector entries. It is similar to PETSc's scatters, but it is slightly more general and supports communicating multiple data types. DIY [12] is a block-parallel communication library for implementing scalable algorithms that can execute both in-core and out-of-core. Using a block parallelism abstraction, it combines distributed-memory message passing with shared-memory thread parallelism.
Complex communication patterns can be built on a graph of blocks, usually coarse-grained, whereas in PetscSF, graph vertices can be as small as a single floating-point number. We are unaware of any of these libraries supporting GPUs.

Since distributed-memory communication systems are often conflated with programming models, we should clarify that tools such as Kokkos [13] and Raja [14] provide functionality orthogonal to communication systems. For example, in [1], we provide a prototype code that uses PetscSF for the communication and Kokkos as the portable programming model. SYCL [15] automatically manages needed communication between GPU and CPU but does not have general communication abilities. Thus, it also requires a distributed memory communication system.

3 The PetscSF Interface
PetscSF has a simple conceptual model with a small, but powerful, interface. We begin by introducing how to create PetscSF objects and their primary use cases.
PetscSF supports both structured and irregular communication graphs, with the latter being the focus of this paper. A star forest is used to describe the communication graphs. Stars are simple trees consisting of one root vertex connected to zero or more leaves. The number of leaves is called the degree of the root. We also allow leaves without connected roots. These isolated leaves represent holes in the user's data structure that do not participate in the communication. Fig. 1 shows some example stars. A union of disjoint stars is called a star forest (SF). PetscSFs are typically partitioned across multiple processes. In graph terminology, a PetscSF object can be regarded as being similar to a quotient graph [16], which embeds the closely connected subpartitions of a larger graph, the larger graph being the abstraction of a domain topological representation.

Following PETSc's object creation convention, one creates a PetscSF with
PetscSFCreate(MPI_Comm comm, PetscSF *sf), where comm specifies the MPI communicator the PetscSF lives on. One then describes the graph by calling PetscSFSetGraph() on each process.

typedef struct {PetscInt rank, offset;} PetscSFNode;

PetscErrorCode PetscSFSetGraph(PetscSF sf, PetscInt nroots,
  PetscInt nleaves, const PetscInt *local,
  PetscCopyMode localmode, const PetscSFNode *remote,
  PetscCopyMode remotemode);
This call indicates that this process owns nroots roots and nleaves connected leaves. The roots are numbered from 0 to nroots-1. local[0..nleaves] contains local indices of those connected leaves. If NULL is passed for local, the leaf space is contiguous. Addresses of the roots for each leaf are specified by remote[0..nleaves]. Leaves can connect to local or remote roots; therefore we use tuples (rank, offset) to represent root addresses, where rank is the MPI rank of the owner process of a root and offset is the index of the root on that process. Fig. 2 shows a PetscSF
Fig. 2: Distributed star forest partitioned across three processes, with the specification arrays at right. Roots (leaves) on each rank are numbered top-down, with local indices starting from zero. Note the isolated vertices on rank 1.

The figure denotes data at the vertices and lists, at right, the local and remote arguments passed to
PetscSFSetGraph(). The edges of a PetscSF are specified by the process that owns the leaves; therefore a given root vertex is merely a candidate for incoming edges. This one-sided specification is important to encode graphs containing roots with a very high degree, such as globally coupled constraints, in a scalable way.
PetscSFSetUp(PetscSF sf) sets up internal data structures for its implementations and checks for all possible optimizations. The setup cost is usually amortized by multiple operations performed on the PetscSF. All PetscSF routines return an error code; we omit it in their prototypes for brevity.
In the following, we demonstrate operations that update data on vertices. All operations are split into matching Begin and End phases, taking the same arguments. Users can insert independent computations between the calls to overlap computation and communication. Data buffers must not be altered between the two phases, and the content of the result buffer can be accessed only after the matching End operation has been posted.

(1) Broadcast root values to leaves:
PetscSFBcastBegin/End(PetscSF sf, MPI_Datatype unit,
  const void *rootdata, void *leafdata, MPI_Op op);
The unit is the datatype of the tree vertices. Any MPI datatype can be used. The pointers rootdata and leafdata reference the user's data structures, organized as arrays of units. The term op designates an MPI reduction, such as MPI_SUM, which tells PetscSF to add root values to their leaves' previous values. As with other PetscSF operations, we currently support only MPI built-in reductions. When op is MPI_REPLACE, this operation overwrites leaf data with their root values.

One can instantiate simultaneous PetscSFBcast or other PetscSF operations on the same PetscSF with different data and a different or the same unit. A PetscSF provides only a template for communication; one can instantiate it on any data.

(2) Reduce leaf values to roots: To reverse PetscSFBcast, one uses
PetscSFReduceBegin/End(PetscSF sf, MPI_Datatype unit,
  const void *leafdata, void *rootdata, MPI_Op op);
It reduces leaf values into their roots' old values using the given reduction op. MPI_REPLACE can be used to overwrite root data.

(3) Fetch and operate on roots: An important synchronization and setup primitive is atomically fetching a value and updating it. PetscSF provides this functionality through the following:
PetscSFFetchAndOpBegin/End(PetscSF sf, MPI_Datatype unit,
  void *rootdata, const void *leafdata, void *leafupdate,
  MPI_Op op);
Like PetscSFReduce, it reduces leaf values to roots but does a bit more. Say a root has n leaves, leaf_0..leaf_{n-1}, whose values are "added" to the root in n reductions in a nondeterministic order. When updating using leaf_i, it first fetches the current, partially reduced value of the root before adding leaf_i's value. The fetched value is stored at leaf_i's position in the array leafupdate, which can be thought of as a mirror of leafdata. Atomicity is guaranteed only within base-type granularity (e.g., a machine word). In our experience, the most commonly used variant is fetch-and-add on integers, which is used to simultaneously count contributions from leaves (found in the final value in rootdata) and determine offsets to which to send a second round of communication (found in leafupdate).

Occasionally, one wants to compose new PetscSFs from existing ones.

(1) Multi-SF/Gather/Scatter:
The so-called multi-SF is accessible by the following.
PetscSFGetMultiSF(PetscSF sf, PetscSF *multi);
A multi-SF is used to gather (vs. reduce) all leaf values at roots in a deterministic way. It is derived from an initial sf where the original roots are replaced by as many new roots as their degree, with a one-to-one mapping from the new roots to their old leaves. Effectively, roots with zero degree in sf disappear, and every root has a degree of one. With the multi-SF concept, we can define gather and scatter operations as follows.

PetscSFGatherBegin/End(PetscSF sf, MPI_Datatype unit,
  const void *leafdata, void *multirootdata);
PetscSFScatterBegin/End(PetscSF sf, MPI_Datatype unit,
  const void *multirootdata, void *leafdata);
The two operations access the internal multi-SF representation of sf. SFGather gathers leaf values from leafdata and stores them at multirootdata, which is an array containing m units, where m is the number of roots of the multi-SF, given by the sum of the roots' degrees of sf. SFScatter reverses the operation of
SFGather.

(2) Compose: Suppose there are two PetscSFs, A and B, and A's leaves partially overlap with B's roots. One can call

PetscSFCompose(PetscSF A, PetscSF B, PetscSF *AB);

to compose a new PetscSF AB, whose roots are A's roots and whose leaves are B's leaves. A root and a leaf of AB are connected if there is a leaf in A and a root in B that bridge them, or, in other words, if the leaf and the root are mutually reachable. If A's leaves completely overlap with B's leaves and B's roots all have a degree of at most one, then the result of the following composition is also well defined.

PetscSFComposeInverse(PetscSF A, PetscSF B, PetscSF *AB);

AB is a new PetscSF with A's roots, its leaves being B's roots, and edges built upon a reachability definition similar to that in PetscSFCompose.

(3) Embedding: PetscSF allows removal of existing vertices and their associated edges, a common use case in scientific computations, for example, in field segregation in multiphysics applications and in subgraph extraction. The API
PetscSFCreateEmbeddedRootSF(PetscSF sf, PetscInt n,
  const PetscInt selected_roots[], PetscSF *esf);
PetscSFCreateEmbeddedLeafSF(PetscSF sf, PetscInt n,
  const PetscInt selected_leaves[], PetscSF *esf);

removes edges from all but the selected roots/leaves without remapping indices and returns a new PetscSF that can be used to perform the subcommunication using the original root/leaf data buffers.

4 Use Cases
Although PetscSF has many uses, we describe here only a small subset of applications of PetscSF to other PETSc operations, namely, parallel matrix and unstructured grid operations.
Sparse matrix-vector products (SpMV)
The standard parallel sparse matrix implementation in PETSc distributes a matrix by rows. On each MPI rank, a block of consecutive rows makes up a local matrix. The parallel vectors that can be multiplied by the matrix are distributed by rows in a conforming manner. On each rank, PETSc splits the local matrix into two blocks: a diagonal block A, whose columns match the vector's rows on the current rank, and an off-diagonal block B for the remaining columns; see Fig. 3. A and B are separately encoded in the compressed sparse row (CSR) format with reduced local column indices. Then, the SpMV in PETSc is decomposed into y = Ax + Bx. Ax depends solely on local vector data, while Bx requires access to remote entries of x that correspond to the nonzero columns of B. PETSc uses a sequential vector lvec to hold these needed ghost entries. The length of lvec is equal to the number of nonzero columns of B, and its entries are ordered in their corresponding column order; see Fig. 3.
Fig. 3: A PETSc matrix on three ranks. On rank 1 the diagonal block A is in green and the off-diagonal block B is in blue. The x in B denotes nonzeros, while the * in x denotes remote vector entries needed to compute Bx.

We build a PetscSF for the communication needed in Bx as described below. On each rank, there are n roots, where n is the local size of x, and m leaves, where m is the size of the vector lvec. The leaves have contiguous indices running from 0 to m-1, each connected to a root representing a global column index. With the matrix layout information available on each rank, we calculate the owner rank and local index of a global column on its owner and determine the PetscSFNode argument of SFSetGraph() introduced in Section 3.1. The SpMV can be implemented as in the listing below, which overlaps the local computation y = Ax with the communication.

// x_d, lvec_d are the data arrays of x, lvec respectively
PetscSFBcastBegin(sf,MPI_DOUBLE,x_d,lvec_d,MPI_REPLACE);
y = A*x;
PetscSFBcastEnd(sf,MPI_DOUBLE,x_d,lvec_d,MPI_REPLACE);
y += B*lvec;

The transpose multiply is implemented as follows.

y = A^T*x;
lvec = B^T*x;
PetscSFReduceBegin(sf,MPI_DOUBLE,lvec_d,y_d,MPI_SUM);
PetscSFReduceEnd(sf,MPI_DOUBLE,lvec_d,y_d,MPI_SUM);
Note that lvec = B^T x computes the local contributions to some remote entries of y. These must be added back to their owner ranks.

Extracting a submatrix from a sparse matrix
The routine MatGetSubMatrix(M,ir,ic,c,P) extracts a parallel submatrix P on the same number of ranks as the original matrix M. The sets of global indices ir and ic provide, respectively, the rows the local rank should obtain and the columns that should be in its diagonal block in the submatrix on this rank. The difficult part of this routine is determining, for each rank, which columns are needed (by any rank) from the owned block of rows.

Given two SFs, sfA, which maps reduced local column indices (leaves) to global columns (roots) of the original matrix M, and sfB, which maps "owned" columns (leaves) of the submatrix P to global columns (roots) of M, we determine the retained local columns as follows:

1) SFReduce using sfB results in a distributed array containing the columns (in the form of global indices) of P that columns of M will be replicated into. Unneeded columns are tagged by a negative index.
2) SFBcast using sfA gets these values in the reduced local column space.

The algorithm prepares the requested rows of M in non-split form and distributes them to their new owners, which then split them using a row-based PetscSF. The
DMPlex [17], [18] component in PETSc provides unstructured mesh management using a topological abstraction known as a Hasse diagram [19], a directed acyclic graph representation of a partially ordered set. The abstract mesh representation consists of points, corresponding to vertices, edges, faces, and cells, which are connected by arrows indicating the topological adjacency relation. This model is dimension independent; it can represent meshes with different cell types, including cells of different dimensions.

Parallel topology is defined by using both a DMPlex and a PetscSF. The PetscSF is used to identify shared mesh points that define the partition of the topology among processes. The determination of which process owns the root for a shared point impacts load balance but not the parallel mesh definition.

In addition to defining the parallel topology, PetscSFs represent the communication patterns needed for common mesh operations, such as partitioning and redistribution, ghost exchange, and assembly of discretized functions and operators. To generate these multiple PetscSFs, we use the
PetscSection object, which assigns a set of degrees of freedom (dofs) to each mesh point. For example, if an initial PetscSF connects mesh points, then using a PetscSection to map mesh points to dofs generates a new PetscSF relating the dofs on different processes. As another example, a PetscSF connecting dofs on different processes, together with a PetscSection mapping dofs to adjacent dofs, can produce the PetscSF for the Jacobian, which is described in detail in [20].
Mesh distribution, in response to a partition, is handled by creating several PetscSF objects representing the steps needed to perform the distribution. Based on the partition, we make a PetscSF whose roots are the original mesh points and whose leaves are the redistributed mesh points, so that PetscSFBcast redistributes the points. Next, the section describing the topology and the original PetscSF build a PetscSF to distribute the topological relation. Then, the sections describing the data layout for coordinates and other mesh fields are used to build PetscSFs that redistribute data over the mesh.
Ghost communication and assembly are also managed by a PetscSF. Starting with the PetscSF defining the parallel mesh, we use the PetscSection defining the field layout to construct a PetscSF identifying dofs across processes. The PETSc routines DMGlobalToLocal() and DMLocalToGlobal() use this PetscSF to communicate data from the global vector to the local vector with ghost points. Moreover, this communication pattern is reused to perform assembly as part of finite element and finite volume discretizations. The PetscFE and PetscFV objects can be used to automatically create the PetscSection objects for these data layouts, which together with the mesh point PetscSF automatically create the assembly PetscSF. This style of programming, with its emphasis on the declarative specification of data layout and communication patterns, automatically generating the actual communication and data manipulation routines, has resulted in more robust, extensible, and performant code.
5 Implementations
PetscSF has multiple implementations, including ones that utilize MPI one-sided or two-sided communication. We focus on the default implementation, which uses persistent MPI sends and receives for two-sided communication. In addition, we emphasize the implementations for GPUs.
From the input arguments of PetscSFSetGraph, each MPI rank knows which ranks (i.e., root ranks) have roots that the current rank's leaves are connected to. In PetscSFSetUp, we compute the reverse information: which ranks (i.e., leaf ranks) have leaves of the current rank's roots. A typical algorithm uses MPI_Allreduce on an integer array of the MPI communicator's size, which is robust but non-scalable. A more scalable algorithm uses MPI_Ibarrier [21], which is PETSc's default with large communicators. In the end, on each rank, we compile the following information:

1) A list of root ranks, connected by leaves on this rank;
2) For each root rank, a list of leaf indices representing leaves on this rank that connect to the root rank;
3) A list of leaf ranks, connected by roots on this rank;
4) For each leaf rank, a list of root indices representing roots on this rank that connect to the leaf rank.

These data structures facilitate message coalescing, which is crucial for performance on distributed memory. Note that processes usually play double roles: they can be both a root rank and a leaf rank.
With the two-sided information, we could have a simple implementation, using
PetscSFBcast(sf, MPI_DOUBLE, rootdata, leafdata, op) as an example (Section 3.2). Each rank, as a sender, allocates a root buffer, rootbuf, packs the needed root data entries into the buffer according to the root indices (rootidx) obtained above, in a form such as rootbuf[i] = rootdata[rootidx[i]], and then sends the data in rootbuf to the leaf ranks. Similarly, each receiver rank allocates a leaf buffer, leafbuf, as the receive buffer. Once it has received data, it unpacks the data from leafbuf and deposits it into the destination leafdata entries according to the leaf indices (leafidx) obtained above, in a form such as leafdata[leafidx[i]] ⊕= leafbuf[i]. Here ⊕= represents op.

However, PetscSF has several optimizations to lower or eliminate the packing (unpacking) overhead. First, we separate local (i.e., self-to-self) and remote communication. If on a process the PetscSF has local edges, then the process will show up in its own leaf and root rank lists. We rearrange the lists to move self to the head if that is the case. By skipping MPI for local communication, we save intermediate send and receive buffers as well as pack and unpack calls. Local communication takes the form leafdata[leafidx[i]] ⊕= rootdata[rootidx[i]]. We call this a scatter operation. This optimization is important in mesh repartitioning, since most cells tend to stay on their current owner and hence the local communication volume is large.

Second, we analyze the vertex indices and discover patterns that can be used to construct pack/unpack routines with fewer indirections. The most straightforward pattern is contiguous indices. In that case, we can use the user-supplied rootdata/leafdata as the MPI buffers without any packing or unpacking. An important application of this optimization is in SpMV and its transpose introduced in Section 4.1, where the leaves, the entries in lvec, are contiguous. Thus, lvec's data array can be directly used as the MPI receive buffer in the PetscSFBcast of PETSc's SpMV or as the MPI send buffer in the
PetscSFReduce of its transpose product. Note that, in general, we also need to consider the
MPI_Op involved. If it is not an assignment, then we must allocate a receive buffer before adding it to the final destination.

Another pattern is multi-strided indices, inspired by halo exchange in stencil computations on regular domains. In that case, ghost points living on faces of (sub)domains are either locally strided or contiguous. Suppose we have a three-dimensional domain of size [X,Y,Z] with points sequentially numbered in the x, y, z order. Also, suppose that within the domain there is a subdomain of size [dx,dy,dz] with the index of the first point being start. Then, indices of points in the subdomain can be enumerated with the expression start+X*Y*k+X*j+i, for (i,j,k) in (0 ≤ i < dx, 0 ≤ j < dy, 0 ≤ k < dz).

On GPUs, duplicated indices arise in MatMultTranspose (see Section 4.1): a single entry of the result vector y will likely receive contributions from multiple leaf ranks. In this case, we use CUDA atomics, whereas we use regular CUDA instructions when no duplicated indices are present.

The PetscSF APIs introduced in Section 3 do not have stream or memory type arguments. Internally we call cudaPointerGetAttributes() to distinguish memory spaces. However, since this operation is expensive (around 0.3 µs per call in our experience), we extended PetscSF APIs such as the following:

PetscSFBcastWithMemTypeBegin/End(PetscSF sf, MPI_Datatype unit,
  PetscMemType rootmtype, const void *rootdata,
  PetscMemType leafmtype, void *leafdata, MPI_Op op);

The extra PetscMemType arguments tell PetscSF where the data is. PETSc vectors have such built-in information, so PETSc vector scatters internally use these extended APIs.

We could further extend the APIs to include stream arguments. However, since stream arguments, like C/C++ constness, are so intrusive, we might have to extend many APIs to pass around the stream. We thus take another approach. PETSc has a default stream named PetscDefaultCudaStream.
Almost all PETSc kernels work on this stream. When PETSc calls libraries such as cuBLAS and cuSPARSE, it sets their work stream to this default one. Although PetscSF assumes that data is on the default stream, it does provide options for users to indicate that data is on other streams, so that PETSc will take stricter synchronizations.

Fig. 4 shows a diagram of the CUDA-aware MPI support in PetscSF, using PetscSFBcast as an example. The code sequence is similar to what we have for host communication, except that the pack, unpack, and scatter operations are CUDA kernels, and there is a stream synchronization before MPI_Isend, for the reasons discussed above. If input data is not on PETSc's default stream, we call cudaDeviceSynchronize() in the beginning phase to sync the whole device and cudaStreamSynchronize() in the end phase to sync the default stream, so that output data is ready to be accessed afterward.

s1 = PetscDefaultCudaStream
SFBcastBegin:     MPI_Irecv(leafbuf,..);
                  Pack<<<..,s1>>>(rootdata,rootbuf,..);
                  cudaStreamSynchronize(s1);
                  MPI_Isend(rootbuf,..);
                  Scatter<<<..,s1>>>(rootdata,leafdata,..);
Comp. in between: UserKernel<<<..,s1>>>();
SFBcastEnd:       MPI_Waitall();
                  Unpack<<<..,s1>>>(leafdata,leafbuf,..);

Fig. 4: CUDA-aware MPI support in PetscSFBcast. Scatter denotes the local communication. Note the cudaStreamSynchronize() before MPI_Isend.

CUDA kernel launches have a cost of around 10 µs, which is not negligible considering that many kernels in practice have a smaller execution time. An important optimization with GPU asynchronous computation is to overlap kernel launches with kernel executions on the GPU, so that the launch cost is effectively hidden. However, the mandatory CUDA synchronization brought by MPI can jeopardize this optimization, since it blocks the otherwise nonblocking kernel launches on the host. See Fig. 5 for an example. Suppose a kernel launch takes 10 µs and there are three kernels A, B, and C that take 20, 5, and 5 µs to execute, respectively.
If the kernels are launched in a nonblocking way (Fig. 5, left), the total cost to run them is 40 µs; the launch costs of B and C are completely hidden by the execution of A. However, if there is a synchronization after A (Fig. 5, right), the total cost is 55 µs. Scientific codes usually have many MPI calls, implying that their kernel launches will be frequently blocked by CUDA synchronizations. While the MPI community is exploring adding stream support to the MPI standard, we recently tried NVSHMEM as a remedy.

Fig. 5: Pipelined kernel launches (left) vs. interrupted kernel launches (right).

NVSHMEM [3] is NVIDIA's implementation of the OpenSHMEM [8] specification on CUDA devices. It supports point-to-point and collective communication between GPUs within a node or over networks. Communication can be initiated either on the host or on the device. Host-side APIs take a stream argument. NVSHMEM is a PGAS library that provides put/get APIs for processes to access remote data. While using get is possible, we focus on put-based communication. In NVSHMEM, a process is called a processing element (PE). A set of PEs is a team. The team containing all PEs is called NVSHMEM_TEAM_WORLD. PEs can query their rank in a team and the team's size. One PE can use only one GPU. These concepts are analogous to ranks, communicators, and MPI_COMM_WORLD in MPI. NVSHMEM can be used together with MPI, and it is natural to map one MPI rank to one PE; we use PEs and MPI ranks interchangeably. With this approach, we are poised to bypass MPI for communication on GPUs while keeping the rest of the PETSc code unchanged.

For a PE to access remote data, NVSHMEM uses the concept of symmetric data objects. At NVSHMEM initialization, all PEs allocate a block of CUDA memory called a symmetric heap.
Afterwards, every remotely accessible object has to be collectively allocated and freed by all PEs in NVSHMEM_TEAM_WORLD, with the same size, so that such an object always appears symmetrically on all PEs, at the same offset in their symmetric heap. PEs access remote data by referencing a symmetric address and the rank of the remote PE. A symmetric address is the address of a symmetric object on the local PE, plus an offset if needed. The code below allocates two symmetric double arrays, src[1] and dst[2], and every PE puts a double from its src[0] to the next PE's dst[1].

double *src = nvshmem_malloc(sizeof(double));
double *dst = nvshmem_malloc(sizeof(double)*2);
int pe = nvshmem_team_my_pe(team);    // get my rank in the team
int size = nvshmem_team_n_pes(team);  // team size
int next = (pe+1)%size;               // my right neighbor
nvshmemx_double_put_on_stream(&dst[1],src,1,next,stream);

For PEs to know the arrival of data put by other PEs and then read it, they can call the collective nvshmem_barrier(team) to separate the put and the read, or senders can send signals to receivers to check. Signals are merely symmetric objects of type uint64_t. We prefer the latter approach, since collectives are unfit for the sparse neighborhood communication that is important to PETSc. Because of the collective memory-allocation constraint, we support NVSHMEM only on PetscSFs built on PETSC_COMM_WORLD, which is the MPI communicator used to initialize PETSc and NVSHMEM.

Fig. 6 gives a skeleton of the NVSHMEM support in PetscSF, again using PetscSFBcast as an example. We create a new stream, RemoteCommStream (s2), to take charge of remote communication, so that communication and the user's computation, denoted by UserKernel, can be overlapped. First, on PetscDefaultCudaStream (s1), we record a CUDA event SbufReady right after the pack kernel to indicate that data in the send buffer is ready to send. Before sending, stream s2 waits for the event so that the send-after-pack dependence is enforced. Then PEs put data and end-of-put signals to destination PEs.
To ensure that signals are delivered after data, we execute an NVSHMEM memory fence at the local PE before putting the signals. In the end phase, PEs wait until the end-of-put signals targeting them have arrived (e.g., through nvshmem_uint64_wait_until_all()). They then record an event CommEnd on s2 indicating the end of communication. PEs only need to wait for that event on s1 before unpacking data from the receive buffer. Note that all function calls in Fig. 6 are asynchronous.

Fig. 6: Stream-aware NVSHMEM support in PetscSFBcast, with s1 = PetscDefaultCudaStream and s2 = RemoteCommStream. SFBcastBegin: Pack<<<..,s1>>>(rootdata,rootbuf,..) and cudaEventRecord(SbufReady,s1), followed by Scatter<<<..,s1>>>(rootdata,leafdata,..) on s1; on s2, cudaStreamWaitEvent(s2,SbufReady), PutData(..,remote_leafbuf,s2), and FenceAndPutSignal(..,remote_signal,s2). SFBcastEnd: WaitSignal(..,s2) and cudaEventRecord(CommEnd,s2) on s2; cudaStreamWaitEvent(s1,CommEnd) and Unpack<<<..,s1>>>(leafdata,leafbuf,..) on s1. Blue boxes are in the beginning phase, and green boxes are in the end phase; the red box in between is the user code. Dashed lines represent data dependence between streams. Functions are ordered vertically and called asynchronously.

NVSHMEM provides users a mechanism to distinguish locally accessible and remotely accessible PEs. One can roughly think of the former as intranode PEs and the latter as internode PEs. We take advantage of this and use different NVSHMEM APIs when putting data: we call the host API nvshmemx_putmem_nbi_on_stream() once for each local PE, and launch the device API nvshmem_putmem_nbi() on CUDA threads, with each thread targeting a remote PE. For local PEs, the host API uses the efficient CUDA device copy engines to do GPUDirect peer-to-peer memory copies, while for remote PEs it uses the slower GPUDirect RDMA.

We now detail how we set up the send and receive buffers for NVSHMEM. In the CUDA-aware MPI support of PetscSF, we generally need to allocate on each rank two CUDA buffers, rootbuf and leafbuf, to function as MPI send or receive buffers.
Now we must allocate them symmetrically to make them accessible to NVSHMEM. To that end, we call an MPI_Allreduce to get their maximal size over the communicator and use the result in nvshmem_malloc(). As usual, leafbuf is logically split into chunks of various sizes (see Fig. 7). Each chunk in a PetscSFBcast operation is used as a receive buffer for an associated root rank. Besides leafbuf, we allocate a symmetric object leafRecvSig[], an array of end-of-put signals with each entry associated with a root rank. In one-sided programming, a root rank has to know the associated chunk's offset in leafbuf to put the data, and also the associated signal's offset in leafRecvSig[] to set the signal. The preceding explanations apply similarly to rootbuf and leaf ranks. During PetscSFSetUp, we use MPI two-sided communication to assemble the information needed for NVSHMEM one-sided communication. At the end of the setup, on each rank, we have the following new data structures in addition to those introduced in Section 5.1:

1) A list of offsets, each associated with a leaf rank, showing where the local rank should send (put) its rootbuf data into these leaf ranks' leafbuf
2) A list of offsets, each associated with a leaf rank, showing where the local rank should set signals on these leaf ranks
3) A list of offsets, each associated with a root rank, showing where the local rank should send (put) its leafbuf data into these root ranks' rootbuf
4) A list of offsets, each associated with a root rank, showing where the local rank should set signals on these root ranks

Fig. 7: Right: two symmetric objects (leafbuf and leafRecvSig) on a local rank. Left: remote root ranks putting data into the objects. Root ranks need to know the offsets on the remote side; the cloud indicates that these are remote accesses.

With these, we are almost ready to implement PetscSF with NVSHMEM. But a new complexity comes with one-sided communication. Suppose we do communication in a loop.
While receivers are still using their receive buffers, senders could move into the next iteration, put new data into the receivers' buffers, and corrupt them. To avoid this situation, we designed the protocol shown in Fig. 8 between a pair of PEs (sender and receiver). Besides the end-of-put signal (RecvSig) on the receiver side, we allocate an ok-to-put signal (SendSig) on the sender side. The sender must wait until SendSig is 0 before it begins putting the data. Once the receiver has unpacked data from its receive buffer, it puts 0 to the sender's SendSig to give it permission to put the next collection of data.

Fig. 8: Protocol of the PetscSF NVSHMEM put-based communication, with SendSig and RecvSig initially 0. Sender: pack data into the send buffer; wait until SendSig = 0, then set it to 1; put the data; nvshmem_fence; put 1 to RecvSig to end the put. Receiver: wait until RecvSig = 1, then set it to 0; unpack data from the receive buffer; put 0 to SendSig to allow the sender to put again. The dotted line indicates that the signal is observable only after the remote put of the data is completed at the destination.

We have rootSendSig, rootRecvSig, leafSendSig, and leafRecvSig symmetric objects on each PE, all with initial value 0.

EXPERIMENTAL RESULTS

We evaluated PetscSF on the Oak Ridge Leadership Computing Facility (OLCF) Summit supercomputer as a surrogate for the upcoming exascale computers. Each node of Summit has two sockets, each with an IBM Power9 CPU accompanied by three NVIDIA Volta V100 GPUs. Each CPU and its three GPUs are connected by the NVLink interconnect at a bandwidth of 50 GB/s. Communication between the two CPUs is provided by IBM's X-Bus at a bandwidth of 64 GB/s. Each CPU also connects through a PCIe Gen4 x8 bus to a Mellanox InfiniBand network interface card (NIC) with a bandwidth of 16 GB/s. The NIC has an injection bandwidth of 25 GB/s. On the software side, we used gcc 6.4 and the CUDA-aware IBM Spectrum MPI 10.3.
We also used NVIDIA CUDA Toolkit 10.2.89, NVSHMEM 2.0.3, NCCL 2.7.8 (the NVIDIA Collective Communication Library, which implements multi-GPU and multi-node collectives for NVIDIA GPUs), and GDRCopy 2.0 (a low-latency GPU memory copy library); NVSHMEM needs the latter two.

To determine the overhead of PetscSF, we wrote a ping-pong test, sf_pingpong, to compare PetscSF performance against raw MPI performance. The test uses two MPI ranks and a PetscSF with n roots on rank 0 and n leaves on rank 1. The leaves are connected to the roots consecutively. With this PetscSF and op = MPI_REPLACE, a PetscSFBcast sends a message from rank 0 to rank 1, and a following SFReduce bounces the message back. By varying n, we can measure the latency of various message sizes, mimicking the osu_latency test from the OSU microbenchmarks [23]. By comparing the performance attained by the two tests, we can determine the overhead of PetscSF.

We measured PetscSF MPI latency with either host-to-host (H-H) or device-to-device (D-D) messages. By device messages, we mean regular CUDA memory data, not NVSHMEM symmetric memory data. Table 1 shows the intra-socket results, where the two MPI ranks were bound to the same CPU and used two GPUs associated with that CPU. The roots and leaves in this PetscSF are contiguous, so PetscSF's optimization skips the pack/unpack operations. Thus this table compares a raw MPI code with a PetscSF code that embodies much richer semantics. Comparing the H-H latency, we see that PetscSF has an overhead of 0.6 to 1.0 µs, which is spent on checking input arguments and bookkeeping. The D-D latency is more interesting: PetscSF has an overhead of about 5 µs over the OSU test. We verified this was because PetscSF calls cudaStreamSynchronize() before sending data, whereas the OSU test does not. We must have this synchronization in actual application codes, as discussed before. We also performed inter-socket and inter-node tests.
The results (see [24]) indicated a similar gap between the PetscSF test and the OSU test.

TABLE 1: Intra-socket host-to-host (H-H) and device-to-device (D-D) latency (µs) measured by osu_latency (OSU) and sf_pingpong (SF), with IBM Spectrum MPI.

Msg (Bytes)  1K    4K    16K   64K   256K  1M    4M
OSU H-H      0.8   1.3   3.5   4.7   12.2  36.3  152.4
SF  H-H      1.5   1.9   4.2   5.5   12.9  37.3  151.8
OSU D-D      17.7  17.7  17.8  18.4  22.5  39.2  110.3
SF  D-D      22.8  23.0  22.9  23.5  27.7  46.3  111.8

We used the same test to compare PetscSF MPI and PetscSF NVSHMEM, with results shown in Fig. 9.

Fig. 9: Device-to-device (D-D) latency measured by sf_pingpong for the intra-socket, inter-socket, and inter-node cases, using IBM Spectrum MPI or NVSHMEM.

For small messages, NVSHMEM's latency is about 12 µs higher than MPI's in the intra-socket and inter-socket cases and 40 µs higher in the inter-node cases. For large inter-socket messages, the gap between the two is even larger (up to 60 µs at 4 MB). Possible reasons for the performance difference include the following. (1) In PetscSF NVSHMEM, there are extra memory copies between the rootdata/leafdata in CUDA memory and the root and leaf buffers in NVSHMEM memory. (2) The current NVSHMEM API has limitations. For example, we would like to use fewer kernels to implement the protocol in Fig. 8. Within a node, the NVSHMEM host API delivers much better performance than its device API, forcing us to do the signal-wait through the device API but the data-put through the host API. As another example, NVSHMEM provides a device API that does the data-put and signal-put in one call, but there is no host counterpart; one has to use two kernel launches for this task with the host API. All these extra kernel launches increase the latency. (3) NVSHMEM is still a new NVIDIA product, with much headroom to grow.
To explore distributed asynchronous execution on GPUs enabled by NVSHMEM, we adapted CG, the Krylov conjugate gradient solver in PETSc, to a prototype asynchronous version, CGAsync. The key differences between the two are as follows. (1) A handful of the PETSc routines they call differ, in two categories. The first comprises routines with scalar output parameters, for example the vector dot product: CG calls VecDot(Vec x, Vec y, double *a) with a pointing to a host buffer, while CGAsync calls VecDotAsync(Vec x, Vec y, double *a) with a referencing a device buffer. In VecDot, each process calls cuBLAS routines to compute a partial dot product and then copies it back to the host, where it calls MPI_Allreduce to get the final dot product and stores it in the host buffer; thus VecDot synchronizes the CPU and the GPU. In VecDotAsync, by contrast, once the partial dot product from cuBLAS is computed, each process calls an NVSHMEM reduction operation on PETSc's default stream to compute the final result and stores it in the device buffer. The second category comprises routines with scalar input parameters, such as VecAXPY(Vec y, double a, Vec x), which computes y += a*x: CG calls VecAXPY, while CGAsync calls VecAXPYAsync(Vec y, double *a, Vec x) with a referencing device memory, so that VecAXPYAsync can be queued to a stream while a is computed on the device. (2) CG does scalar arithmetic (e.g., dividing two scalars) on the CPU, while CGAsync does it with tiny scalar kernels on the GPU. (3) CG checks convergence (by comparison) in every iteration on the CPU to determine whether it should exit the loop, while CGAsync does not; users need to specify the maximal number of iterations. This could be improved by checking for convergence every few (e.g., 20) iterations; we leave this as future work.

We tested CG and CGAsync without preconditioning on a single Summit compute node with two sparse matrices from the SuiteSparse Matrix Collection [25].
CG was run with PetscSF CUDA-aware MPI, and CGAsync was run with PetscSF NVSHMEM. The first matrix is Bump_2911, with about 3M rows and 128M nonzero entries. We ran both algorithms for 10 iterations with 6 MPI ranks and one GPU per rank. Fig. 10 shows their timelines obtained with the NVIDIA Nsight Systems profiler. The kernel launches (label CUDA API) in CG were spread over the 10 iterations. The reason is that in each iteration there are multiple MPI calls (mainly in MatMult, as discussed in Section 4.1, and in vector dot and norm operations), which constantly block the kernel launch pipeline. In CGAsync, by contrast, while the device was executing the 8th iteration (with profiling), the host had already launched all kernels for the 10 iterations. The long red cudaMemcpyAsync bar indicates that after the kernel launches, the host was idle, waiting for the final result from the device.

Fig. 10: Timelines of CG (top) and CGAsync (bottom) on rank 2. Each ran ten iterations. The blue csr... bars are csrMV (i.e., SpMV) kernels in cuSPARSE, and the red c... bars are cudaMemcpyAsync() copying data from device to host.

The test results show that the time per iteration for CG and CGAsync was about 690 and 676 µs, respectively: CGAsync gave merely a 2% improvement. This small improvement is because the matrix is huge, and computation takes the vast majority of the time. From profiling, we knew that SpMV alone (excluding communication) took 420 µs. If one removes the computational time, the improvement in communication time is substantial. Unfortunately, because of bugs in the NVSHMEM library with multiple nodes, we could not scale the test to more compute nodes. Instead, we used a smaller matrix, Kuu, with about 7K rows and 340K nonzero entries, to see how CGAsync would perform in a strong-scaling sense. We repeated the above tests. The time per iteration for CG and CGAsync was about 300 and 250 µs, respectively: CGAsync gave a 16.7% improvement.
Note that this improvement was attained despite the higher ping-pong latency of PetscSF NVSHMEM. We believe CGAsync has considerable potential for improvement. As the NVSHMEM library matures, it should reach or surpass MPI's ping-pong performance. We also note that there are many kernels in CGAsync; for the scalar kernels mentioned above, kernel launch times could not be entirely hidden with small matrices. We are investigating techniques such as CUDA Graphs to automatically fuse kernels in the loop to further reduce the launch cost; with MPI, such fusion within an iteration is not possible because of the synchronizations that MPI mandates.

This section reports on the robustness and efficiency of the PetscSF infrastructure as used in the mesh distribution process. In the first stage, the cell-face connectivity graph is constructed and partitioned, followed by the actual migration of the mesh data (topology, labels associated with mesh points, and cell coordinates) and then the final setup of the distributed mesh. We do not analyze the stage around graph partitioning and instead focus on the timings associated with the distribution of the needed mesh data, followed by the final local setup.

We consider the migration induced by a graph partitioning algorithm on three initial distributions of a fully periodic hexahedral mesh:

• Sequential: the mesh is entirely stored on one process.
• Chunks: the mesh is stored in non-overlapping chunks obtained by a simple distribution of the lexicographically ordered cells.
• Rand: the mesh is stored randomly among processes.

The Sequential case is common in scientific applications when the mesh is stored in a format that is not suitable for parallel reading. It features a one-to-all communication pattern. The Chunks and Rand cases represent different mesh distribution scenarios after a parallel read, and a many-to-many communication pattern characterizes them.
Ideally, the Chunks case has a more favorable communication pattern than the Rand case, where potentially all processes need to send data to and receive data from all other processes. Fig. 11 collects the mesh migration timings as a function of the number of processes used in the distribution. Timings remain essentially constant as the number of processes is increased from 420 to 16,800, with increased communication times offset by a decrease in the subsequent local setup time, confirming the scalability of the overall implementation.

Fig. 11: Mesh migration timings as a function of the number of MPI processes for the Sequential (red), Chunks (blue), and Rand (magenta) cases.

We recently developed a generic driver for parallel sparse matrix-matrix multiplications that takes advantage of the block-row distribution used in PETSc for their storage; see Fig. 3. Here we report preliminary results using cuSPARSE to store and manipulate the local matrices. In particular, given two parallel sparse matrices A and P, we consider the matrix product AP and the projection operation P^T AP, by splitting them into three different phases:

1) Collect the rows of P corresponding to the nonzero columns of the off-diagonal portion of A.
2) Perform local matrix-matrix operations.
3) Assemble the final product, possibly involving off-process value insertion.

Step 1 is a submatrix extraction operation, while step 2 corresponds to a sequence of purely local matrix-matrix products that can execute on the device. During the symbolic phase, we set up an intermediate PetscSF with leaf vertices represented by the row indices of the result matrix and root vertices given by its row distribution. The communication of the off-process values needed in step 3 is then performed with the SFGather operation on a temporary GPU buffer, followed by local GPU-resident assembly. The performance of the numerical phase of these two matrix-matrix operations on GPUs is plotted in Fig.
12 for different numbers of Summit nodes, compared against our highly optimized CPU matrix-matrix operations that directly call MPI send and receive routines. The A operators are obtained from a second-order finite element approximation of the Laplacian, while the P matrices are generated by the native algebraic multigrid solver in PETSc. The finite element problem, defined on an unstructured tetrahedral mesh, is partitioned and weakly scaled among Summit nodes, from 1 to 64; the number of rows of A ranges from 1.3 million to 89.3 million. A strong-scaling analysis is carried out within a node, and the timings of the numerical algorithm using 6 GPUs per node (label 6G) are compared against an increasing number of cores per node (from 6 to 42, labeled 6C and 42C, respectively). The Galerkin triple matrix product shows the most promising speedup, while the performance of the matrix product AP is more dependent on the number of nodes. We plan to perform further analysis and comparisons when our NVSHMEM backend for PetscSF supports multinode configurations and we have full support for asynchronous device operations using streams within the PETSc library.

Fig. 12: Timings for parallel sparse matrix-matrix numerical products. Left is AP; right is P^T AP. 6 GPUs per node (6G) are compared against an increasing number of cores per node (6C-42C) with different numbers of Summit nodes.

CONCLUSION AND FUTURE WORK

We introduced PetscSF, the communication component of PETSc, including its programming model, its API, and its implementations. We emphasized the implementation on GPUs, since one of our primary goals is to provide highly efficient PetscSF implementations for the upcoming exascale computers. Our experiments demonstrated PetscSF's performance, overhead, scalability, and novel asynchronous features. We plan to continue optimizing PetscSF for exascale systems and to investigate asynchronous computation on GPUs enabled by PetscSF at large scale.
ACKNOWLEDGMENTS

We thank Akhil Langer and Jim Dinan from the NVIDIA NVSHMEM team for their assistance. This work was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the U.S. Department of Energy under Contract DE-AC02-06CH11357 and Office of Science Awards DE-SC0016140 and DE-AC02-0000011838. This research used resources of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

REFERENCES

[1] R. T. Mills, M. F. Adams, S. Balay, J. Brown, A. Dener, M. Knepley, S. E. Kruger, H. Morgan, T. Munson, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and J. Zhang, "Toward performance-portable PETSc for GPU-based exascale systems," arXiv:2011.00715v1.
IEEE Parallel & Distributed Technology: Systems & Applications, vol. 1, no. 1, pp. 25–42, 1993.
[5] T. El-Ghazawi and L. Smith, "UPC: Unified Parallel C," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006, pp. 27–es.
[6] R. W. Numrich and J. Reid, "Co-Array Fortran for parallel programming," in ACM SIGPLAN Fortran Forum, vol. 17, no. 2, 1998, pp. 1–31.
[7] B. L. Chamberlain, S. Deitz, M. B. Hribar, and W. Wong, "Chapel," Programming Models for Parallel Computing, pp. 129–159, 2015.
Scientific Programming, vol. 20, no. 2, pp. 129–150, 2012.
[10] "GSLIB," https://github.com/Nek5000/gslib, 2021.
[11] "Nek5000," https://nek5000.mcs.anl.gov, 2021.
[12] D. Morozov and T. Peterka, "Block-parallel data analysis with DIY2," IEEE, 2016, pp. 29–36.
[13] H. C. Edwards, C. R. Trott, and D. Sunderland, "Kokkos: Enabling manycore performance portability through polymorphic memory access patterns," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014, Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
[14] D. A. Beckingsale, J. Burmark, R. Hornung, H. Jones, W.
Killian, A. J. Kunen, O. Pearce, P. Robinson, B. S. Ryujin, and T. R. Scogland, "RAJA: Portable performance for large-scale scientific applications."
Army High Performance Computing Research. Citeseer, 1994.
[17] M. G. Knepley and D. A. Karpeev, "Mesh algorithms for PDE with Sieve I: Mesh distribution," Scientific Programming, vol. 17, no. 3, pp. 215–230, 2009.
[18] M. Lange, L. Mitchell, M. G. Knepley, and G. J. Gorman, "Efficient mesh management in Firedrake using PETSc-DMPlex," SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. S143–S155, 2016.
[19] Wikipedia, "Hasse diagram," 2015, http://en.wikipedia.org/wiki/Hasse_diagram.
[20] M. G. Knepley, M. Lange, and G. J. Gorman, "Unstructured overlapping mesh distribution in parallel," https://arxiv.org/abs/1506.06194, 2017.
[21] T. Hoefler, C. Siebert, and A. Lumsdaine, "Scalable communication protocols for dynamic sparse data exchange," ACM SIGPLAN Notices, vol. 45, no. 5, pp. 159–168, 2010.
[22] N. Dryden, N. Maruyama, T. Moon, T. Benson, A. Yoo, M. Snir, and B. Van Essen, "Aluminum: An asynchronous, GPU-aware communication library optimized for large-scale training of deep neural networks on HPC systems," Nov. 2018, pp. 1–13.
[23] D. Panda et al., "OSU micro-benchmarks v5.7," http://mvapich.cse.ohio-state.edu/benchmarks/, 2020.
[24] J. Zhang, R. T. Mills, and B. Smith, "Evaluation of PETSc on a heterogeneous architecture, the OLCF Summit system, part II: Basic communication performance," Mathematics and Computer Science Division, Argonne National Laboratory, Tech. Rep. ANL 20-76, 2020.
[25] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Trans. Math. Softw., vol. 38, no. 1, Dec. 2011.

Junchao Zhang is a software engineer at Argonne National Laboratory. He received his Ph.D. in computer science from the Chinese Academy of Sciences, Beijing, China. He is a PETSc developer and works mainly on the communication and GPU support in PETSc.
Jed Brown is an assistant professor of computer science at the University of Colorado Boulder. He received his Dr.Sc. from ETH Zürich and his BS+MS from the University of Alaska Fairbanks. He is a maintainer of PETSc and leads a research group on fast algorithms and community software for physical prediction, inference, and design.

Satish Balay is a software engineer at Argonne National Laboratory. He received his M.S. in computer science from Old Dominion University. He is a developer of PETSc.

Jacob Faibussowitsch is a Ph.D. student in mechanical engineering and computational science and engineering at the University of Illinois at Urbana-Champaign, where he also received his B.S. His work focuses on high-performance scalable fracture mechanics at the Center for Exascale-Enabled Scramjet Design. He is a developer of PETSc.

Matthew Knepley is an associate professor at the University at Buffalo. He received his Ph.D. in computer science from Purdue University and his B.S. from Case Western Reserve University. His work focuses on computational science, particularly in geodynamics, subsurface flow, and molecular mechanics. He is a maintainer of PETSc.

Oana Marin is an assistant applied mathematics specialist at Argonne National Laboratory. She received her Ph.D. in theoretical numerical analysis at the Royal Institute of Technology, Sweden. She is an applications-oriented applied mathematician who works on numerical discretizations in computational fluid dynamics, mathematical modeling, and data processing.

Richard Tran Mills is a computational scientist at Argonne National Laboratory. His research spans high-performance scientific computing, machine learning, and the geosciences. He is a developer of PETSc and the hydrology code PFLOTRAN. He earned his Ph.D. in computer science at the College of William and Mary, supported by a U.S. Department of Energy Computational Science Graduate Fellowship.
Todd Munson is a senior computational scientist at Argonne National Laboratory and the Software Ecosystem and Delivery Control Account Manager for the U.S. DOE Exascale Computing Project. His interests range from numerical methods for nonlinear optimization and variational inequalities to workflow optimization for online data analysis and reduction. He is a developer of the Toolkit for Advanced Optimization.

Barry F. Smith is an Argonne National Laboratory Associate. He is one of the original developers of the PETSc numerical solvers library. He earned his Ph.D. in mathematics at the Courant Institute.

Stefano Zampini is a research scientist in the Extreme Computing Research Center of King Abdullah University of Science and Technology (KAUST), Saudi Arabia. He received his Ph.D. in applied mathematics from the University of Milano Statale. He is a developer of PETSc.