Delayed approximate matrix assembly in multigrid with dynamic precisions
Charles D. Murray, Tobias Weinzierl
May 8, 2020
The accurate assembly of the system matrix is an important step in any code that solves partial differential equations on a mesh. We either explicitly set up a matrix, or we work in a matrix-free environment where we have to be able to quickly return matrix entries upon demand. Either way, the construction can become costly due to non-trivial material parameters entering the equations, multigrid codes requiring cascades of matrices that depend upon each other, or dynamic adaptive mesh refinement that necessitates the recomputation of matrix entries or the whole equation system throughout the solve. We propose that these constructions can be performed concurrently with the multigrid cycles. Initial geometric matrices and low accuracy integrations kickstart the multigrid, while improved assembly data is fed to the solver as and when it becomes available. The time to solution is improved as we eliminate an expensive preparation phase traditionally delaying the actual computation. We eliminate algorithmic latency. Furthermore, we desynchronise the assembly from the solution process. This anarchic increase of the concurrency level improves the scalability. Assembly routines are notoriously memory- and bandwidth-demanding. As we work with iteratively improving operator accuracies, we finally propose the use of a hierarchical, lossy compression scheme such that the memory footprint is brought down aggressively where the system matrix entries carry little information or are not yet available with high accuracy.
Multigrid algorithms are among the fastest solvers known for elliptic partial differential equations (PDEs) of the type

$$-\nabla \cdot (\epsilon \nabla u) = f \qquad (1)$$

on a $d$-dimensional, well-shaped domain $\Omega$. An approximation of the function $u: \Omega \mapsto \mathbb{R}$ is what we are searching for, with $\epsilon: \Omega \mapsto \mathbb{R}^+$ as a material parameter and $f: \Omega \mapsto \mathbb{R}$ constituting the right-hand side. The system is closed by appropriate boundary conditions. We restrict ourselves to Dirichlet conditions here. A Ritz-Galerkin finite element discretisation over a mesh $\Omega_h$ that geometrically discretises $\Omega$ yields an equation system $A_h u_h = f_h$. This is the linear equation system actually tackled by multigrid (MG).

* The work was funded by an EPSRC DTA PhD scholarship (award no. 1764342). It made use of the facilities of the Hamilton HPC Service of Durham University. The underlying project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 671698 (ExaHyPE). This paper is an extended version of C.D. Murray and T. Weinzierl:
Lazy stencil integration in multigrid algorithms, as introduced and published at the PPAM'19 conference.
Figure 1: Multigrid implementation challenges: (1) An implementation has to be correct, i.e. yield the result of the underlying mathematics and thus be consistent with it, (2) there have to be as few data structure traversals (sweeps) as possible, while the cost per cycle (3), the assembly/setup cost (4) and the memory footprint (5) have to be small, too. (3), (4), (5) are the core areas where we make a contribution, while implications on (1) and (2) are studied.

Equations of the type (1) arise in many application domains studying (quasi-)stationary phenomena. They also are an important building block within many time-dependent problems, where they model incompressibility conditions or friction, for example. Finally, they also arise in Lagrangian setups, where they model gravity between moving objects, for example.

Common to all sketched application areas is that solving $A_h u_h = f_h$, i.e. finding $u_h \approx A_h^{-1} f_h$, is expensive. The ellipticity of (1) implies that any change of a solution value anywhere within $\Omega$ impacts the solution over the entirety of $\Omega$. Finite elements and related techniques manage to break up this strong global dependency by discretising the PDE with test and shape functions that have local support. Any single update of an entry of $u_h$ affects only neighbouring elements within the discretisation, i.e. few other entries within $u_h$. It propagates through the whole domain from there. We work with equation systems that are sparse, and thus manageable from both a memory and compute effort point of view. However, information propagation this way is intrinsically slow. Multigrid compensates for this effect as it removes errors across a hierarchy of meshes. Coarser and coarser grids take ownership of updates that yield non-local modifications, i.e. they handle low-frequency errors from the fine grid. This multilevel approach to tackle an elliptic problem on a cascade of resolutions is the seminal idea behind multigrid and the reason why multigrid is fast.

When we implement multigrid, the implementation has to be in line with the mathematical theory, while the cost of the actual implementation is determined by at least four factors: the number of solver iterations; the time taken for a single solver iteration including all data traversals—of which there can be multiple in the MG context; the compute cost (time) to set up all required operators (matrices) on all of the different scales and all required operators coupling different scales; and the memory footprint of the method (Fig. 1). Traditionally, computer science research focuses on the first two aspects—both from a numerical point of view and with performance engineering glasses on. Yet, there are more and more cases where the latter two constraints become pressing. There are at least three reasons for this. Firstly, the proportion of the runtime spent on the solution phase often diminishes relative to (re-)assembly time: If (1) serves as a building block within a time-stepping code, a solution $u$ will often not change radically between time steps. Along the same lines, dynamic adaptivity changes the mesh and consequently demands a new discretisation of (1). Yet, a grid update usually does not change the solution completely. As a result, a low number of iterations or cycles yields a valid solution in both cases as long as we start from the previous step's solution as an initial guess. Secondly, we suffer from a widening memory gap on our machines [14].
The increase in memory access speed cannot keep pace with the growth of compute power on newer systems. As dynamic adaptivity—and non-linear systems, which are out of scope here—become mainstream, we assemble multiple times; conversely, one assembly remains sufficient for static meshes and linear PDEs. Each re-assembly stresses the memory interconnects since it is typically not compute-intense. Yet, this is the step where volumetric domain information ($\epsilon$-fields) has to be streamed into the core. The impact is amplified by multigrid's inherent multiscale nature: It is not only a single matrix $A_h$ discretising (1); rather, we have to maintain and construct a series of matrices $A_h, A_{kh}, A_{k^2 h}, \ldots$ for coarsening by a factor of $k \geq 2$, as well as the corresponding inter-grid transfer operators, i.e. prolongations and restrictions.
Finally, due to this multitude of involved operators, multigrid is memory-demanding. In an era with stagnating or even decreasing memory per core, the overhead to store restriction, prolongation and coarse grid matrices quickly becomes a limiting constraint. Purely geometric approaches with rediscretisation avoid this overhead. They construct local matrix entries on-demand, usually when they are required throughout a matrix-vector product. In such a case, there is no need to store the matrix entries at all. It is however known that massive sudden $\epsilon$ changes, additional convective terms or nonlinearities here require very detailed operator recomputations. We trade the memory demands for heavy compute effort. Furthermore, geometric multigrid tends to become unstable in case of non-trivial $\epsilon$ distributions [28, 31, 32] and is then not stable in a multigrid sense either [37].

The additional operators introduced by multigrid on top of the actual discretisation are correction operators, while the overall scheme is an iterative approach. Furthermore, the fundamental solutions to (1) quickly decay. Ergo, neither do the coarse grid equations have to tackle the exact equation right from the start, nor do we need exact fine grid operators prior to the first iteration. Furthermore, if an operator is slightly wrong, i.e. if it yields slightly wrong iterates, this error will mainly affect the area around the erroneous update in subsequent applications, as the fundamental solutions to (1) quickly decay. Therefore, it is sufficient to kick off with approximate operators, as long as (i) we later increase their accuracy such that we eventually solve the right equation system; (ii) the smoother continually pushes the solution into the right direction; (iii) the correction equations do not impede these improvements; and (iv) we quickly get the majority of operators within the computational domain right. Our idea therefore is to kick off multigrid with very crude fine grid approximations plus geometric coarse grid and inter-grid transfer operators which can be quickly precomputed. While we run the multigrid cycle, we successively improve the approximation quality of the fine grid operators. These improvements are deployed as tasks which run in the background of the actual solver: While our solver determines initial solution approximations, we derive the correct operators describing the true solution. The overall integration is precision-guided, i.e. we continuously improve those equation system entries that continue to benefit from improved integration as well as improved storage accuracy. There is neither a global, uniform numerical integration precision nor a uniform global (IEEE) floating point format. The whole mindset unfolds its beauty once we stop storing multigrid's matrices explicitly. Instead, we embed the matrix entries into our mesh—a strategy we label as quasi matrix-free [37]—and store solely differences to geometric operators. The information within any low accuracy stencil, no matter whether sufficient or not yet available with higher accuracy, can then be encoded with few bytes only. Overall, our contribution is three-fold:

1. We allow for a solver start without an expensive pre-solve assembly phase, i.e. we eliminate algorithmic latency;

2. we increase the code's concurrency level as we decouple the actual assembly process from the solve and make it feed into the latter anarchically, i.e.
as soon as results become available yet without any synchronisation or temporal ordering;

3. and we bring down the memory footprint. This notably affects the initial steps, which tend to be cheap anyway, as dynamic grid refinement just starts to unroll the real compute grid.

To the best of our knowledge, this is the first algorithmic blueprint systematically exploring how to incorporate problem-dependent numerical integration of fine grid stencils, adaptive coarse grid operator computation and non-IEEE storage formats without the cost of any arduous assembly into one multigrid implementation.

The remainder of the paper is organised as follows: We first discuss potential caveats of our philosophy in Section 2 that we want to keep in mind and address throughout the paper. We then review published and related work, which contextualises the present work, explains where ideas come from and how the proposed techniques fit to other activities. All details required to understand the novel ideas of our work are introduced together with our multigrid algorithm of interest in Section 4. From there, we establish our notion of a delayed, iterative stencil assembly with flexible precision (Section 5). We dedicate Section 6 to a discussion of the solver's properties that result from this novel assembly paradigm. A brief summary and an outlook close the discussion.
There are certain obvious caveats with our proposed solution:

1. The matrix entry computation could be subject to starvation. If the results of new integration tasks are not dropping in on time, the multigrid solver might converge towards the wrong solution and prematurely signal termination. If the integration is run with high priority and thus precedes the actual solution process, it is not subject to these concerns. Yet, it loses the proposed selling point. One might argue that complex integration patterns typically arise only around submanifolds within the domain. Despite our remarks on a rapid decay of errors, the elliptic nature of the problem of choice however implies that inaccurate integration within a subdomain can pollute the entire solution. The starvation phenomenon thus has to be analysed. We have to validate that the total number of cycles/iterations required is not increased massively (item (2); Fig. 1).

2. The motivation behind multigrid's construction—we work with a cascade of coarser and coarser grids where the coarse grids "take care" of global propagation—implies that any fine grid stencil modification has its impact felt throughout all grid levels. As long as we change the discretisation, the exact nature of all coarse grid operators also continues to change. To avoid repeated data accesses per sweep, we limit the multilevel propagation speed, i.e. our operator updates propagate from fine to coarse one level per cycle. Updates ripple through the system. For multiplicative multigrid, such a one-level-at-a-time policy is reasonable, as the solver also handles one solution at a time. However, additive multigrid processes all resolutions in one rush. Coarse operators thus are incorrect if operator changes are not immediately rolled out to all levels. Therefore we exclusively focus on additive multigrid here. It is more challenging. The rippling then has to be anticipated carefully—in particular once we work with dynamically adaptive meshes—as we run the risk of coarse grid updates pushing the solution in the wrong direction, i.e. of inconsistent update rules (item (1); Fig. 1). Non-linear setups would yield the same issues.

3. Our techniques are not lightweight in the sense that they can easily be added to existing solvers in a black-box fashion. This particularly holds for the usage of multiple precision formats, for which most software might be ill-prepared. While our experiments focus on structured adaptive meshes resulting from spacetrees only, it is clear that they apply directly to unstructured meshes even if they lack a built-in hierarchy, since both geometric and algebraic coarsening construct a hierarchy anyway. Our accuracy dependencies and recomputation needs follow this hierarchy. It is however clear that the integration of the ideas into existing unstructured mesh software requires additional effort and software design.

Our experiments suggest that the first item is not observed in practice, even though our work studies a worst-case setup with our additive MG solvers. We propose a solution to the second caveat. For the third item, our work provides clues for what solver development roadmaps might want to incorporate in the future.
Early work by Achi Brandt [4]—specifically his work on MLAT—already clarifies that "discretization and solution processes are intermixed with, and greatly benefit from, each other" in an adaptive scheme. This principle is not exclusive to MLAT. It holds for all forms of adaptive mesh refinement (AMR). If dynamic AMR starts with a coarse discretisation and unfolds the mesh anticipating the real solution's behaviour, we can read this as delaying an exact computation of the fine grid equations until we know that they are needed. Dynamic adaptivity also intermixes the assembly with the linear equation system solver.

Within the solver world, the seminal additive scheme proposed by Bramble, Pasciak and Xu [3, 2] does not utilise exact equation representations on coarse grid levels. Multiple coarse grid corrections are computed independently from the same fine grid residual and eventually summed up. However, the coarse grid correction equation in a multigrid sense results from a simple $h$-scaling of the fine grid equation's diagonal. This approximation is sufficient for convergence in many cases [33]. Starting with inexact equations or setups and improving them subsequently is not new within the multigrid-as-a-solver community either. Adaptive AMG [8, 9] constructs coarse grids iteratively. A tentative coarsening setup is improved by applying it to a set of candidate vectors for a small number of iterations. Similarly, Bootstrap AMG [6] modifies both prolongations and the coarse grid hierarchy itself by applying them to randomly constructed problems and modifying them to improve convergence. In both Adaptive AMG and Bootstrap AMG, all modifications of the coarse grid equations happen during an extended setup phase rather than during the iterations of the solution process.

Inaccurate operator approximations are popular when solving non-linear equations. In the Inexact Newton Method [11, 23], for example, the Jacobian is approximated once and then used within an iterative process yielding a sequence of corrections to the solution. Along the same lines, multigrid for non-linear problems often uses multigrid within a Newton solve where the non-linear operator is linearised, i.e. approximated. Finally, some multigrid techniques for convection-dominated scenarios symmetrise the underlying fine grid phenomenon [41] before they make it subject to algebraic coarsening. The motivation here is to stabilise the solver, but it obviously constructs a regime with inaccurate coarse grid operators. Our solver overview is far from comprehensive.

For time-dependent problems, many established (commercial) codes still employ direct solvers that store an explicit inverse of the system matrix. This is particularly attractive in scenarios where the mesh does not change, as the inversion of a matrix preceding a new time step's solution is performed only once. After that, we merely apply the known inverse, i.e. rely on a matrix-vector product. The massive memory footprint required to store an inverse of a sparse system makes this approach quickly prohibitive. However, applying multigrid to each and every time step also yields excessive costs, as the solution changes smoothly in time. We therefore have previously studied a "multigrid" concept where a single level smoothing step is followed by a two-level scheme, followed by another single level solve, a four level scheme, and so forth [39].
The coarser a grid level, the more it affects future solutions even though we do not update the underlying operator in time anymore. Such a scheme translates multigrid's $h$-coarsening idea into the time domain. Most approaches from the parallel-in-time community (see [34] and follow-up work for newer trends) also exploit the idea to approximate the coarse operators crudely, but to incrementally improve them.

On the implementation side, a lot of mature software supports matrix-free solvers today. PETSc for example phrases its algorithms as if it had a fully assembled matrix at hand. However, many of its core routines make no assumptions about how this matrix is actually assembled. In its MatShell variant, it specifically allows users to deliver matrix parts on-demand [1], i.e. PETSc asks for the matrix part, applies it to the vectors of interest, and immediately discards the operator again. While such an approach allows users to (re-)compute all operators whenever they are required, a sole matrix-free implementation runs the risk of quickly becoming inefficient or unstable. If material parameters in (1) change rapidly, any on-demand operator evaluation has to integrate the underlying weak formulation with high accuracy. This quickly becomes expensive. On the coarser grids, a sole matrix-free mindset means that operators cannot depend on the next finer operator following a Ritz-Galerkin multigrid strategy, as this next finer operator is not available explicitly either and will recursively depend on even finer levels. Due to these shortcomings, the hybridisation of algebraic and geometric multigrid is well-trodden ground [16, 22, 35, 29]. Most hybrid approaches use algebraic multigrid where sole geometric operators fail, but employ geometric operators wherever possible. As material parameters change "infrequently" on fine meshes and diffusion dominates in many setups, it is indeed possible to rely on a geometric operator construction for the majority of the equation systems.

Our own work in [37] introduces an alternative notion of hybrid algebraic-geometric multigrid. All operators here are embedded into the grid and, in principle, algebraic. As they are encoded within the mesh, we can access them within a matrix-free mindset. Storing the operators relative to geometric operators in a compressed form allows us to work with a memory footprint close to geometric multigrid. The present paper follows up on this strategy and fuses the underlying idea of lossy compression with iterative operator assembly. It stores operators with reduced precision where appropriate. We have previously explored lossy low precision storage both in a multigrid [37] and an SPH [15] context. Both approaches demonstrate how much we can save in terms of memory footprint, and both publications point out that the price to pay for this is an increased arithmetic workload. In the SPH context, we propose and prototype how the additional operators can be deployed to tasks of their own. However, the strict causal dependency there implies that performance penalties can and do arise. Performance penalties resulting from the increased compute load—it is notably more expensive to check to which degree we can store an operator with reduced precision—do not arise in the present case, as we propose an anarchic scheme where we ignore dependencies of the actual solve on the newly introduced stencil assembly tasks plus any storage loss analysis.

Let $A_h u_h = f_h$ describe the equation system that arises from a nodal Ritz-Galerkin finite element discretisation of (1).
The test and shape function spaces are the same. We use $d$-linear functions. Hence, we obtain the classic 9-point or 27-point stencils on regular meshes. Each stencil describes the entries of one row of $A_h$. It describes how a single point (vertex) on one level depends on its cell-connected neighbours. The solution vector $u_h$ stores the weights (scalings) of the individual shape functions. This simple choice of mathematical ingredients allows for a multitude of multigrid flavours already. We classify some of these flavours and point out which flavours we study in the present paper. The list is not comprehensive but focuses on solver nuances which are affected by the proposed techniques.
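For orientation, the simplest of these stencils can be written down explicitly. The following is the textbook 9-point stencil of the bilinear finite element discretisation of the Laplacian (i.e. $\epsilon \equiv 1$) on a regular square mesh, included purely for illustration; in 2D it is independent of the mesh size $h$:

```latex
% 9-point stencil of the bilinear FEM discretisation of -\Delta u on a
% regular square mesh with \epsilon \equiv 1 (standard result, for illustration).
\frac{1}{3}
\begin{bmatrix}
  -1 & -1 & -1 \\
  -1 &  8 & -1 \\
  -1 & -1 & -1
\end{bmatrix}
```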
Geometric vs. algebraic construction of multilevel hierarchy.

Multigrid solvers can be classified into either mesh-based coarsening or solvers with algebraic coarsening [7, 30]. The former rely on an existing cascade of coarser and coarser geometric meshes $\Omega_h, \Omega_{kh}, \Omega_{k^2 h}, \ldots$ which often embed into each other. Algebraic coarsening derives the coarse meshes from a connectivity analysis of $A_h$, i.e. directly from the matrix without a geometric interpretation. In our work, we stick to a geometric approach relying on spacetrees [40]: We take the domain $\Omega$ and embed it into a cube. The cube yields a mesh $\Omega_0$ without any real degrees of freedom, as all vertices either discretise points outside of the domain or coincide with the domain boundary. We cut the cube into $k$ equidistant slices along each coordinate axis to end up with $k^d$ new cubes which form $\Omega_1$. On a cube-by-cube basis we continue recursively yet independently.

The construction process yields a tree of cubes which define a cascade of grids $\Omega_0, \Omega_1, \ldots, \Omega_{\ell_{max}}$ that are embedded into each other. We call the subscript $\ell$ in $\Omega_\ell$ the grid level, i.e. the smaller the index the coarser the mesh [18, 36], and make a few observations: The individual grids embed into coarser grids ($\Omega_{\ell-1} \subset \Omega_\ell$). Individual meshes $\Omega_\ell$ might cover only parts of the domain and might yield disjoint submeshes. The union of all meshes $\bigcup_\ell \Omega_\ell = \Omega_h$ is an adaptive mesh. By subsequently removing the largest level $\ell$ from the union, we can construct our coarse grid hierarchy in a geometric multigrid sense. With this level definition, it is convenient to label $u_\ell$ with the level instead of a generic $h$ subscript.
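A minimal sketch of this recursive construction, assuming tri-partitioning ($k = 3$), $d = 2$ and a placeholder refinement criterion, could look as follows. It is illustrative only and not the paper's actual mesh code:

```cpp
// Hedged sketch: recursive spacetree construction. Each refined node is cut
// into k slices per axis, giving k^d children; all nodes of one depth form
// the grid Omega_l. Criterion and bookkeeping are illustrative placeholders.
#include <array>
#include <functional>
#include <memory>
#include <vector>

constexpr int d = 2;
constexpr int k = 3;   // tri-partitioning, as assumed for this sketch

struct TreeNode {
  std::array<double, d> offset;   // lower-left corner of the cube
  double width = 1.0;             // edge length of the cube
  int level = 0;                  // subscript l of Omega_l
  std::vector<std::unique_ptr<TreeNode>> children;   // empty or k^d entries
};

void refine(TreeNode& node,
            const std::function<bool(const TreeNode&)>& refineCriterion) {
  if (!refineCriterion(node)) return;
  const double h = node.width / k;
  for (int i = 0; i < k; ++i)
    for (int j = 0; j < k; ++j) {           // k^d children for d = 2
      auto child = std::make_unique<TreeNode>();
      child->offset = {node.offset[0] + i * h, node.offset[1] + j * h};
      child->width = h;
      child->level = node.level + 1;
      refine(*child, refineCriterion);      // continue cube-by-cube, independently
      node.children.push_back(std::move(child));
    }
}
```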
Correction vs. full approximation storage realisations.

Multigrid solvers can be classified into correction schemes and full approximation storage (FAS) realisations [5, 36]. The latter operate on a solution to the PDE on each and every level, whereas in a classic correction scheme the coarse grid weights have solely correcting semantics. We make all inner and boundary points of each mesh $\Omega_\ell$ carry a $d$-linear shape function. $\bigcup_\ell \Omega_\ell = \Omega_h$ hence spans a generating system where multiple vertices (and therefore weights) coincide spatially yet are unique due to their level. Let $P_\ell^{\ell+1}$ denote prolongations of data on level $\ell$ onto level $\ell+1$. As there is no guarantee that all cubes of the mesh $\Omega_\ell$ are refined, the operator might only take a subset of $\Omega_\ell$ and transfer it onto the next resolution level. Let $R_\ell^{\ell-1}$ be a restriction, i.e. the counterpart operation to $P_\ell^{\ell+1}$. Again, it affects only those regions of $\Omega_\ell$ that are refined further. There are three natural implications of this setup: (i) We can make the overall solution to the PDE unique by enforcing $u_{\ell-1}(x) \leftarrow u_\ell(x)$ for every vertex pair that coincides spatially. We write this down as $u_{\ell-1}(x) = I_\ell^{\ell-1} u_\ell(x)$. $I$ is the injection operator [17] (also named "trivial restriction" [18]). (ii) We can simply interpolate weights from $u_{\ell-1}$ to all hanging vertices on level $\ell$. Hanging vertices, i.e. vertices with less than $2^d$ adjacent cells on the same level, do not carry any real shape functions. Yet, we temporarily augment them by truncated shapes such that a weak Ritz-Galerkin formulation for their neighbouring non-hanging vertices on the same level makes sense. (iii) We can exploit the Ritz-Galerkin coarse grid operator definition and implement Griebel's HTMG [17] straightforwardly:

$$\underbrace{A_{\ell-1}}_{= R_\ell^{\ell-1} A_\ell P_{\ell-1}^{\ell}} u_{\ell-1} = A_{\ell-1} e_{\ell-1} + A_{\ell-1} I_\ell^{\ell-1} u_\ell = \underbrace{R_\ell^{\ell-1} r_\ell}_{r_\ell = f_\ell - A_\ell u_\ell} + R_\ell^{\ell-1} A_\ell P_{\ell-1}^{\ell} I_\ell^{\ell-1} u_\ell = R_\ell^{\ell-1} \Big( f_\ell - A_\ell \underbrace{\big( u_\ell - P_{\ell-1}^{\ell} I_\ell^{\ell-1} u_\ell \big)}_{\hat u_\ell} \Big)$$

This is an elegant rephrasing of FAS relying on two different types of residuals: the standard residual $r_\ell = f_\ell - A_\ell u_\ell$ guides any iterative update on a level, while its hierarchical counterpart $\hat r_\ell = f_\ell - A_\ell \hat u_\ell$ feeds into its next coarser equation. Both result from the same discretisation stencil. FAS traditionally is introduced as a tool to handle non-linear equation systems. It also pays off for in-situ visualisation as it holds the solution in a multiscale representation and thus can allow users to zoom in and out. It encodes levels of detail. We use FAS as it makes the handling of hanging meshes simple [28, 37]: Fine grid vertices adjacent to a refinement transition on the finest level have two different semantics: They carry correction and solution weights. Both quantities work in different regimes—the solution weights converging towards the "real" value while the corrections approach zero. With FAS, we do not have to distinguish/classify them. This argument rolls over recursively to all grid levels.
Rediscretisation and geometric projections between levels vs. Ritz-Galerkin plus algebraic operators.
Multigrid solvers can be classified into solvers using rediscretisation plus geometric transfer operators vs. solvers using algebraic inter-grid transfer plus Ritz-Galerkin correction operators [12, 10, 30]. If we use geometric transfer operators, $P$ and $R$—we omit the indices from here on where they are obvious from the context—are $d$-linear. If we use rediscretisation, $A_\ell$ stems from a Ritz-Galerkin finite element discretisation with the same type of shape and test functions on each and every level. If we use algebraic inter-grid transfer operators, $P$ has to be constructed such that corrections prolongated from level $\ell-1$ are compatible with $A_\ell$'s near-nullspace on level $\ell$. We use BoxMG [12, 13, 41] to approximate such operators. The restriction is either the transpose of $P$ or a symmetrised version of it [41]. A Ritz-Galerkin coarse grid operator $A_{\ell-1} = R A_\ell P$ is computed from the fine grid operator plus the (possibly algebraic) restriction and prolongation. Thus, it mimics the fine grid operator's impact on a coarser solution. When using rediscretisation plus geometric inter-grid transfer operators one "hopes" that the resulting coarse grid operator exhibits similar properties.

BoxMG plus Ritz-Galerkin yield the same operators as geometric rediscretisation on our meshes if $\epsilon$ is constant over the domain $\Omega$, while it requires the absence of further PDE terms such as convection, non-linear terms and anisotropic material. The finer the grid, the more dominant the second order term in most PDEs. Furthermore, the finer the grid, the "smoother" or rarer the $\epsilon$ transitions—unless we have totally random or noisy $\epsilon$ distributions. As a result, there are often (fine grid) sections within our cascade of meshes where geometric and algebraic operators are indeed the same or very close even though complex equation terms and material are present.
Multiplicative vs. additive schemes.

Multigrid solvers can be classified into additive solvers and multiplicative ones [2]. Additive solvers determine the equation system's residual on the finest mesh $\ell_{max}$, restrict this single residual to all levels $\ell$, and determine a correction on all levels concurrently. Finally, these corrections are added up, subject to suitably defined prolongations. The simplest multiplicative solvers determine the residual on a level $\ell = \ell_{max}$, smooth it, compute an updated residual, restrict this residual to the right-hand side of the next coarser level, correct the solution there by applying the scheme recursively, and finally add all corrections subject to a prolongation to the fine grid estimate.

Multiplicative solvers in general converge in fewer iterations than their additive cousins, as an update on one level propagates through to other levels prior to updates on those levels. A conceptual disadvantage of multiplicative multigrid is that solver steps on the coarser mesh resolution levels tend to struggle to exploit all hardware concurrency. Cores start to idle. Any strategy yielding tasks thus can expect that these tasks exploit idling cores at one point. This is one motivation for us to focus on improving concurrency for additive schemes—if the approach works for additive solvers, it pays off for multiplicative schemes too.

Additive solvers are often used solely as preconditioners, as they yield inferior convergence and face severe stability problems. They tend to overcorrect the solution and this overcorrection is more severe for increased numbers of grid levels [28]. It is thus convenient to damp the updates of the equation systems by a factor $\omega$ that depends on the grid levels. A convenient choice is to use an $\omega$ on the finest level, $\omega^2$ on the next coarser one and so forth. While this prevents overshooting, it tends to destroy multigrid convergence for larger systems, as the elliptic nature of (1) implies that any local change propagates through the whole domain. With shape functions with local support, this propagation can only be realised on coarser levels. However, exponential damping with $\omega, \omega^2, \omega^3, \ldots$ and $0 < \omega < 1$ suppresses exactly these coarse-level contributions. We therefore employ an adaptively damped additive scheme:

$$u_\ell \leftarrow u_\ell + \underbrace{S_\ell \, \big( R \hat r_{\ell+1} - A_\ell u_\ell \big)}_{\text{standard additive scheme with FAS}} - \underbrace{\tilde P \, \Big( S_{\ell-1} \, \tilde R \, \big( R \hat r_{\ell+1} - A_\ell u_\ell \big) \Big)}_{\text{additional additive damping of correction}}. \qquad (2)$$

$S_\ell$ here denotes the smoother approximating the impact of $A_\ell^{-1}$. Jacobi yields $S_\ell = \omega \, \mathrm{diag}^{-1}(A_\ell)$, i.e. here is where the original damping $\omega$ enters the equations. The scheme (2) projects a residual $r_\ell$ immediately one level further—possibly using a modified restriction $\tilde R$—before it evaluates the additive multigrid term on level $\ell$. When we update the solution on level $\ell$, we damp our actual update with the projection of an update step of this further restricted equation. The idea is that the auxiliary equation mimics the overshooting potential of the real additive correction running concurrently through the augmented $\tilde R$. To achieve this, we either apply a smoothed restriction operator $\tilde R$ to our auxiliary equation—as we use a Jacobi smoother to construct such an $\tilde R$, we speak of adaFAC-Jac—or choose $\tilde P$, $\tilde S$ and $\tilde R$ such that the resulting smoother resembles a BPX-like scheme [27]. The latter case uses solely $P$ and the injection $I$ for the auxiliary equations. We thus refer to it as adaFAC-PI.
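As a minimal illustration of the per-vertex smoothing step $S_\ell$ in the first term of (2), the following sketch assumes a Jacobi smoother and assumes that the matrix-vector product and the matrix diagonal are accumulated cell-wise during the grid traversal. The data layout and names are illustrative, not the paper's actual mesh implementation, and the adaFAC damping term is omitted:

```cpp
// Hedged sketch: one damped Jacobi update per vertex on level l,
// u_l <- u_l + omega * diag^{-1}(A_l) * (R rhat_{l+1} - A_l u_l).
struct VertexData {
  double u    = 0.0;  // weight of this vertex' shape function on level l
  double rhs  = 0.0;  // restricted hierarchical residual R * rhat_{l+1}
  double Au   = 0.0;  // cell-wise accumulated (A_l u_l) contribution
  double diag = 0.0;  // cell-wise accumulated diagonal entry of A_l
};

// Called once per vertex after all of its 2^d adjacent cells were traversed.
void additiveJacobiUpdate(VertexData& v, double omega) {
  v.u += omega * (v.rhs - v.Au) / v.diag;
  v.Au = 0.0;          // reset the accumulators for the next sweep
  v.diag = 0.0;
}
```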
adaFAC-x, i.e. adaFAC-PI and adaFAC-Jac, are interesting additive multigrid flavours, as they use Galerkin-based operator constructions twice per level. As a consequence, any inaccurate operator representation has twofold knock-on effects on coarser levels. With BoxMG and geometric operators, we have different combination opportunities to construct our operators, and it is clear that the algebraic BoxMG variant is subject to multifaceted input approximation inaccuracies if fine grid stencils are not determined correctly.

If all operators are geometric and we rely on rediscretisation on every level, we can implement the multigrid scheme in a matrix-free way, once we embed the entries of the vectors $u_\ell, r_\ell, \hat r_\ell, \ldots$ directly into the mesh vertices. Let our program traverse the mesh. Whenever we enter a cell, we load its adjacent $2^d$ vertices. Each vertex holds its corresponding $u, r, \ldots$ entry, i.e. one scalar per entry. Hanging vertices hold interpolated weights or zero, respectively. As rediscretisation lacks dependencies on other operators, we can construct all the multigrid operator ingredients we need on-the-fly. We compute them, apply them to the local data, and immediately add the impact of the operator application (matrix-vector product) to the residual data within the mesh. Once all $2^d$ cells surrounding one vertex have been traversed, we can directly apply our Jacobi point-smoother, restrict or prolongate all data [40, 28, 37]. We traverse our grid cells and thus accumulate the matrix-vector impact cell-wisely. For this, we require cell-wise (element-wise) stiffness matrices.

Let $\mathcal{S}$ denote a smoothing task of the multigrid algorithm, i.e. a task that realises the update of $S_\ell$ from (2) for one particular vertex/degree of freedom only. Let $\mathcal{A}^{(geo)}$ denote a (geometric) assembly task for a single element that is adjacent to the designated vertex/degree of freedom. A geometric, matrix-free implementation of multigrid then issues a series of

$$\underbrace{\mathcal{S} \circ \big( \mathcal{A}^{(geo)} + \mathcal{A}^{(geo)} + \ldots + \mathcal{A}^{(geo)} \big)}_{2^d \text{ assembly tasks for the } 2^d \text{ cells adjacent to one vertex}} + \mathcal{S} \circ \big( \mathcal{A}^{(geo)} + \ldots + \mathcal{A}^{(geo)} \big) + \mathcal{S} \circ \big( \mathcal{A}^{(geo)} + \ldots + \mathcal{A}^{(geo)} \big) + \ldots \qquad (3)$$

tasks over the grid entries, i.e. the unknowns. All operators are to be parameterised over the levels. As we work element-wise, we do not evaluate each cell task $\mathcal{A}^{(geo)}$ $2^d$ times. Instead, we set up the element matrix once, and immediately feed its impact on the surrounding $u_h$ values into the $2^d$ residuals. The smoother $\mathcal{S}$ then acts on the residuals. An assembly of the inter-grid transfer operators is omitted, as we know these operators and can hard-code them. The assembly determines one stencil, i.e. integrates

$$\int_{\Omega_h} \epsilon \, (\nabla u, \nabla \phi) \, dx = \sum_{c \in \Omega_h} \int_c \epsilon \, (\nabla u, \nabla \phi) \, dx \qquad (4)$$

over $2^d$ cells $c$. In our implementation, we exclusively work with an element-wise assembly where $\mathcal{A}^{(geo)}$ computes the outcome of (4) over one cell $c$. Equation (3) has $2^d$ tasks that feed into one smoother, though each $\mathcal{A}^{(geo)}$ task in turn feeds into $2^d$ $\mathcal{S}$ tasks—one for each vertex that is adjacent to the cell. It is convenient to evaluate each cell operator once, feed it into the $2^d$ follow-up steps, and thus to remove redundant tasks. We stress the entire procedure remains inherently additive however.

With explicit assembly, we run

$$(\mathcal{S} \circ \ldots \circ \mathcal{S}) \circ \Big( \ldots + \mathcal{A}^{(geo)}_{k^2 h} + \mathcal{A}^{(geo)}_{kh} + \mathcal{A}^{(geo)}_{h} \Big). \qquad (5)$$
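A minimal sketch of this element-wise accumulation behind (3), in which each cell adds its element matrix times the local solution values to the accumulators of its $2^d$ adjacent vertices before the per-vertex smoother from the previous sketch fires, could read as follows. Types and the vector-based mesh layout are illustrative; the real code streams cells rather than indexing global vectors:

```cpp
// Hedged sketch of the cell-wise, matrix-free residual accumulation.
#include <array>
#include <vector>

constexpr int d = 2;
constexpr int verticesPerCell = 1 << d;   // 2^d
using ElementMatrix =
    std::array<std::array<double, verticesPerCell>, verticesPerCell>;

struct VertexData {                       // mirrors the Jacobi sketch above
  double u = 0.0, rhs = 0.0, Au = 0.0, diag = 0.0;
};

struct Cell {
  std::array<int, verticesPerCell> vertex;   // indices of the adjacent vertices
  ElementMatrix A;                           // local assembly matrix
};

// One A^{(geo)} evaluation feeding into the 2^d adjacent vertices.
void touchCell(const Cell& c, std::vector<VertexData>& vertices) {
  for (int a = 0; a < verticesPerCell; ++a) {
    double Au = 0.0;
    for (int b = 0; b < verticesPerCell; ++b)
      Au += c.A[a][b] * vertices[c.vertex[b]].u;
    vertices[c.vertex[a]].Au   += Au;          // contribution to A_l u_l
    vertices[c.vertex[a]].diag += c.A[a][a];   // diagonal entry for Jacobi
  }
}
```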
Figure 2: Exact material parameter within a cell (left) and a splitting of the material parameter into $n^d$ quadrants for numerical integration (right).

In (5), the task symbol $\mathcal{A}^{(geo)}_\ell$ is a supertask bundling the evaluation of all the $\mathcal{A}^{(geo)}$ tasks on level $\ell$. This formalism relies on the insight that we can read multigrid cycles as iterations over one large equation system comprising all coarse grid equations if we commit to a generating system [17]. The addition in the assembly illustrates that submatrices within this large equation system can be constructed (assembled) concurrently, as the individual cells on all levels are independent of each other.

To make a finite element discretisation consistent, the assembly $\mathcal{A}^{(geo)}$ has to evaluate (4) over all cells consistently, i.e. in the same way for all of a cell's adjacent vertices/stencils. This is trivial for constant $\epsilon$, as we can extract weights and $\epsilon$ from the integral and integrate over the remaining shape functions analytically. That is, if we know $\epsilon$ within a vertex, we can precompute (4) for $\epsilon = 1$ and scale it upon demand. If $\epsilon \neq \mathrm{const}$, the computation of (4) is less straightforward and typically has to be done numerically. For this, one option is to approximate $\epsilon$. A polynomial approximation makes limited sense, as we are particularly interested in sharp $\epsilon$ transitions (material parameter jumps). Higher order polynomials would induce oscillations. We can however approximate $\epsilon$ as a series of constant values, i.e. we subdivide each cell into a Cartesian, equidistant subgrid with $n^d$ volumes. Per volume, we assume $\epsilon$ to be constant. For $n = 1$, such a subcell integration is equivalent to sampling $\epsilon$ once at the cell centre (Fig. 2). The numerical integration can be expensive—not due to the arithmetic load but due to the fact that $\epsilon$ lookups might be memory-access intense and thus slow—which strengthens the case for an element-wise realisation of the assembly, i.e. it is better to make $\mathcal{A}^{(geo)}$ act per cell and feed into the $2^d$ adjacent vertices rather than computing (4) per $\mathcal{S}$ invocation. The latter option would effectively integrate (4) $2^d$ times.

As we know that different $\epsilon$ distributions require different choices of $n$, it is convenient to parameterise the assembly tasks as $\mathcal{A}^{(geo)}(n)$. A fast assembly—either explicit or embedded into the solves—requires the evaluation of $\mathcal{A}^{(geo)}(n)$ to be fast; in particular as we evaluate each task once per cycle, i.e. multiple times overall. Therefore, it is in the interest of the user to choose $n$ as small as possible—$\mathcal{A}^{(geo)}(n)$'s workload is in $O(n^d)$—yet still reasonably accurate. Along the same lines, it is possible to distinguish different assembly realisations by means of their accuracy, i.e. whether we run them in double or single precision. A sketch of such a parameterised assembly task is given below.

An element-wise traversal of the set of meshes $\{\Omega_0, \Omega_1, \ldots, \Omega_{\ell_{max}}\}$ defines an ordering on the cells of the mesh. It yields a stream of cells. Additive multigrid's promise is that it exhibits a higher concurrency level than multiplicative schemes. Consequently, there is no order constraint on the cell enumeration within the stream; different to multiplicative multigrid where the pre-smoothing of cells within a level $\ell$ precedes cells of level $\ell - 1$.
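A minimal sketch of such an $\mathcal{A}^{(geo)}(n)$ task for $d = 2$, assuming bilinear shape functions and a midpoint rule per subcell, is given here. It is purely illustrative; the actual quadrature and cell geometry handling in the code may differ:

```cpp
// Hedged sketch: numerical integration of the 2D bilinear element stiffness
// matrix over an h x h cell, sampling the material parameter eps on an
// n x n subgrid (one sample per subcell, as described above).
#include <array>
#include <functional>

using ElementMatrix = std::array<std::array<double, 4>, 4>;

// Gradients of the four bilinear shape functions on the unit square at (x,y),
// node ordering (0,0), (1,0), (0,1), (1,1).
static std::array<std::array<double, 2>, 4> shapeGradients(double x, double y) {
  return {{{-(1 - y), -(1 - x)},
           { (1 - y), -x      },
           { -y,       (1 - x)},
           {  y,        x     }}};
}

// A^{(geo)}(n): one assembly task for one cell; eps takes coordinates relative
// to the cell's lower-left corner.
ElementMatrix assembleCell(double h, int n,
                           const std::function<double(double, double)>& eps) {
  ElementMatrix A{};                        // zero-initialised
  const double dx = 1.0 / n;                // subcell width in unit coordinates
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      const double cx = (i + 0.5) * dx, cy = (j + 0.5) * dx;  // subcell centre
      const double e = eps(cx * h, cy * h); // eps assumed constant per subcell
      const auto grad = shapeGradients(cx, cy);
      for (int a = 0; a < 4; ++a)
        for (int b = 0; b < 4; ++b)
          // For the 2D Laplacian, the Jacobian factors of the h-scaling cancel.
          A[a][b] += e * (grad[a][0] * grad[b][0] + grad[a][1] * grad[b][1]) * dx * dx;
    }
  return A;
}
```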
Figure 3: The tree nodes of a spacetree (right) span the multilevel adaptive Cartesian mesh (left) and also carry the local assembly matrices and $P$ and $R$ operator parts. Fine grid nodes (B, C, E, F, G, H, I) carry geometric operators, while all others carry algebraic operators. If tree nodes refine (node E), then the operators of the refined node (E) have to switch from geometric to algebraic, which in turn requires recomputes of all coarser ancestors within the tree (A and D).

It nevertheless pays off to realise such a partial level ordering, as it allows us to also integrate the bottom-up residual restriction within the stream. If we run the grid traversal on a parallel machine, the cell stream is split up into multiple streams. Obviously, the cell ordering yields the exact data access pattern for the vertices [40, 38], too, but this fact is not of primary interest here.

As soon as permanent recomputation of the stencils becomes too expensive, we have to memorise, i.e. store, the assembly matrices. If we continue to avoid the maintenance of an explicit matrix data structure, it is natural to store the local element matrices directly within the tree cells. We embed the stencils into the cell stream. This avoids the memory overhead of matrix data structures requiring meta data, sparsity pattern information and so forth, but it still has to pay for all actual matrix entries. They just are encoded within the mesh rather than within a dedicated data container. This approach works as long as we can guarantee that the cells are always visited in the same order (per core/rank) by the traversal. We work with our custom mesh and linear algebra implementation here. Such a constraint however would allow the scheme to be realised within other software offering matrix-free work through callbacks, too.

Once we have introduced this machinery, the in-stream storage can hold the algebraic operators $A^{(alg)}$ rather than their geometric counterparts: on the finest mesh level, we make $A^{(alg)}_{\ell_{max}} = A^{(geo)}_{\ell_{max}}$. On all other levels, we make $A^{(alg)}_{\ell}$ hold the Ritz-Galerkin operator instead of a rediscretisation. Finally, we can use the same implementation technique to hold algebraic inter-grid transfer operators, too. If all redundancies are eliminated—if we split up nodal operators over a vertex's adjacent cells, the distribution of the stencil is never unique—the memory footprint per cell within the stream thus grows by a factor of $\big(3^d + 2(2k-1)^d\big)/3^d$. We can now hold $P$ and $R$ explicitly within the stream however.

The present discussion introduces a storage scheme yet ignores the causal dependencies in the computation of the stored data (Fig. 3). That is, we assume that all data held in-situ is readily available; an assumption we explicitly pointed out to be wrong. A discussion on the computation rules plus the validity of all operators is handled within the subsequent section. Once we work with dynamically adaptive meshes, a task-based formalism inserts grid update tasks $\mathcal{U}$ into the task graph. These tasks either remove cells from the mesh or add cells. In the former case, they also remove entries from the stream.

Figure 4: Schematic illustration of classic multigrid's task orchestration over six cores. Time runs from left to right. $\mathcal{A}^{(geo)}_{\ell_{max}}$ denotes a stencil integration over a single cell on level $\ell_{max}$, $\mathcal{A}^{(alg)}_{\ell}$ denotes an algebraic operator assembly (using $A^{(alg)}_{\ell+1}$), $\mathcal{S}_\ell$ the application of one smoothing step on one cell, i.e. the matrix-vector product feeding into the actual smoother.
The assembly feeds into the first smoothing step, i.e. has terminated before we start cycling.

In the latter case, they insert entries. In a multigrid environment with algebraic operators, $\mathcal{U}$ tasks change entries of existing stencils: removing cells switches Ritz-Galerkin correction operators into discretisation stencils; adding cells switches the held operators the other way round. Both switches imply coarser operator changes that cascade, as the Ritz-Galerkin condition ensures that coarser operators depend on finer ones. This information update pattern is the subject of the discussion that follows.

To gain flexibility when and with what accuracy we assemble matrix entries, we introduce per mesh cell $c$ a marker

$$p(c) = (p_1, p_2, p_3)(c) \in \big( \mathbb{N}^+ \cup \{\bot, \top\} \big) \times \{\bot, \top\} \times \{0, 2, 3, \ldots, 8\}.$$

We reiterate that we operate in a generating system/spacetree context, i.e. there are cells on each and every mesh level and different levels logically overlap. The tuples are embedded into the cell stream as headers, i.e. they precede the matrix entries, and the tuple entries have the following semantics:

$p_1$: If the first entry holds $\bot$, we have not yet performed any integration of (4) with any discretisation $n$. As a consequence, the matrix entry stream (matrix linearisation) does not hold entries for the corresponding stiffness matrix. If the first entry holds $\top$, our algorithm has concluded that this stencil is computed with sufficient accuracy already. The corresponding entries are found as the next $3^d$ or $3^d + 2(2k-1)^d$ entries within the stream. If the first entry holds a natural number $p_1(c) = n$, the matrix stream hosts the entries of the matrix, too. In this case, these entries result from an integration of (4) with an $n^d$ subgrid, and we cannot be sure yet that this $n$ is sufficiently large.

Figure 5: Left: Construction of our delayed assembly. The stencil integration $\mathcal{A}_{\ell_{max}}$ is broken down into iterative substeps starting with a low-order approximation $\mathcal{A}^{(1)}_{\ell_{max}}$. We intermingle them with the first multigrid smoothing steps. Some stencils require further, more accurate integration $\mathcal{A}^{(n)}_{\ell_{max}}$. Each stencil update requires us to recompute the algebraic coarse grid operators. Right: Example task orchestration/execution scheme on a six core system.
0, a lazy multigrid code tests in every cell whether p ( c ) = ⊥ beforeit runs the local assembly or matrix-vector product. If p ( c ) = ⊥ , we either integrate (4) withan n d subgrid or compute the multigrid stencils plus the inter-grid transfer operators due to theBoxMG/Ritz-Galerkin formulation. Immediately after that p ( c ) ← ( (cid:62) , ⊥ , Our introductory paper sketching the delayed assembly strategy for the first time labelled the strategy as“lazy”. It is however not lazy as it eventually computes all required matrix entries, and the term lazy furthermorehas different semantics in programming languages. We thus dropped it in favour of “delayed”. (cid:96) are available whenwe hit (cid:96) − A (cid:96) − consisting of both the constructionof the element-wise operators plus inter-grid transfer operator entries of R (cid:96) − (cid:96) and P (cid:96)(cid:96) − is thus bydefinition already a series of ready tasks. They have no incoming, unresolved dependencies. Theycan be executed straight-away. The observation holds for both geometric and algebraic multigridoperator variants.For additive multigrid, this straightforward lazy stencil integration works if and only if we stickto rediscretisation and geometric transfer operators. It breaks down as we switch to algebraicoperators, unless we prescribe the order the levels are traversed, i.e. unless we ensure that thetraversal of level (cid:96) is complete before we move to level (cid:96) −
1. We may weaken this statement [40, 28]and enforce that only those elements from level (cid:96) + 1 within the input of a chosen vertex’s local P are ready. While this might be convenient for many codes, it eliminates some of additive multigrid’sasynchronicity and thus one of its selling points.With a lazy stencil integration, we can only utilise geometric coarse-grid operators in the firstcycle.Lazy evaluation is a particular flavour of a delayed operator assembly where we try to postponethe assembly as long as possible. If we weaken the information flow constraints within the assemblyor the accuracy demands on the fine grid operator, we can construct further flavours of a delayedassembly: If a proper global choice of n for the fine grid (as well as for the rediscretisation if we stick togeometric operators) is not known a priori, we can employ an adaptive parameter selection:Let A (geo) ( p ) denote the assembly of the local assembly matrix of one cell. For the evaluation of(4), it is parameterised by p , or p ’s first tuple entry, respectively, i.e. by the numerical subsamplingfactor for the cell. For an adaptive stencil integration, we make A (geo) ( p ) accept the marker plusthe current local element matrix as stored within the stream if p ( c ) (cid:54) = ⊥ . For p ( c ) = ⊥ , it issolely given the marker. A (geo) ( p ) returns a new local element matrix plus transfer operators, ifrequired, plus an updated marker p according to the following rules: p ( c ) = (cid:62) The task returns immediately and we continue to work with the existing assemblymatrix encoded in the cell stream. p ( c ) = ⊥ Assign p ( c ) ← p ( c ) ∈ N + The task evaluates (4) over the cell of interest. It uses a ( n + 1) d sub-grid with n = p ( c ) to discretise (cid:15) . The result is stored in a new element matrix A (new) . Next, thetask reads the previous matrix A (old) from the stream and compares the two matrices in themaximum element-wise matrix norm. This determines the new marker p ( c ) ← (cid:62) if (cid:107) A (new) − A (old) (cid:107) max (cid:107) A (old) (cid:107) max < Cn + 1 if (cid:107) A (new) − A (old) (cid:107) max (cid:107) A (old) (cid:107) max ≥ C (6)15inally, we keep A (new) instead of A (old) within the stream, i.e. future work will utilise theformer.It is obvious that a parameterisation of P (cid:96) +1 (cid:96) and R (cid:96)(cid:96) +1 makes limited sense. However, BoxMGmakes both operators depend directly on the operator on level (cid:96) + 1. This dependency propagatesthrough all the way to the fine grid. Therefore, both P and R depend indirectly on the n choice ofthe algorithm.After at most n max · ( (cid:96) max −
After at most $n_{max} \cdot (\ell_{max} - 1)$ steps ($n_{max} = \max_c \{p_1(c)\}$), all local equation systems are valid, if the grid is stationary and we tackle a linear problem. $n_{max}$ is the maximum integration accuracy over all cells that is required eventually. It is not known a priori. The scheme describes an adaptive quadrature rule, where the accuracy of the integrator is cell-dependent and determined by an iterative process. This iterative process terminates as soon as a further increase of the accuracy does not yield significantly improved stencils anymore.

With the iterative scheme at hand, it is straightforward to construct an asynchronous stencil integration, where the actual integration is deployed to a task of its own and runs in parallel to the solver's iterations (Fig. 5, right). We augment our previously outlined adaptive integration such that $p_2(c) = \bot$ holds initially for all cells. Whenever we require a new stencil, we set $p_2(c) = \top$, start the computation and set $p_2(c) = \bot$ upon completion. In a synchronised setup, any check of $p_1(c)$ waits (blockingly) upon $p_2(c) = \bot$. As long as $p_2(c) = \bot$, either no assembly task has ever been launched or the task yielding the new, improved representation has already terminated. As long as $p_2(c) = \top$, we wait (and do meaningful other work in a reasonable task environment).

Alternatively, we can operate in an anarchic fashion and not wait upon the $p_2(c)$ entry. Other than the initial stencil computation, we launch stencil tasks using nowait semantics, i.e. we do not wait for this subtask to terminate. Each task determines a new matrix $A^{(new)}$, compares $A^{(new)}$ to the previously computed one, stores the new matrix, updates the $p_1$ marker and finally sets $p_2(c) \leftarrow \bot$. It realises the third step of our adaptive integration. If the advanced integration of (4) has not terminated on time, our solver continues to use "old" stencil approximations. With an anarchic stencil integration, we have no control over when new stencils arrive and which integration accuracy is currently in use. In either case, fine grid stencil integrations are deployed to the background and once they drop in, all affected coarse grid operators become invalid in a Ritz-Galerkin sense.

It is obvious that the iterative, delayed stencil integration can be applied to all levels if we stick to rediscretisation. With algebraic operators, the technique applies only to the finest grid level. If we use iterative stencil integration on coarser levels, we have to control the termination criterion in (6) carefully: As the coarse equations are only correction equations and are "only" solved up to a mesh-dependent accuracy in classic multigrid terminology, it makes limited sense to choose the threshold $C$ there uniformly and small on all resolution levels.

If we work in a Ritz-Galerkin plus BoxMG environment, we can either deploy the computation of the three arising coarse operators to background tasks, too, or we can recompute these operators in each and every cycle. Given the limited and deterministic computational load, the latter might be reasonable.

Figure 6: Diagrammatic view of computing coarse grid equations prior to a solver iteration (left) compared to plugging it into a grid traversal of the actual solver.

If the operators however are determined in the background, we can spawn only those tasks that might actually yield changed operators.
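The anarchic variant sketched above boils down to one fire-and-forget task per cell, guarded by the atomic $p_2$ flag. The following minimal illustration uses a detached std::thread as a stand-in for the actual tasking runtime (e.g. OpenMP tasks with nowait semantics) and a hypothetical updateCellOperator helper; the record type is a trimmed-down version of the stream record sketched earlier, and we assume the cell record outlives the task:

```cpp
// Hedged sketch of spawning one background stencil integration with nowait
// semantics; the solver never blocks on the result.
#include <atomic>
#include <thread>

struct CellHeader { std::atomic<bool> p2{false}; /* p1, p3 omitted here */ };
struct CellRecord { CellHeader header;           /* packed operators omitted */ };

void spawnStencilUpdate(CellRecord& cell, int n,
                        void (*updateCellOperator)(CellRecord&, int)) {
  bool expected = false;
  // p2: only spawn if no integration task for this cell is already in flight.
  if (!cell.header.p2.compare_exchange_strong(expected, true)) return;
  std::thread([&cell, n, updateCellOperator]() {
    // Integrate (4) with an (n+1)^d subgrid, apply rule (6), compress and
    // store the result - all hidden behind the hypothetical helper.
    updateCellOperator(cell, n);
    cell.header.p2.store(false);   // result has dropped in: p2 <- bottom
  }).detach();                     // nowait: fire and forget
}
```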
An analysed tree grammar [20] formalises the requirements: If a matrix update changes the stencil associated with a vertex $v_\ell$ which in turn is in the image of a stencil $P_{\ell-1}^{\ell}$ associated with a vertex $v_{\ell-1}$, then the $2^d$ adjacent cells of $v_{\ell-1}$ should be flagged. In the next multigrid cycle, all flagged cells' $A_{\ell-1}, P, R$ computations should be repeated, taking the new fine grid operator $A_\ell$ into account. The same level-by-level information propagation—shown in Fig. 6—formalises how information propagates through both space and mesh resolutions if we recompute all three operators in each and every multigrid cycle.

If we employ dynamically adaptive mesh refinement, the mesh coarsening or refinement induces changes of operators. As a result, they implicitly trigger coarse grid operator updates. A similar argument holds for non-linear setups: If the non-linear component induces significant changes in the fine grid operator—for many operators, this might be a localised effect, i.e. the fine grid equation system might not change everywhere—we have to change a set of affected coarse grid operators. In both cases, it is important that we reset/remove $p_1(c) = \top$ from all affected coarse grid cells. Though $p_1(c) \leftarrow \bot$ would be a valid reset, it is reasonable to bookkeep the last $p_1(c) \neq \top$ used for a cell and to reset it to this value instead. In our implementation, we permanently recompute all coarse grid operators.

In an additive setting, delayed operator updates ripple through the equation system, i.e. the updates propagate upwards by at most one grid level per cycle. From a certain point of view, coarse grid operators on a level $\ell$ lag behind the fine grid operators on level $\ell_{max}$ by $\ell_{max} - \ell$ iterations.

As the local matrix accuracy increases, the number of significant bits grows commensurately. The number of bits which carry actual information and not solely bit noise increases, too. If a local stiffness matrix was computed using small $n$ values and there is rapid variation in $\epsilon$ over the cell, then it makes limited sense to store the whole matrix with eight bytes, i.e. double precision. Many bits represent accuracy that is not really there. The same argument implies that a computation of (4) can initially be done with reduced precision. If a coarse correction operator is influenced by some cells where the integration has not yet converged, it also makes limited sense to store the correction operator with full double precision. $P$, $R$ and $A_{kh}$ could even be determined with half precision. However, deducing a priori which accuracy is sufficient is a delicate task. We propose a different approach.

Our code neglects the insight that we could use reduced-precision computing—though this might be unreasonable given new hardware generations' support for half precision (binary16 or bfloat16). Instead, we solely use double precision compute routines, but focus on the potential of the significant bits w.r.t. the memory footprint: In all of our previous descriptions, the third tuple entry of each cell is set to $p_3(c) = 8$. This means that we use eight bytes, i.e. double precision, to store each datum. Let the operators $A, P, R$ be stored as sequences of flexible precision values.
If $2 \le p(c) < 8$, each operator entry is encoded as follows: The first bit is a sign bit. The subsequent $8(p(c)-1)-1$ bits hold the mantissa. The trailing eight bits hold the exponent. Our exponent is a plain signed char which is chosen as close as possible to the unbiased IEEE exponent. The mantissa's bits are thus non-normalised. If a double precision value is converted into this bespoke format, we hold it as $(-1)^s \cdot m \cdot 2^{e}$ with $0 \le m < 1$, i.e. we encode the datum as $(-1)^s \cdot \hat m \cdot 2^{\hat e}$ within the cell stream. $\hat e = e$ if $e$ fits into the range of a signed char; otherwise $\hat e$ is either $-$
$128$ or $127$, whichever is closest to $e$. We ignore the fact that $e$ is stored in a biased form in the IEEE standard; our $\hat e$ is stored as a C++ char, i.e. without any shift. If $2 \le p(c) < 8$, we hold $m$ in a truncated representation $\hat m$: we store only its $8(p(c)-1)-1$ leading bits, i.e. the number of stored mantissa bits follows $p(c)$.

With this flexible storage format, we can compute the maximum error $\delta$ induced by the new format after we have computed a new stencil/matrix: Prior to any computation, we load the involved matrices from the cell stream and convert them to IEEE double precision. If an integration task computes a new assembly matrix or if the Ritz-Galerkin/BoxMG routines determine a new matrix, the matrices are converted back into a representation where each matrix entry consumes only $p(c)$ bytes.

Throughout the conversion, we determine the maximum error $\delta$ introduced by our lossy compression. In line with the increase of the approximation accuracy $n$, we also increase the storage accuracy: If one additional byte would have made a substantial difference, we use a higher precision. Otherwise, we stick with the same precision. For this, we employ two magic thresholds: one threshold guides the adaptation of $n$, the other controls the increase of the storage footprint. Both are independent of each other yet used within the same rules.

Our modified storage scheme increases the number of bytes spent on the matrices where it is necessary. In subdomains and/or grid levels where high accuracy is not required, i.e. where the operators lack detail, we use a low-precision data format.

It is clear that this approach is of limited value in its plain version. However, it becomes powerful once we apply the idea not to the operators themselves but to their hierarchical surplus [37]. Let $A(n=1)$ be a local assembly matrix of a cell with one sampling point for $\epsilon$, and $P^{(\mathrm{geom})}$, $R^{(\mathrm{geom})}$ the operators resulting from $d$-linear interpolation and restriction. $P^{(\mathrm{geom})}$ and $R^{(\mathrm{geom})}$ can be hardcoded, while $A(n=1)$ is cheap to compute. Instead of storing $A(n)$ or the real $P$ and $R$, we store only their hierarchical surplus $\hat A = A(n) - A(n=1)$, $\hat P = P - P^{(\mathrm{geom})}$, $\hat R = R - R^{(\mathrm{geom})}$. These hierarchical values typically are very small, i.e. the resulting error from the compression is very small, too. Most of the cells do not require eight bytes per matrix entry to encode their (hierarchical) operators.

This hierarchical representation finally motivates us to allow for entries $p(c) = 0$. Whenever an operator (almost) equals the $n=1$ integration or the $d$-linear transfer operators, respectively, its hierarchical surplus is close to zero. If this surplus, measured in the Frobenius norm, is smaller than the prescribed threshold, our converted, truncated format would encode a zero anyway. In this case, we set $p(c) = 0$ and skip all storage within the cell stream. We need one byte for $p$ only. The entire process is illustrated in Fig. 7.

Figure 7: Flow chart of the stencil integration logic whenever we enter a cell throughout the traversal.

Figure 8: Left: Illustration of $\epsilon$ for four values of $\theta$; the smallest $\theta$ value is displayed at the bottom, the biggest at the top. Right: Adaptive mesh for our two-parameter setup where the discontinuous material transition in $\epsilon$ is not mesh-aligned.
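Before turning to the experiments, a compact sketch may help to fix the idea of the truncated format. The routine below is an illustrative reading of the description above, not the paper's actual implementation: it decomposes a double into a non-normalised mantissa $m \in [0.5, 1)$ and an exponent via `std::frexp`, clamps the exponent to a signed char, and keeps only the leading $8(p-1)-1$ mantissa bits for $p$ bytes per entry (bit packing into the cell stream is omitted).

#include <cmath>
#include <cstdint>
#include <cstdio>

struct Truncated {
  bool     sign;
  uint64_t mantissaBits;  // leading 8*(p-1)-1 bits of the mantissa
  int8_t   exponent;      // unbiased, clamped to [-128,127]
  int      p;             // bytes per entry
};

Truncated compress(double value, int p) {
  Truncated t{};
  t.p    = p;
  t.sign = std::signbit(value);
  int e  = 0;
  double m = std::frexp(std::fabs(value), &e);  // value = m * 2^e with m in [0.5,1) or m = 0
  if (e < -128) e = -128;                       // saturate: pick the closest admissible exponent
  if (e >  127) e =  127;
  t.exponent = static_cast<int8_t>(e);
  const int bits = 8 * (p - 1) - 1;             // mantissa bits available for p bytes
  t.mantissaBits = static_cast<uint64_t>(m * std::ldexp(1.0, bits));  // truncate
  return t;
}

double decompress(const Truncated& t) {
  const int bits = 8 * (t.p - 1) - 1;
  const double m = std::ldexp(static_cast<double>(t.mantissaBits), -bits);
  const double v = std::ldexp(m, t.exponent);
  return t.sign ? -v : v;
}

int main() {
  const double a = 3.14159265358979;
  for (int p = 2; p <= 8; ++p) {
    const Truncated t = compress(a, p);
    std::printf("p=%d  round-trip error=%.3e\n", p, std::fabs(a - decompress(t)));
  }
}

Running the loop for $p = 2, \ldots, 8$ shows how the round-trip error shrinks as more bytes are invested per entry, which is exactly the lever the adaptive storage scheme exploits.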
All experiments are run on an Intel Xeon E5-2650V4 (Broadwell) with 12 cores per socket clocked at 2.4 GHz. As we have two sockets per node, a total of 24 cores per node is available. These cores share 64 GB of TruDDR4 memory. Shared memory parallelisation is achieved through Intel's Threading Building Blocks (TBB), though we wrap TBB with a custom priority layer such that we have very fine-granular control over which tasks are run when.

For the experiments, we stick to $-\nabla(\epsilon \cdot \nabla) u = f$ on the unit square and supplement it with homogeneous Dirichlet conditions. The solution is initialised with random noise.

We use two test setups. In the first one, we solve with

$\epsilon(x) = 1 + c \prod_{i=1}^{d} e^{-\theta x_i} \cos(\pi \theta x_i)$,   (7)

where $c \in (0,1)$ is a fixed scaling constant. The above setup is "simple" as the analytic solution is $u(x) = 0$ for $f = 0$. Yet, the parameter $\theta$ determines how rapidly $\epsilon$ changes (Fig. 8). The bigger $\theta \gg 1$ is, the greater the number of integration points $n$ per cell that is required. However, the bigger $\theta$ is, the more localised the significant $\epsilon$ changes are, i.e. non-uniform $n$ choices are mandatory. Finally, additive multigrid tends to overshoot significantly if $\theta$ makes $\epsilon$ change with high frequency. We can therefore readily expose any instabilities.

In the second setup, we employ only two $\epsilon$ values, and we keep them constant over four regions of the domain (Fig. 8). In two opposing regions, we choose $\epsilon = 1$. The other two regions host $\epsilon \in \{10^{-1}, 10^{-2}, \ldots\}$, i.e. a sequence of successively smaller values. This setup is challenging as the jump between the material parameters can be arbitrarily large and the transition boundaries are not aligned with the adaptive Cartesian mesh. We cannot rely on the mesh to resolve the material changes properly; we have to rely on the local assembly matrices to accommodate the $\epsilon$ layout. As a consequence, a value of $n = 1$ within the four subdomains is sufficient. We might temporarily use $n = 2$ to decide to switch $p(c) \gets \top$. Where $\epsilon$ changes, however, we easily obtain values of $n \approx 20$ before our algorithm decides that the approximation quality of the jump is finally good enough.
All data report normalised residuals: we measure the residual and divide it by the residual in the very first cycle. The solver stops as soon as the initial residual is reduced by ten orders of magnitude. Our data usually report the number of fine grid solution updates, also called degree of freedom (DoF) updates. Overhead workload due to coarse grids is not taken into account on any axis.
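In symbols, with $r^{(n)}$ the residual after cycle $n$, $r^{(0)}$ the residual of the very first cycle, and $|\cdot|_h$ the discrete norm used in the plots, the plotted quantity and the stopping criterion just described read

\[
  \frac{|r^{(n)}|_h}{|r^{(0)}|_h}
  \qquad\text{and}\qquad
  \frac{|r^{(n)}|_h}{|r^{(0)}|_h} \;\le\; 10^{-10},
\]

respectively.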
We kick off our experiments with some studies on dynamic termination criteria. Most codes terminate the solve as soon as the normalised residual runs under a given threshold or stagnates. If we use delayed, asynchronous stencil integration, i.e. we do not wait per cell for the next step of the underlying integration to terminate, we run the risk of terminating the solve prematurely, i.e. before the right local assembly matrices have been computed. From an assembly point of view, this is a starvation effect: The assembly tasks have been issued yet have not been scheduled and thus cannot affect the solve. We end up with the solution to a "wrong" problem described by these inaccurate operators.

We investigate this hypothesis by simulating our test equation with a high parameter variation on a regular Cartesian mesh hosting 59,049 degrees of freedom. Four multigrid correction levels are employed. Our experiment tracks both the residual development and the number of background tasks pending in the ready queue. We reiterate that these tasks are issued with low priority such that the incremental improvement of the assembly process does not delay the solver iterations.

For smooth $\epsilon$ distributions, we have not been able to spot any deterioration of the residual evolution due to the delayed stencil integration (not shown). For rapidly changing $\epsilon$, e.g. a jump between $\epsilon = 1$ and a value several orders of magnitude smaller, the solver's behaviour changes dramatically, however (Fig. 9). Using a delayed assembly (stencil computation) deployed to background tasks, the solver iteration count required to reach the normalised residual threshold of $10^{-10}$ doubles compared to a solve where all operators are accurately computed prior to the solve kick-off. Initially, both methods show a similar rate of convergence; the delayed solve, however, soon enters a regime where its residual almost stagnates at an intermediate level. Throughout the initial residual decay, the number of pending background tasks reduces dramatically. While the residual stagnates, the number of background integration tasks remains constant, however.

Figure 9: Convergence of delayed operator evaluation vs. precomputed stencils/operators per iteration (left) and against real time (right).
Towards the end of the residual plateau, the number of background tasks drops to zero and the solve recovers and exhibits multigrid performance again.

Our solver spawns one background assembly task per fine grid cell initially and continues to work with a geometric approximation to the local assembly matrix from there on. Most of the assembly tasks are associated with cells covering smooth $\epsilon$ distributions. They thus discover almost immediately, i.e. after increasing $n$ once, that the assembly approximation is sufficiently accurate. They terminate and do not reschedule any tasks for their particular cell. Only the few tasks associated with regions close to the significant $\epsilon$ variations require repeated rescheduling while increasing $n$. By the time only these rescheduled tasks remain, the lack of accurate subcell material representations for some cells becomes detrimental to the rate of convergence. We reach a point where the current solution accuracy is balanced with the error of the stencils/assembly matrices that still have to be integrated properly. Updates to the cell matrices hence "introduce" error, or rather expose errors in the solution that the previously held stencil was unable to account for. Due to the elliptic nature of the operator, these errors spread through the entire domain. The entire solver stalls. Once all the background integration tasks have converged, i.e. do not reschedule themselves anymore, we regain multigrid convergence, as we finally solve the correct system which no longer changes.

Dynamic termination criteria for the equation system solver have to be designed carefully with delayed operator assembly, as the solver might converge or stagnate towards a wrong solution.

While our convergence considerations seem not to favour the delayed, asynchronous assembly, making the comparison with regard to real time changes the picture (Fig. 9). The increased iteration count in the solver is negated by the headstart the delayed evaluation gives the solver. The setup also highlights that precomputing accurate stencils can take more time than the solve itself. Finally, we see that the time-to-solution of the delayed assembly is superior to the explicit a priori assembly.

As we kick off with low-accuracy operators, we effectively merge the first few multigrid cycles with the actual assembly process: The point in time at which the delayed evaluation has computed an accurate solution representation is similar to the point in time at which the precomputed-stencil run has computed an accurate stencil, even though our precomputation routines employ a dynamic $n$ choice as well. The delayed method can therefore be seen as a way of computing a reasonably accurate initial guess, and the delayed assembly manages to maintain the lead from its headstart.

A delayed operator integration pays off in time-to-solution for rough material parameters.

Rippling with dynamically adaptive meshes

We continue with experiments where the grid is no longer fixed.
This adds an additional level of complexity, as the coarse grid operators in use change both due to the delayed integration and due to the information rippling. In a traditional AMR/multigrid setup, any change in the grid necessitates a change in all "coarser" equations. This introduces a recompute step per mesh refinement. Our methodology hides the recomputation cost behind the solve. On the downside, information propagates upwards within the resolution hierarchy by at most one level per cycle.

It is not clear whether such a massive delay in the coarse grid assembly could lead to stability problems or severe convergence penalties: The coarse grids are no longer acting upon the same equation as the fine grids all the time. While this effect affects the previous experiments, too, dynamic adaptive mesh refinement also makes the semantics of assembly matrices change: After each refinement, former fine grid discretisations suddenly become Ritz-Galerkin correction operators. With our tests, we investigate whether they still continue to push the solution in a direction that effectively minimises the error. Our setup initially starts as a regular Cartesian mesh hosting 68 degrees of freedom with a single multigrid correction level. Due to the refinement, this grows to the order of 250,000 degrees of freedom and seven multigrid correction levels.

Figure 10: Residual plots for the jumping coefficient problem with $\epsilon \in \{10^{-3}, 1\}$. All setups employ dynamically adaptive mesh refinement, either reassembling all operators accurately (left) or using delayed operator assembly with geometric operators (middle) or algebraic operators (right).

We use our second test setup with only one type of discontinuous material jump, spanning three orders of magnitude. Initial results are given for $\epsilon \in \{1, 10^{-3}\}$ (Fig. 10). Our dynamic adaptivity criterion evaluates the solution's gradient over $\Omega$ after each multigrid cycle and picks the degrees of freedom carrying the top 10% of the absolute gradient values. We refine around these vertices and continue. Convergence requires a substantial total number of DoF updates (cf. Fig. 10). If a solve yields a residual that is 100 times bigger than the initial residual, we terminate the solver, even though the well-defined ellipticity implies that the solver eventually would "converge back". Our benchmark compares three different solver flavours with each other. In the first variant, we immediately determine an accurate fine grid operator whenever we refine the grid. Furthermore, we stop after each refinement and reassemble all coarse operators accurately before we continue. Algebraic BoxMG operators and Ritz-Galerkin coarse operators are used. In the second variant, we continue immediately whenever we refine, but we use geometric inter-grid transfer operators. The fine grid operators are improved through delayed integration and eventually yield improved Ritz-Galerkin coarse grid operators. In the third variant, we finally let the inter-grid transfer operators ripple, too.

The code with a complete re-assembly after each refinement step converges with a rather shallow gradient first. Throughout this phase, the grid is refined on alternate cycles. Once the grid becomes stationary, the solver exhibits a linear residual descent with a steeper gradient.
If we combine geometric inter-grid transfer operators with dynamic refinement, Ritz-Galerkin operator computations and delayed integration, the adAFAC-PI solver variant suffers from a massive residual increase, while the adAFAC-Jac variant outperforms our reassembly-based solvers by more than a factor of five. If all operators are algebraic, adAFAC-Jac suffers from significant, temporary residual explosions from which it eventually recovers. adAFAC-Jac is more robust yet still not as fast as its cousin with geometric inter-grid transfer operators.

For both solver variants, our adaptive mesh refinement reduces the approximation accuracy temporarily, as it replaces mesh cells that are likely fed by high accuracy stencils with finer mesh cells carrying only one integration point. This induces oscillations manifesting in temporary residual spikes. As the fine grid cells start to improve their integration accuracy iteratively, the overall system accuracy recovers. Until this is complete, the residual can continue to increase by many orders of magnitude, as the coarse grids solve an equation that is no longer a valid correction and hence push the solution in the wrong direction. With algebraic inter-grid transfer operators, this effect is more distinct than with geometric operators: We know that geometric operators spanning big discontinuities induce oscillations on the fine grid. In the present case, we run into situations where algebraic inter-grid transfer operators yield fine grid corrections anticipating the real material parameter behaviour, but the new fine grid discretisation is not yet ready to mirror them.

Rippling can cause dynamic mesh refinement to introduce massive residual deterioration. Rippling yields temporarily incompatible equation system configurations.

A straightforward fix to this behaviour would be an inheritance mechanism for the number of cell integration points $n$: If a cell results from a coarse grid cell with $n$ integration points, it could immediately start a first fine grid integration with $n k^d$ approximation points. However, such an approach would become a caricature of the delayed approach, as we would induce expensive assembly phases spread all over the solve.

We continue with an even harder setup in which the smaller $\epsilon$ value is decreased further (Fig. 11). However, we use the insight about the maximum rippling speed as follows: We negate correction steps on a coarse level after each refinement until we observe that the next finer level operator we depend on has been updated at least once, either due to the Ritz-Galerkin recomputation or due to the delayed integration. This idea applies recursively to the next finer level if it is refined further; a compact sketch of this gating rule follows below.

We recover the stability of multigrid with an explicit reassembly, though we need around twice as many DoF updates compared to a classical version. The improved stability is not dissimilar to classic multigrid theory, where the F-cycle requires a higher order interpolation. We use classic $d$-linear interpolation here whenever we introduce new vertices. As we switch off the coarse grid corrections, we effectively smooth out this interpolation with a Jacobi step before we continue with multigrid. The multigrid in turn is not switched on immediately; instead we effectively work our way through a two-grid code, a three-grid code, and so forth.
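A minimal way to express this gating is per-level bookkeeping that records whether a level has been refined and whether the operators of all finer levels have been delivered at least once since then. The C++ sketch below is an illustrative reading of the rule with hypothetical names; it is not the actual implementation inside the grid traversal.

#include <cstddef>
#include <vector>

// Per-level bookkeeping for the gating rule described above (names are illustrative).
struct Level {
  bool refinedSinceLastOperatorUpdate = false;  // set when AMR changes this level
  bool operatorUpdatedOnce            = false;  // set when Ritz-Galerkin/delayed assembly delivers
};

// Index convention: larger index = finer level. A coarse level l may contribute its
// correction only if every finer level it depends on has delivered at least one
// operator update since its last refinement.
bool coarseCorrectionAllowed(const std::vector<Level>& levels, std::size_t l) {
  for (std::size_t finer = l + 1; finer < levels.size(); ++finer) {
    if (levels[finer].refinedSinceLastOperatorUpdate &&
        !levels[finer].operatorUpdatedOnce) {
      return false;  // skip ("negate") the correction on level l in this cycle
    }
  }
  return true;
}

The recursion over finer levels mirrors the statement above: a coarse correction only re-enters the cycle once every finer level it depends on has caught up.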
In the first solver phase, where we add new grid elements frequently, we only run series of fine grid smoothing steps for the majority of the cycles. The residual decays nevertheless, as most errors that can be resolved by newly introduced vertices here are high frequency errors which are damped out efficiently.

Figure 11: adAFAC-Jac for the jumping coefficient problem with the harder parameter choice. We compare exact assembly to a Ritz-Galerkin coarse grid computation with geometric inter-grid transfer operators and to a fully algebraic formulation using BoxMG. The latter two options temporarily switch off coarse grids until multigrid information has rippled through.

At the same time, switching off coarse grid corrections tends to free compute resources which can be used to handle further stencil integrations.

It is reasonable to pair up delayed stencil integration with a careful choice of which coarse grid operators are ready to be used in a multigrid cycle.

For our memory footprint studies, we return to the first test setup and study both a grid with $h \approx 0.004$ and one with $h \approx 0.038$.
Per multigrid cycle, we track the average number of bytes per element matrix entry, the maximum number of integration points per cell that we need, and the amount of memory saved due to delayed assembly in combination with our hierarchical storage and data compression. Our data focus on the fine grid only, i.e. we neglect coarse grid effects. They would blur the message and contribute to the memory footprint only marginally. The setup is configured such that the absolute error that we introduce by storing a truncated version of the hierarchical surplus stays below a threshold close to machine precision. As we work with the hierarchical surplus, it is reasonable to use an absolute rather than a relative threshold. Finally, we use the delayed operator approximation and stop the iterative computation as soon as two subsequent evaluations with $n$ and $n+1$ integration points differ by no more than one percent in relative terms.

For a rather smooth parameter choice, we see that the delayed operator integration stops after it has tested the local assembly matrices for $n = 2$ against the $n = 1$ variants. It cannot identify a difference exceeding one percent (Table 1). The lowest accuracy approximation is consequently used all the way through. We store the used assembly matrix relative to the $n = 1$ approximation. As the two coincide here, we do not actually have to store the matrix; it is sufficient to bookmark a one-byte marker that flags that there is no difference.

Table 1: We track the maximum number of integration points per axis $n$, the average number of points used over all cells, and the compression factor on a fine grid with $h \approx 0.038$ for (7).
        θ = 1                      θ = 16                     θ = 64
Cycle   max{n}  avg. n  compr.     max{n}  avg. n  compr.     max{n}  avg. n  compr.
1       1       1.00    128.00     1       1.00    128.00     1       1.00    128.00
2       1       1.00    128.00     2       1.00    118.00     2       1.07     22.64
3       1       1.00    128.00     3       1.00    118.00     3       1.15     22.64
4       1       1.00    128.00     4       1.00    118.00     4       1.22     22.64
5       1       1.00    128.00     5       1.01    118.00     5       1.29     22.64
6       1       1.00    128.00     6       1.01    118.00     6       1.36     22.64
7       1       1.00    128.00     7       1.01    118.00     7       1.44     22.64
8       1       1.00    128.00     8       1.01    118.00     8       1.51     22.64
9       1       1.00    128.00     9       1.01    118.00     9       1.58     22.64
10      1       1.00    128.00     10      1.01    118.00     10      1.65     22.64

A full element stiffness matrix requires $2^d \cdot 2^d \cdot 8$ bytes; for $d = 2$, this amounts to 128 bytes, which explains the compression factor of 128 whenever only the one-byte marker is stored. For less smooth $\epsilon$ distributions in (7), the delayed stencil integration increases the approximation accuracy $n$ of at least one cell per iteration. As this maximum grows monotonously while the average of the integration point choices remains close to 1, this is an extremely localised effect. Most of the cells are sufficiently accurate with $n = 1$, though the difference between the $n = 1$ rediscretisation and the actual element matrices is no longer negligible. We have to store the hierarchical difference and thus reduce the overall compression factor.

For even higher $\theta$ choices, this reduction in compression efficiency becomes more dominant. The average $n$ also starts to grow, i.e. more cells require a local assembly matrix which differs from a simple $n = 1$ rediscretisation. We still reduce the memory footprint by more than one order of magnitude, however.

Our delayed integration in combination with compressed hierarchical storage makes the memory footprint of the solver increase over time.

We "overcompress" all operators initially and gradually approach the most aggressive compression we can use without a loss of significant bits. This is an advantageous property for algorithms with dynamic adaptivity which build up the mesh iteratively. As they start from small meshes, there is a low workload for the first few iterations. The required memory, i.e. the data volume to be transferred back and forth between cores and main memory, increases as the solver continues and becomes more expensive. Conversely, a finer mesh width eradicates the $n$-distribution observations, i.e. the simulation can use $n = 1$ all over the domain. For the finer grid with $h \approx 0.004$, we repeat the study of $n$ for the three setups from above: we return to a compression ratio of 128 and an average $n$ close to 1.

If the total memory footprint increases due to dynamic AMR, the delayed integration and compression in return reduce the average memory footprint per mesh cell.

Figure 12: Runtime per grid sweep for one discretisation with various integration/tasking configurations on a small grid (left) and on a grid with $h \le 0.005$ (right). The configurations comprise $n = 1$, i.e. no (delayed) tasks, as well as $n \in \{8, 32\}$ with task fractions $f \in \{0.1, 0.01\}$, each run delayed and delayed+asynchronous.
Figure 13: Task distribution/placement for one setup with four cores. Top: no delayed tasking is used, i.e. each cell immediately determines an improved operator before it continues. Bottom: we use an anarchic, i.e. asynchronous, delayed operator integration. Brown labels denote compute work, red denotes spinning (active waits), green denotes idling. The graph is a zoomed-in snapshot extracted from the total execution, which spans 709.5 s (a priori integration) vs. 428.3 s (asynchronous, delayed integration).
We wrap up our experiments with simple single-node studies; the compression and tasking paradigm has a node-local effect only. The experiments run through a series of setups per tested grid. First, we assess the pure scalability of the code without any delayed integration and furthermore fix $n = 1$. Next, we prescribe $n > 1$ and spawn integration tasks for a fraction $f \in \{0.1, 0.01\}$ of the cells, i.e. between one and ten percent of the cells spawn tasks. As pointed out before, this fraction is not fixed in real applications. We fix it manually here to assess the impact of our idea on scalability. Finally, we run each experiment with a delayed integration twice: In the baseline, the synchronisation is a preamble to the cell evaluation. In the alternative test, there is no synchronisation, i.e. we spawn the integration and do not actively wait for the result at any point. We work totally asynchronously. The setup is chosen such that the task spawn pattern mimics the situation for high $\theta$ values, i.e. all spawned tasks correspond to cells close to the coordinate system axes.

The partitioning with $n = 1$ yields reasonable performance (Fig. 12). This obviously is a "flawed" setup from a mathematics point of view, yet it confirms that the underlying solver does scale in principle. As the workload is deterministic (it is hard-coded and does not use any additional tasking), the setup also clarifies what overhead any tasking with $n > 1$ adds on top. Once a fraction $f$ of the cells spawns integration tasks, we indeed observe a deteriorated scalability. This is due to the fact that the high-workload cells cluster along strong $\theta$ variations. We use a geometric decomposition of the domain before we deploy the grid to the cores, and this decomposition tries to avoid disconnected partitions. As a consequence, only one or a few cores are responsible for all the high-workload cells (Fig. 13). With the anarchic tasking, we see that the scalability curve flattens out again and that we gain performance. This difference is greater with higher workload per integration and with higher core count.

The asynchronous, delayed element integration helps to regain some scalability for unbalanced setups.

We observe that the cores that run out of work towards the end of their mesh traversal pick up some of the pending integration tasks spawned by overbooked colleague threads. Heavy integration tasks automatically slot into "idle" time of the baseline solver. The delayed, asynchronous integration yields a solver with a performance and scalability profile that is comparable to purely geometric multigrid where all operators are computed geometrically with $n = 1$ sampling points per cell.

Our scalability tests fix the fraction $f$ of cells that require an improved integration as well as $n$. They thus study only the scalability behaviour of one particular multigrid cycle. If we study the whole time-to-solution of a solver, we find that this behaviour typically translates into a walltime of around 2/3 of the baseline. The baseline here is an implementation that uses exactly the same code base yet realises the lazy evaluation pattern, i.e. accurately computes all operators prior to their first usage. Walltime always comprises both the assembly phase and the solve phase.

Matrix assembly becomes a non-negligible part of the overall cost of a multigrid solve within many application landscapes. It is thus important to optimise this step, too. Our proposed strategy to achieve this is three-fold: First, we abandon the idea of making the assembly fast. We instead make it more expensive, as we switch from an a priori integration of the underlying weak form to an iterative approach.
In our naive implementation without any hierarchical numerical integration, this leads to redundant, repeated evaluations of sampling points. Nevertheless, we obtain an assembly that yields rough approximations quickly. It reduces algorithmic latency. The actually accurate integration is then delivered in the background of the actual computation. We hide this computational cost. Second, we introduce an anarchic variant of this delayed integration. We thus ignore previously existing synchronisation points and obtain very high scalability. Finally, we propose a compressed-accuracy storage format where the evolution of the data footprint follows the integration accuracy used. In particular around start-up, we operate with low memory footprints and, hence, low memory bandwidth demands.

Our assembly ideas face two extreme cases: setups where expensive, algebraic operator integration is not required (well-shaped domains with constant $\epsilon$, for example, give us setups where pure geometric multigrid succeeds), and setups for which an accurate operator computation is essential, as material parameters jump dramatically. For the former case, delayed, asynchronous stencil integration might introduce too much overhead, as it has to integrate each stencil at least twice to come to the conclusion that a more accurate integration is not required. In the latter case, it might yield a non-robust implementation. Even though we use inaccurate numerical approximations initially, we obtain correct solutions with a reduced time-to-solution for our tests, despite the increased computational workload. Stability is to be expected given that we focus on elliptic, linear problems. If properly implemented, these equations always yield the right solution regardless of the initial guesses provided to them. Even if our anarchic, delayed approach introduces slightly incorrect initial iterates, we remain stable. Our data however suggest that we have to be very careful with dynamic termination criteria, and that it is very reasonable to anticipate whether operator parts are completely off. Trying to correct iterates stemming from inaccurate discretisations overly aggressively can cause instabilities in the solution behaviour, though only temporarily. We hence propose to either fall back to geometric features within the mesh, which stabilises the convergence behaviour, or to skip multigrid updates rather than apply updates that introduce error.

Through the lens of additive multigrid, we have looked at worst-case scenarios. Inaccurate operators here should, in theory, propagate immediately through the whole multiscale system, and there are no inherent low-concurrency phases where tasks can naturally slot in. Even a dominance of cells that do not require iterative, accurate integration has not penalised our runtime and footprint significantly. The latter results from our low-precision storage. For the runtime, a single-sample-point vs. two-sample-point integration does not make a massive difference as long as these are per-cell, blocked (likely cache-local) operations, and we can hence hide any compute overhead. The fact that our ideas have shown promise suggests that they will also be successful for real-world challenges and other solver types.
It is a natural next step to investigate such more complex setups (time-stepping codes where multigrid is only a building block, non-linear PDEs, or convection-dominated systems, for example) and to study the impact of our proposed techniques in more detail for multiplicative solvers and more effective smoothers, which might be more sensitive to inaccurate stencils. One of the most appealing strategies in an era of shrinking or stagnating memory per core is our idea to store operator entries with truncated precision. So far, we use this reduced precision only to store data. In the long term, it is natural to exploit it also for mixed- or reduced-precision computing. This is timely as we are currently witnessing the introduction of reduced-precision compute formats driven by the success of machine learning applications.
References

[1] PETSc manual pages. Accessed: 2020-02-07.

[2] P. Bastian, G. Wittum, and W. Hackbusch, Additive and multiplicative multi-grid – a comparison, Computing, 60 (1998), pp. 345–364.

[3] J. H. Bramble, J. E. Pasciak, and J. Xu, Parallel multilevel preconditioners, Mathematics of Computation, 55 (1990), pp. 1–22.

[4] A. Brandt, Multi-level adaptive techniques (MLAT) for singular-perturbation problems, Numerical Analysis of Singular Perturbation Problems, (1979).

[5] A. Brandt, Guide to multigrid development, in Multigrid Methods, Springer, 1982, pp. 220–312.

[6] A. Brandt, J. Brannick, K. Kahl, and I. Livshits, Bootstrap AMG, SIAM Journal on Scientific Computing, 33 (2011), pp. 612–632.

[7] M. Brezina, A. J. Cleary, R. D. Falgout, E. Henson, J. E. Jones, T. A. Manteuffel, S. F. McCormick, and J. W. Ruge, Algebraic multigrid based on element interpolation (AMGe), SIAM Journal on Scientific Computing, 22 (2001), pp. 1570–1592.

[8] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge, Adaptive smoothed aggregation (αSA) multigrid, SIAM Review, 47 (2005), pp. 317–346.

[9] M. Brezina, R. Falgout, S. MacLachlan, T. Manteuffel, S. McCormick, and J. Ruge, Adaptive algebraic multigrid, SIAM Journal on Scientific Computing, 27 (2006), pp. 1261–1286.

[10] P. M. De Zeeuw, Matrix-dependent prolongations and restrictions in a blackbox multigrid solver, Journal of Computational and Applied Mathematics, 33 (1990), pp. 1–27.

[11] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, Inexact Newton methods, SIAM Journal on Numerical Analysis, 19 (1982), pp. 400–408.

[12] J. E. Dendy, Black box multigrid, Journal of Computational Physics, 48 (1982), pp. 366–386.

[13] J. E. Dendy and J. D. Moulton, Black box multigrid with coarsening by a factor of three, Numerical Linear Algebra with Applications, 17 (2010), pp. 577–598.

[14] J. Dongarra, J. Hittinger, J. Bell, L. Chacon, R. Falgout, M. Heroux, P. Hovland, E. Ng, C. Webster, and S. Wild, Applied Mathematics Research for Exascale Computing. Online, DOE ASCR Exascale Mathematics Working Group, 2014.

[15] W. Eckhardt, R. Glas, D. Korzh, S. Wallner, and T. Weinzierl, On-the-fly memory compression for multibody algorithms, in Advances in Parallel Computing, Volume 27: Parallel Computing: On the Road to Exascale, ParCo Conferences, IOS Press, 2015, pp. 421–430.

[16] B. Gmeiner, H. Köstler, M. Stürmer, and U. Rüde, Parallel multigrid on hierarchical hybrid grids: a performance study on current high performance computing clusters, Concurrency and Computation: Practice and Experience, 26 (2014), pp. 217–240.

[17] M. Griebel, Zur Lösung von Finite-Differenzen- und Finite-Element-Gleichungen mittels der Hierarchischen-Transformations-Mehrgitter-Methode [On the solution of the finite-difference and finite-element equations through the hierarchical-transformational multigrid method], Technische Universität München, Institut für Informatik, 1990.

[18] W. Hackbusch, Iterative Solution of Large Sparse Systems of Equations, vol. 95 of Applied Mathematical Sciences, Springer, second edition, 2016.

[19] L. Hart and S. F. McCormick, Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors: Basic ideas, Parallel Computing, 12 (1989), pp. 131–144.

[20] D. E. Knuth, The genesis of attribute grammars, in WAGA: Proceedings of the International Conference on Attribute Grammars and their Applications, P. Deransart and M. Jourdan, eds., Springer-Verlag, 1990, pp. 1–12.

[21] B. Lee, S. F. McCormick, B. Philipp, and D. J. Quinlan, Asynchronous fast adaptive composite-grid methods for elliptic problems: Theoretical foundations, SIAM Journal on Numerical Analysis, 42 (2004), pp. 130–152.

[22] C. Lu, X. Jiao, and N. M. Missirlis, A hybrid geometric + algebraic multigrid method with semi-iterative smoothers, Numerical Linear Algebra with Applications, 21 (2014), pp. 221–238.

[23] J. M. Martínez and L. Qi, Inexact Newton methods for solving nonsmooth equations, Journal of Computational and Applied Mathematics, 60 (1995), pp. 127–145.

[24] S. F. McCormick, Multilevel Projection Methods for Partial Differential Equations, SIAM, 1992.

[25] S. F. McCormick and D. J. Quinlan, Asynchronous multilevel adaptive methods for solving partial differential equations on multiprocessors: Performance results, Parallel Computing, 12 (1989), pp. 145–156.

[26] S. F. McCormick and J. Thomas, The fast adaptive composite grid (FAC) method for elliptic equations, Mathematics of Computation, 46 (1986), pp. 439–456.

[27] C. D. Murray and T. Weinzierl, Dynamically adaptive FAS for an additively damped AFAC variant, arXiv preprint arXiv:1903.10367, (2019).

[28] B. Reps and T. Weinzierl, Complex additive geometric multilevel solvers for Helmholtz equations on spacetrees, ACM Transactions on Mathematical Software (TOMS), 44 (2017), pp. 1–36.

[29] J. Rudi, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. W. J. Staar, Y. Ineichen, C. Bekas, A. Curioni, and O. Ghattas, An extreme-scale implicit solver for complex PDEs: Highly heterogeneous flow in Earth's mantle, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, ACM, 2015, pp. 5:1–5:12.

[30] J. W. Ruge and K. Stüben, Algebraic multigrid, in Multigrid Methods, SIAM, 1987, pp. 73–130.

[31] R. S. Sampath, S. S. Adavani, H. Sundar, I. Lashuk, and G. Biros, Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees, in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, IEEE Press, 2008, p. 18.

[32] R. S. Sampath and G. Biros, A parallel geometric multigrid method for finite elements on octree meshes, SIAM Journal on Scientific Computing, 32 (2010), pp. 1361–1392.

[33] B. Smith, P. Bjorstad, and W. Gropp, Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, 2004.

[34] R. Speck, D. Ruprecht, M. Emmett, M. Bolten, and R. Krause, A space-time parallel solver for the three-dimensional heat equation, Parallel Computing: Accelerating Computational Science and Engineering (CSE), 25 (2014), pp. 263–272.

[35] H. Sundar, G. Biros, C. Burstedde, J. Rudi, O. Ghattas, and G. Stadler, Parallel geometric-algebraic multigrid on unstructured forests of octrees, in SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE, 2012, pp. 1–11.

[36] U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid, Academic Press, 2000.

[37] M. Weinzierl and T. Weinzierl, Quasi-matrix-free hybrid multigrid on dynamically adaptive Cartesian grids, ACM Transactions on Mathematical Software (TOMS), 44 (2018), pp. 1–44.

[38] T. Weinzierl, The Peano software: parallel, automaton-based, dynamically adaptive grid traversals, ACM Transactions on Mathematical Software (TOMS), 45 (2019), pp. 1–41.

[39] T. Weinzierl and T. Köppl, A geometric space-time multigrid algorithm for the heat equation, Numerical Mathematics: Theory, Methods and Applications, 5 (2012), pp. 110–130.

[40] T. Weinzierl and M. Mehl, Peano – a traversal and storage scheme for octree-like adaptive Cartesian multiscale grids, SIAM Journal on Scientific Computing, 33 (2011), pp. 2732–2760.

[41]