Scaling Structured Multigrid to 500K+ Cores through Coarse-Grid Redistribution
ANDREW REISNER, LUKE N. OLSON, AND J. DAVID MOULTON

Department of Computer Science, University of Illinois at Urbana-Champaign.
Applied Mathematics and Plasma Physics, Los Alamos National Laboratory.
Los Alamos report LA-UR-17-22886.

Abstract.
The efficient solution of sparse linear systems resulting from the discretization of partial differential equations is crucial to the performance of many physics-based simulations. The algorithmic optimality of multilevel approaches for common discretizations makes them a good candidate for an efficient parallel solver. Yet, modern architectures for high-performance computing systems continue to challenge the parallel scalability of multilevel solvers. While algebraic multigrid methods are robust for solving a variety of problems, the increasing importance of data locality and the cost of data movement in modern architectures motivate the need to carefully exploit structure in the problem. Robust logically structured variational multigrid methods, such as Black Box Multigrid (BoxMG), maintain structure throughout the multigrid hierarchy. This avoids the indirection and increased coarse-grid communication costs typical in parallel algebraic multigrid. Nevertheless, the parallel scalability of structured multigrid is challenged by coarse-grid problems where the overhead in communication dominates computation. In this paper, an algorithm is introduced for redistributing coarse-grid problems through incremental agglomeration. Guided by a predictive performance model, this algorithm provides robust redistribution decisions for structured multilevel solvers. A two-dimensional diffusion problem is used to demonstrate the significant gain in performance of this algorithm over the previous approach that used agglomeration to one processor. In addition, the parallel scalability of this approach is demonstrated on two large-scale computing systems, with solves on up to 500K+ cores.
Key words. multigrid, structure, parallel, scalability, stencil
AMS subject classifications.
1. Introduction.
The efficient solution of large, sparse linear systems resulting from the discretization of elliptic partial differential equations (PDEs) is crucial to the performance of many numerical simulations. Although there has been significant progress in developing general Algebraic Multigrid (AMG) solvers [27], modern high-performance computing (HPC) architectures continue to pose significant challenges to parallel scalability and performance (e.g., [3, 13, 5]). These challenges include reducing data movement, increasing arithmetic intensity, and identifying opportunities to improve resilience, and are more readily addressed in settings where problem structure can be identified and exploited. For example, robust structured variational multigrid methods, such as Black Box Multigrid (BoxMG) [10, 11], take advantage of direct memory addressing and fixed stencil patterns throughout the multigrid hierarchy to realize a 10× speed-up over AMG for heterogeneous diffusion problems [20]. In addition, the communication patterns in a parallel BoxMG solve are fixed and predictable throughout the multigrid hierarchy. Here we explore using this information to improve the parallel scalability and performance of elliptic solves for problems with structure.

The meshing strategy used in the discretization of a PDE has a significant impact on the amount of structure that can be exploited by the solver. For example, single-block locally structured grids can be body fitted to capture smooth non-planar geometries [9], and can use embedded boundary discretization techniques to add additional flexibility to the representation of object boundaries. Robust structured variational methods, such as BoxMG, are directly applicable to these cases. In contrast, fully unstructured grids can capture very complex geometries, including non-smooth features over a range of scales, but demand general algebraic multilevel solvers, such as AMG or smoothed aggregation AMG. The needs of many applications lie between these extremes, and a variety of adaptive or multi-mesh strategies have been developed to preserve structure and enable its use in the discretization and solver. These approaches generally lead to specialized hybrid solvers, favoring structured techniques at higher levels of refinement and algebraic techniques below a suitably chosen coarse level [25]. While these hybrid solvers present a variety of challenges, a single-block logically structured solver is a critical component of their design and performance.

A common approach to parallel multigrid solvers for PDEs is to partition the spatial domain across available processor cores. However, on coarser levels, the local problem size decreases and the communication cost begins to impact parallel performance. A natural approach to alleviate this problem is to gather the coarsest problem to a single processor (or redundantly to all the processors), and to solve it with a serial multigrid cycle. This approach was first motivated by a performance model of early distributed memory clusters [14] where core counts were quite small, and it was used in the initial version of parallel BoxMG.
Unfortunately, as the weak scaling study in Figure 1 shows, this approach scales linearly with the number of cores and, hence, is not practical for modern systems. A modification of this approach that gathers the coarse problem to a multi-core node [23], and then leverages OpenMP threading on that node, improves the performance but does not address the scaling. This challenge of reducing communication costs at coarser levels is even more acute in AMG methods, and this led to exploration of agglomeration to redundant coarse problems at higher levels [3], as well as redistribution of smaller portions of the coarse problem to enhance scalability [13], leading to approximately 2× speedup in the solve phase.

An alternative to this two-level gather-to-one approach is a more conservative redistribution strategy that can be applied recursively as the problem is coarsened. In single-block structured solvers, the decision to redistribute data is driven by balancing communication costs in relaxation with diminishing local work. This approach was first considered in a structured setting [28], and used a heuristic to guide recursive application of nearest neighbor agglomeration. Later, in the Los Alamos AMG (LAMG) solver, a heuristic was developed to guide the reduction of the number of active cores at each level by a power of two [17]. In this paper, an optimized redistribution algorithm is proposed for robust structured multigrid methods that balances the computation and communication costs at each level. The structured setting enables the enumeration of possible coarse-grid configurations, and a performance model is developed to support optimization of the coarsening path that is selected through these configurations. The utility of this approach is demonstrated through scaling studies extending beyond 100K cores on two modern supercomputers.

The remainder of this paper is organized as follows. Section 2 highlights relevant features of robust variational multigrid methods, examines the domain decomposition implementation in BoxMG, and proposes a redistribution algorithm that can be applied recursively. The performance model for this parallel algorithm with redistribution is developed in Section 3, and the optimization algorithm is presented in Section 4. Scaling studies are presented in Section 5 demonstrating the efficacy of the proposed method, and Section 6 gives conclusions.
2. Robust Variational Multigrid on Structured Grids.
In the case of a symmetric, positive definite matrix problem, a Galerkin (variational) coarse-grid operator is effective in defining the coarse-level problems because it minimizes the error in the range of interpolation [7]. This variational operator is formed as the triple matrix product of the restriction (transpose of interpolation), the fine-grid operator, and the interpolation, making it well suited for black box and algebraic multilevel solver algorithms. In the case of structured grids, the complexity of the coarse-grid operators is bounded, and the stencil pattern can be fixed a priori.

Fig. 1. Weak scaling on Blue Waters for a rectangular and a square local problem size using V(2,1)-cycles with a gather-to-one serial BoxMG coarse-grid solver.

However, in this approach, the interpolation must be sufficiently accurate to ensure the variational coarse-grid operator satisfies the approximation property [27]. For example, in two-dimensional diffusion problems with discontinuous coefficients, bilinear interpolation is not accurate across discontinuities because the gradient of the solution is not continuous. Thus, a key element in robust multigrid methods is operator-induced interpolation [8, 27], which uses the matrix problem to construct intergrid transfer operators. In the case of structured grid problems, operator-induced interpolation is naturally motivated by noting that the normal component of the flux is continuous [10], and its impact on the properties of the variational coarse-grid operator is understood through its connection to homogenization [22, 19]. This approach has natural extensions to non-symmetric problems as well [11, 29].

Robust methods also require careful consideration of the smoothing operator. Although Gauss-Seidel is effective for many problems, anisotropy often demands alternating line smoothing or plane smoothing [8, 27]. With coarse-grid operators and interpolation defined, standard multigrid cycling is used, for example a V-cycle as in Algorithm 1.

In this paper, interpolation and coarse-grid operators are constructed using the implementation in BoxMG [1, 10, 22] with standard coarsening by a factor of 2; however, the approach applies to any structured method. BoxMG is used because its operator-induced interpolation is designed to handle discontinuous coefficients, and additional heuristics ensure optimal performance for Dirichlet, Neumann, and Robin boundary conditions on logically structured grids of any dimension (i.e., it is not restricted to grids of size 2^k + 1). In addition, only the fine-grid problem is specified; coarse-grid operators are constructed through the variational (Galerkin) product. The package is also released as open source in the Cedar Project [21].
Algorithm 1: Multilevel V-cycle

Input: x_{L-1} fine-grid initial guess; b_{L-1} fine-grid right-hand side; A_0, ..., A_{L-1}; P_1, ..., P_{L-1}
Output: x_{L-1} fine-grid iterative solution

for l = L-1, ..., 1 do
    relax(A_l, x_l, b_l, ν_1)          {relax ν_1 times}
    r_l = b_l - A_l x_l                {compute residual}
    b_{l-1} = P_l^T r_l                {restrict residual}
x_0 = solve(A_0, b_0)                  {coarse-grid direct solve}
for l = 1, ..., L-1 do
    x_l = x_l + P_l x_{l-1}            {interpolate and correct}
    relax(A_l, x_l, b_l, ν_2)          {relax ν_2 times}

Fig. 2. Domain partition. Circles represent degrees of freedom: points on process k, halo points needed for communication from other processors, and points on other processors.

The distributed memory parallel implementation of BoxMG divides the problem domain among available processors. The processors are arranged in a structured grid and points in the fine grid problem are divided among processors in each dimension (see Figure 2). Since the computation is structured, inter-process communication occurs through nearest neighbor halo updates with a halo width of 1. An important feature of BoxMG is bounded complexity in the coarse-grid operator. By exploiting the structure of the problem, BoxMG produces coarse-grid problems with a fixed structure. This results in known communication patterns and guaranteed data locality.
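To make the cycling in Algorithm 1 concrete, the following Python/NumPy sketch expresses the same V-cycle with SciPy sparse operators. It is a minimal serial illustration under stated assumptions, not the BoxMG implementation: the operator list A, the interpolation list P, and the relax routine are assumed to be provided by a separate setup phase.

    import numpy as np
    import scipy.sparse.linalg as spla

    def v_cycle(A, P, x, b, relax, nu1=2, nu2=1):
        """One V(nu1, nu2)-cycle following Algorithm 1.
        A[0..L-1]: sparse operators, A[L-1] the finest level.
        P[l]: interpolation from level l-1 to level l (P[0] unused).
        relax(A, x, b, nu): returns x after nu relaxation sweeps."""
        L = len(A)
        xs, bs = [None] * L, [None] * L
        xs[L - 1], bs[L - 1] = x, b
        for l in range(L - 1, 0, -1):            # descend: relax, restrict
            xs[l] = relax(A[l], xs[l], bs[l], nu1)
            r = bs[l] - A[l] @ xs[l]             # residual
            bs[l - 1] = P[l].T @ r               # variational restriction R = P^T
            xs[l - 1] = np.zeros(A[l - 1].shape[0])
        xs[0] = spla.spsolve(A[0].tocsc(), bs[0])   # coarse-grid direct solve
        for l in range(1, L):                    # ascend: correct, relax
            xs[l] = xs[l] + P[l] @ xs[l - 1]
            xs[l] = relax(A[l], xs[l], bs[l], nu2)
        return xs[L - 1]

In the distributed setting described next, each step of this loop additionally triggers the halo exchanges and, near the coarsest levels, the redistribution communication modeled in Section 3.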
Examining the distributed memory parallel implementation of BoxMG at coarse levels motivates redistributing the coarse-grid problem to n tasks, where n is less than the number of processors. The problem then becomes choosing a desirable value of n, which depends on the problem distribution on the processor grid. This agglomeration can be applied recursively to gradually reduce the size of the processor grid as the size of the global coarse-grid problem is reduced. To extend parallel coarsening in a structured setting, the algorithm introduced in this paper aims to redistribute the coarse-grid problem to an incrementally smaller subset of processors. Parallel coarsening continues until a parameterized minimum local problem size is reached. At this point, a set of possible redistributions is enumerated. These redistributions are evaluated based on a cost derived through a performance model. An optimal redistribution sequence is then selected. It is important to note that since a redistributed problem on a given level also limits parallel coarsening, the selection algorithm is designed to be applied recursively to obtain the highest efficiency possible for the multigrid cycle. Section 4.1 discusses the redistribution of a grid of processors to coarser processor grids.

To redistribute the coarse-grid problem on a smaller number of processors, the processor grid is agglomerated into processor blocks. This agglomeration is performed by dimension to maintain a distributed tensor-product grid structure. Each processor block is then mapped to one task in the redistributed solver.

To map the coarse tasks to processors, two approaches are considered. The first approach uses one processor from each processor block. This is visualized in Figure 3, where the following steps are taken:
1. Agglomerate: processors are grouped into blocks to define coarse tasks.
2. Gather: processors within each block perform a gather on the coarse problem.
3. Cycle: cycling continues with the redistributed problem.
4. Scatter: the iterative solution is scattered after the redistributed cycle completes.
The second approach employs redundancy by mapping each processor in the processor block to the coarse task. This is visualized in Figure 4 with the following steps:
1. Agglomerate: processors are grouped into blocks to define coarse tasks.
2. Allgather: processors within each block perform an allgather on the coarse problem.
3. Cycle: cycling continues redundantly with the redistributed problem.
While the second approach avoids an additional communication phase at the end of each cycle and adds an opportunity for resilience through redundant cycling, the first approach avoids the increased network usage involved in simultaneous coarse cycles. Algorithm 2 supplements Algorithm 1 with the steps needed for redistribution. The algorithm is annotated with the parallel communication required for each step.
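As an illustration of the first (non-redundant) mapping, the sketch below groups the ranks of a two-dimensional processor grid into processor blocks and performs the Agglomerate/Gather/Cycle/Scatter steps with mpi4py. The block dimensions, the choice of the block root, and the coarse_cycle routine are assumptions made for this example; it is not the Cedar/BoxMG communication code.

    from mpi4py import MPI

    def redistribute_cycle(comm, coords, block_dims, local_rhs, coarse_cycle):
        """Non-redundant redistribution on a 2D processor grid.
        coords: (i, j) position of this rank in the processor grid.
        block_dims: (bi, bj) ranks agglomerated per block in each dimension.
        coarse_cycle: hypothetical routine that cycles on the gathered pieces
                      and returns one solution piece per block member."""
        bi, bj = block_dims
        # 1. Agglomerate: ranks with the same block index form one coarse task.
        block = (coords[0] // bi, coords[1] // bj)
        block_comm = comm.Split(color=block[0] * 10**6 + block[1],
                                key=comm.Get_rank())
        root = 0
        # 2. Gather: collect the block's right-hand-side pieces on the block root.
        pieces = block_comm.gather(local_rhs, root=root)
        # 3. Cycle: only block roots continue cycling on the redistributed problem.
        sol_pieces = None
        if block_comm.Get_rank() == root:
            sol_pieces = coarse_cycle(pieces)
        # 4. Scatter: return each rank's portion of the iterative solution.
        return block_comm.scatter(sol_pieces, root=root)

The redundant variant of Figure 4 would replace the gather/scatter pair with a single allgather so that every rank in the block cycles on a copy of the agglomerated problem.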
Fig. 3. A redistribution of a processor grid into processor blocks (Gather/Scatter). The boxes represent the processing elements and the circles represent their respective coarse-grid local problems.

Fig. 4. A redundant redistribution of a processor grid into processor blocks (Allgather). The boxes represent the processing elements and the circles represent their respective coarse-grid local problems.
3. Performance Model.
In this section, a performance model is introduced for the BoxMG V-cycle (see Algorithm 2). The model helps identify the parallel performance limitations of the V-cycle, particularly at coarse levels in the multigrid hierarchy, and also provides a cost metric that is used to guide the coarse-level redistribution algorithm introduced in Section 4.

Algorithm 2: Multilevel V-cycle with Redistribution

Input: x_{L-1} fine-grid initial guess; b_{L-1} fine-grid right-hand side; A_0, ..., A_{L-1}; P_1, ..., P_{L-1}; p_0, ..., p_{L-1} number of processors on each level
Output: x_{L-1} fine-grid iterative solution

for l = L-1, ..., 1 do
    relax(A_l, x_l, b_l, ν_1)                            {halo exchange}
    r_l = b_l - A_l x_l                                  {halo exchange}
    b_{l-1} = P_l^T r_l
    if p_l > p_{l-1} then gather_rhs(b_{l-1}, p_l, p_{l-1})    {local gather}
x_0 = solve(A_0, b_0)
for l = 1, ..., L-1 do
    if p_l > p_{l-1} then scatter_sol(x_{l-1}, p_l, p_{l-1})   {local scatter}
    x_l = x_l + P_l x_{l-1}                              {halo exchange}
    relax(A_l, x_l, b_l, ν_2)                            {halo exchange}

A key kernel in the V-cycle is matrix-vector multiplication. Since the stencil-based computations for this operation are relatively uniform, the cost of parallel communication in matrix-vector multiplication is accurately modeled with a postal model [4], leading to a total cost of

(1)    T = n_f · γ  (computation)  +  α + m · β  (communication),

with n_f denoting the number of floating point operations, γ a measure of the computation rate or inverse effective FLOP rate, α the interprocessor latency, 1/β the network bandwidth, and m the number of bytes in an MPI message. The value γ is determined by measuring the computation time of a local stencil-based matrix-vector product, while α and β are determined through standard machine benchmarks such as mpptest. As an example, the parameters derived for a 9-point 2D stencil on Blue Waters, a Cray XE6 machine at the National Center for Supercomputing Applications (https://bluewaters.ncsa.illinois.edu/blue-waters), are listed in Table 1. A more accurate model for communication may be used, particularly for multiple communicating cores with large message sizes [15] or to account for network contention [6], which may play a prominent role in communication.

In an L-level multigrid V-cycle, Algorithm 1 is modeled through

(2)    T_V-cycle = T_cgsolve + Σ_{l=1}^{L-1} ( T^l_smooth + T^l_residual + T^l_restrict + T^l_interp + T^l_agglomerate ).
Table 1
Model parameters on Blue Waters.

    α :  µs
    β :  5.65 ns/B
    γ :  0.44 ns/flop
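To indicate how the entries of Table 1 enter the postal model (1), the sketch below estimates γ by timing a local 9-point stencil matrix-vector product and then evaluates the single-message cost of equation (1). The grid size, trial count, and the default α value are placeholders, not measured values; in practice α and β come from machine benchmarks such as mpptest or b_eff, and β defaults here to the Table 1 value.

    import time
    import numpy as np

    def estimate_gamma(nx=568, ny=71, trials=20):
        """Time a local 9-point stencil matvec; return seconds per flop."""
        u = np.random.rand(nx + 2, ny + 2)       # interior plus width-1 halo
        w = np.random.rand(9, nx, ny)            # one 9-point stencil per point
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 0),
                   (0, 1), (1, -1), (1, 0), (1, 1)]
        start = time.perf_counter()
        for _ in range(trials):
            v = sum(w[k] * u[1 + di:nx + 1 + di, 1 + dj:ny + 1 + dj]
                    for k, (di, dj) in enumerate(offsets))
        elapsed = (time.perf_counter() - start) / trials
        return elapsed / (2 * 9 * nx * ny)       # roughly 2 flops per stencil weight

    def postal_cost(n_flops, msg_bytes, gamma, alpha=2.0e-6, beta=5.65e-9):
        """Equation (1): T = n_f*gamma + alpha + m*beta.
        alpha here is a placeholder latency; beta is the Table 1 value in s/B."""
        return n_flops * gamma + alpha + msg_bytes * beta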
In the following expressions, each component of (2) represents the actual implementation and does not necessarily form a minimum bound on each portion of the computation. In the model parameters below, D denotes the number of dimensions in the problem, n^l_d the number of local grid points in dimension d on level l, and n_s the number of points in the stencil. Grid quantities involved in communication and computation are assumed to be 8-byte double-precision floating-point numbers. The dimension of the parallel decomposition is also assumed to match the dimension of the problem. The communication required for a halo exchange of width 1 in D dimensions on level l is modeled as

(3)    T^l_exchange(D) = 2·D·α + 2·Σ_{d=0}^{D-1} n^l_d · 8 · β.

Smoothing, using Gauss-Seidel with n_c colors, results in

(4)    T^l_smooth = 2·n_s·Π_{d=0}^{D-1} n^l_d · (ν_1 + ν_2) · γ + n_c·(ν_1 + ν_2)·T^l_exchange(D).

Likewise, the residual computation is

(5)    T^l_residual = 2·n_s·Π_{d=0}^{D-1} n^l_d · γ + T^l_exchange(D).

Since it is unnecessary to communicate halo regions in restriction, the computation is entirely local, leading to

(6)    T^l_restrict = 2·n_s·Π_{d=0}^{D-1} n^l_d · γ.

Following [12], the interpolation and correction computes

(7)    u^l ← u^l + I^l_{l-1} u^{l-1} + r^l / C,

where r^l is the previously computed residual and C is the center (diagonal) stencil coefficient of the operator. Interpolation for edges (see Figure 5) yields

(8)    u^l ← u^l + ω_w u^{l-1} + ω_e u^{l-1} + r^l / C

for the x-direction (and similarly for the y-direction). Likewise, the interior stencil (see Figure 5) yields

(9)    u^l ← u^l + ω_sw u^{l-1} + ω_se u^{l-1} + ω_nw u^{l-1} + ω_ne u^{l-1} + r^l / C.
Fig. 5. Interpolation computation in 2D. The top two grids show computation performed on the first coarse line in each dimension. With the exception of the first, each coarse point performs injection to the corresponding fine point. Interpolation to the preceding fine point embedded in the coarse line is then computed using the surrounding coarse points. The bottom three grids show computation performed for interior coarse points. For each coarse point, the corresponding fine point is injected. The preceding coarse points in each dimension are then interpolated. Finally, the logical cell center preceding the coarse point is interpolated using the surrounding coarse points.
Here interpolation is given as weights stored at the coarse points. For example, the interpolation stencil stored at a given coarse point is

    ω_nw  ω_n  ω_ne
    ω_w        ω_e
    ω_sw  ω_s  ω_se .

In total (with injection), interpolation in 2D is modeled as

(10)    T^l_interp = ( Π_{d=0}^{1} n^l_d + 20·Π_{d=0}^{1} n^{l-1}_d + 6·Σ_{d=0}^{1} n^{l-1}_d ) · γ + T^l_exchange(2).

Using a similar derivation, interpolation in 3D is

(11)    T^l_interp = ( Π_{d=0}^{2} n^l_d + 60·Π_{d=0}^{2} n^{l-1}_d + 15·n^{l-1}·n^{l-1} + 6·n^{l-1}·n^{l-1} + n^{l-1} ) · γ + T^l_exchange(3).

To agglomerate the coarse problem, the right-hand side is gathered within processor blocks before the redistributed cycle begins and the approximate solution is scattered after the redistributed cycle completes. This time is given by

(12)    T^l_agglomerate = T^l_gather_rhs + T^l_scatter_sol   if p^l > p^{l-1},   and 0 otherwise,

where T^l_gather_rhs and T^l_scatter_sol represent the time to gather and scatter the right-hand side and solution, as follows. The number of processors within a processor block and the local problem size for a processor in a processor block are given as

    p^l_block = Π_{d=0}^{D-1} ⌈ p^{l+1}_d / p^l_d ⌉,    n^l_block = Π_{d=0}^{D-1} ⌈ N^l_d / p^l_d ⌉.

Whether an MPI allgather or a gather/scatter is used depends on whether the redistribution is redundant. Following [26], the cost of these collective operations is given by

(13)    T^l_gather_rhs = ⌈log_2(p^l_block)⌉ · α + n^l_block · ((p^l_block − 1) / p^l_block) · β,

(14)    T^l_scatter_sol = T^l_gather_rhs for a non-redundant redistribution, and 0 otherwise.

Finally, the time for a coarse solve is the time to agglomerate to one processor combined with the cost of a Cholesky direct solve:

(15)    T_cgsolve = T^0_agglomerate + ( Π_{d=0}^{D-1} N_d ) · γ.

This assumes the Cholesky factorization has been computed and stored in the setup phase.

This performance model provides a basic predictive model to begin exploring the guided redistribution algorithm proposed in this paper.
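Combining the component models, the sketch below evaluates equations (3)-(6) and (10) for a 2D local grid and sums them over levels as in equation (2). It is a simplified rendering for the non-redistributed case: the agglomeration terms (12)-(14) are omitted, the coarse-solve time is passed in, and coarse local extents are approximated as ⌈n/2⌉, which are assumptions made for this illustration.

    def t_exchange(n_local, D, alpha, beta):
        """Width-1 halo exchange, equation (3): 2*D messages of 8-byte values."""
        return 2 * D * alpha + 2 * sum(n_local) * 8 * beta

    def t_level_2d(n_local, ns, nu1, nu2, nc, gamma, alpha, beta):
        """Per-level smoothing, residual, restriction, and interpolation costs,
        equations (4)-(6) and (10), for a 2D local grid n_local = (n0, n1).
        ns: stencil points, nc: Gauss-Seidel colors, nu1/nu2: sweeps."""
        n0, n1 = n_local
        pts = n0 * n1
        ex = t_exchange(n_local, 2, alpha, beta)
        smooth = 2 * ns * pts * (nu1 + nu2) * gamma + nc * (nu1 + nu2) * ex
        residual = 2 * ns * pts * gamma + ex
        restrict = 2 * ns * pts * gamma
        nc0, nc1 = (n0 + 1) // 2, (n1 + 1) // 2      # assumed coarse extents
        interp = (pts + 20 * nc0 * nc1 + 6 * (nc0 + nc1)) * gamma + ex
        return smooth + residual + restrict + interp

    def t_vcycle(levels, t_cgsolve, **params):
        """Equation (2): coarse solve plus the sum of per-level costs.
        levels: list of 2D local sizes for levels 1 through L-1."""
        return t_cgsolve + sum(t_level_2d(n, **params) for n in levels)

A model of this form is what the redistribution search in Section 4 queries when assigning a predicted cost to each candidate coarse processor grid.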
4. Optimized Parallel Redistribution Algorithm.

4.1. Coarse Processor Grid Enumeration.
To enumerate potential redistributions, the fine-grid tasks described by the processor grid are agglomerated into coarser tasks called processor blocks. In agglomerating by dimension, the processor blocks form a coarser tensor-product grid. The coarse processor grids are enumerated by beginning with a 1-D grid and refining greedily by dimension with respect to the agglomerated local problem size. Using a refinement factor of 2, the number of enumerated processor grids is bounded by ⌈log_2(n_p)⌉. Table 2 illustrates this enumeration using a fine-grid problem of 9088 × 568 degrees of freedom with a 16 × 8 processor grid.

Table 2
Example redistribution enumeration using a fine grid problem of 9088 × 568 degrees of freedom with a 16 × 8 processor grid. The global coarse-grid considered for agglomeration contains 1136 × 71 degrees of freedom.

    Redistribution    Agglomerated Local Problem
    1 × 1             1136 × 71
    2 × 1             568 × 71
    4 × 1             284 × 71
    8 × 1             142 × 71
    16 × 1            71 × 71
    16 × 2            71 × 36
    16 × 4            71 × 18
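A sketch of this enumeration is given below: starting from a single task, the processor grid is refined by a factor of two in the dimension with the largest agglomerated local extent, never exceeding the original processor grid. It assumes a 2D problem with power-of-two processor grid dimensions and illustrates the enumeration rule rather than reproducing the Cedar routine.

    import math

    def enumerate_coarse_grids(proc_grid, coarse_global):
        """Enumerate candidate coarse processor grids for a 2D problem.
        proc_grid: current processor grid, e.g. (16, 8).
        coarse_global: global coarse-grid size, e.g. (1136, 71).
        Returns (grid, agglomerated local problem) pairs."""
        grid = [1, 1]
        candidates = []
        while tuple(grid) != tuple(proc_grid):
            local = tuple(math.ceil(N / p) for N, p in zip(coarse_global, grid))
            candidates.append((tuple(grid), local))
            # refine by 2 in the dimension with the largest local extent,
            # among dimensions that can still be refined
            options = [d for d in range(2) if grid[d] * 2 <= proc_grid[d]]
            d = max(options, key=lambda d: local[d])
            grid[d] *= 2
        return candidates

Applied to a 16 × 8 processor grid and a 1136 × 71 coarse problem, this reproduces the candidate grids listed in Table 2.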
The recursive enumeration of coarse processor grids generates a search space of possible redistributions. To find an optimal redistribution, a state in the search space is given by the size of the processor grid and the size of the coarse-grid associated with a distributed solver. The search space can be viewed as a directed graph; an example is given in Figure 6. The initial state is given by the top-level distributed solver and the goal state is the state with a 1 × 1 processor grid.

Fig. 6. Example redistribution search space. The fine grid global problem is 9088 × 568 degrees of freedom with a 16 × 8 processor grid, yielding a fine grid local problem of 568 × 71. Each state pairs a processor grid with a coarse-grid size, from the initial state (16 × 8, 1136 × 71) down to the goal state with a 1 × 1 processor grid. The optimal path through this redistribution space is highlighted.

To search the state space, a path cost function is defined as

(16)    f(s) = g(s) + h(s),

where s is a vertex or state in the graph in Figure 6, g is the cost to reach s from the initial state, and h is an estimate of the cost to reach the goal state from state s. That is, g(s) represents the time predicted by the performance model for a solve phase executed down to coarse-grid state s. For the heuristic function h(s), a weighted combination of the coarse-grid problem and the processor grid size is used to predict the cost. If the performance model is an accurate predictive model, an inexpensive path from the initial state to the goal state will identify a redistributed solver with an efficient run-time. With the path cost function defined, the space of redistributions is searched for an optimal path. This is performed in the multigrid setup phase to dictate agglomeration when a coarsening limit is reached. Using a brute force approach by searching every path for the best redistribution strategy incurs an O(n_p) cost, since the redistribution search space is constructed recursively. Letting h = log_2(n_p), and using the above coarse processor grid enumeration, results in

(17)    T(0) = 1,    T(h) = Σ_{k=0}^{h-1} T(k) + 1.

Since 2^0 = T(0) = 1 and since

    T(h+1) = Σ_{k=0}^{h} T(k) + 1 = T(h) + Σ_{k=0}^{h-1} T(k) + 1 = 2^h + 2^h = 2^{h+1},

this leads to T(h) = 2^h = 2^{log_2(n_p)} = n_p.

To address the O(n_p) cost, the A* algorithm is used to determine the optimal path. While the worst case complexity for A* is O(n_p) for this search, the cost with an optimal heuristic with evaluation cost O(1) is the length of the solution path. The length of this path in the redistribution search is O(log_2(n_p)). Figure 7 shows that the A* heuristic is effective in avoiding the brute force complexity, but does not reach the optimal complexity.

Fig. 7. Performance of the redistribution search algorithms (brute force and A*) under weak scaling, compared with O(log_2(n_p)) and O(n_p) reference curves.
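The search itself can be written compactly with a best-first priority queue. The sketch below applies A* with the path cost f(s) = g(s) + h(s) of (16) to redistribution states of the form (processor grid, coarse-grid size); the neighbor generation, the model-based edge cost, and the heuristic are supplied by the caller, so their particular definitions are assumptions rather than the implementation used in the experiments.

    import heapq

    def astar_redistribution(start, neighbors, edge_cost, heuristic):
        """A* over redistribution states (processor grid, coarse-grid size).
        neighbors(s): candidate redistributions of state s (empty at the goal).
        edge_cost(s, t): model-predicted time added by moving from s to t.
        heuristic(s): estimated remaining cost to reach the 1 x 1 goal state."""
        frontier = [(heuristic(start), 0.0, start, [start])]
        best_g = {start: 0.0}
        while frontier:
            f, g, state, path = heapq.heappop(frontier)
            if g > best_g.get(state, float("inf")):
                continue                          # stale queue entry
            if state[0] == (1, 1):                # goal: 1 x 1 processor grid
                return path, g
            for nxt in neighbors(state):
                g_next = g + edge_cost(state, nxt)
                if g_next < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g_next
                    heapq.heappush(frontier, (g_next + heuristic(nxt), g_next,
                                              nxt, path + [nxt]))
        return None, float("inf")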
5. Experimental results.
To explore the performance of the proposed redistribution algorithm, a standard 5-point finite volume discretization of the diffusion equation is used, with a compensating anisotropy defined in order to focus on the parallel scalability of the coarse-grid redistribution algorithm (rather than the well-established convergence rate of the multigrid algorithm itself). Specifically, the following diffusion problem is used in these numerical experiments,

(18)    −∇·(D∇u) = f(x, y)   in Ω = (0, 1) × (0, 1),
(19)    u = 0                 on ∂Ω,

where the diffusion tensor is D = diag[r_1, r_2] with r_1/r_2 ≈ 16. This compensating anisotropy results in optimal convergence of multigrid V-cycles with point-wise smoothing.
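For reference, a minimal assembly of the 5-point operator for (18)-(19) on a uniform grid is sketched below. The coefficient values r1 and r2 and the grid spacings are inputs; this shows only the kind of operator the scaling studies use and is not the test harness from the experiments.

    import scipy.sparse as sp

    def five_point_diffusion(nx, ny, r1, r2, hx=None, hy=None):
        """Assemble -div(D grad u) with D = diag[r1, r2] and homogeneous
        Dirichlet boundary conditions on the unit square (5-point stencil)."""
        hx = hx if hx is not None else 1.0 / (nx + 1)
        hy = hy if hy is not None else 1.0 / (ny + 1)
        cx, cy = r1 / hx**2, r2 / hy**2
        # 1D second-difference operators in each dimension
        Tx = sp.diags([-cx, 2 * cx, -cx], [-1, 0, 1], shape=(nx, nx))
        Ty = sp.diags([-cy, 2 * cy, -cy], [-1, 0, 1], shape=(ny, ny))
        # 2D operator as the Kronecker sum of the two 1D operators
        return sp.kronsum(Tx, Ty, format="csr")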
Both the weak and strong scalability of multigrid V-cycles that use the proposed redistribution algorithm are important for applications on exascale systems. Here, the discretization of the diffusion problem given above is used for both weak and strong scaling studies, and the algorithmic components of the BoxMG multigrid library (e.g., interpolation, restriction) are used in the redistributed multigrid V-cycles.

Since the cost of coarsening (with redistribution) and the cost of the coarse-grid problem are dominated by parallel communication, network speed and machine topology play an important role in timings. To explore this dependency, two different petascale systems are considered in the scaling tests:

Mira: An IBM Blue Gene/Q system at Argonne National Laboratory. Mira uses an IBM network which comprises a 5D torus with 49,152 compute nodes using PowerPC A2 processors.

Blue Waters: A Cray XE system at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. Blue Waters employs a 3D torus using a Cray Gemini interconnect and has 22,640 XE compute nodes, each with two AMD Interlagos processors.

In each case, the machine parameters used in the performance model of Section 3 are determined using the b_eff benchmark [24].

Weak Scalability.
In this section, a weak scaling study is conducted to highlight the scalability of the redistribution algorithm at large core counts. For the numerical experiments below, 10 V(2,1)-cycles are executed using two different local problem sizes: 568 × 71 = 40,328 and 288 × 36 = 10,368.
Fig. 8. Weak scaling on Mira with local problem size 568 × 71.

Fig. 9. Weak scaling on Mira with local problem size 288 × 36.

With different network capacities, redundant redistribution may not yield the lowest communication costs. Indeed, the results in Figure 10 for the Blue Waters system show that while redundant redistribution of the data at coarse levels is inexpensive, the network contention introduced by redundant cycling contributes to an increase in communication at high core counts. This suggests that triggering redundancy on a per-level basis could lead to reduced costs. In contrast, non-redundant redistribution (in Figure 10) exhibits high scalability; thus, it is used for the following runs on Blue Waters.

Figures 11 and 12 show run times on Blue Waters for the multigrid solve phase, highlighting that the proposed algorithm achieves good parallel scalability here as well. However, comparing runs on the two machines, it is apparent that the weak scalability is superior on Mira. This is attributed to the network capabilities and scheduling differences of the two machines. Jobs on Mira receive a dedicated, full torus partition of the machine, which often reduces contention resulting from neighboring jobs.
Fig. 10. Weak scaling on Blue Waters, showing the significant improvement of non-redundant redistribution compared to redundant redistribution beyond 8192 cores.
Fig. 11. Weak scaling on Blue Waters with local problem size 568 × 71.

Moreover, the lower-dimensional 3D torus on Blue Waters also contributes to an increase in contention within the running job.

Nevertheless, on Blue Waters the communication cost for redistribution remains relatively low in comparison to smoothing, which remains the dominant kernel in the overall solve. Figure 13 decomposes the cost of smoothing into communication and computation. This decomposition indicates that the communication cost of the halo exchange is the primary contributor to the reduced scalability.

In contrast to the network advantages of Mira, lower floating-point performance is observed for the key computational kernels on Mira than on Blue Waters. This difference, combined with an extra core dedicated to operating system functions, yields more predictable performance and superior scaling behavior on Mira. This is also observed for a variety of applications [18]: a loss in performance on the Cray XE6 in comparison to BG/Q systems as the core count increases. However, with the data locality and fixed communication patterns of the algorithm, the network performance on Blue Waters is good enough to let its superior floating-point performance yield the best time to solution, solving approximately 2.5 times faster at 131K cores.
Fig. 12. Weak scaling on Blue Waters with local problem size 288 × 36.

Fig. 13. Weak scaling of the smoothing routine on Blue Waters.

Strong Scalability.
Turning to strong scalability, an isotropic model diffusion problem is set up on a 3200 × 3200 grid and run on 64 to 16,384 cores of Blue Waters (Figures 14 and 15).
Fig. 14. Strong scaling on Blue Waters with problem size 3200 × 3200.
Fig. 15. Strong scaling of the smoothing routine on Blue Waters with problem size 3200 × 3200.

Like smoothing, the interpolation and residual operations each require only a single halo exchange. This leads to strong scaling curves for these operations that are very similar to smoothing, but much faster in absolute terms. In contrast, the restriction has no communication, and would exhibit perfect strong scaling in the absence of redistribution. With redistribution, fewer cores are used at coarser levels, and strong scaling begins to degrade at 2.5K dof per core, although speedup is still realized even at 625 dof per core. This observation highlights the complexity of trade-offs in multilevel algorithms on modern systems, and the important role of performance models in design and runtime optimization.

To understand the impact of the redistribution path on solve time, the 568 × 71 weak scaling problem used in Figure 11 with 2048 processors is executed over a variety of redistribution choices. The selected redistribution paths are listed in Table 3. Path 1 indicates the optimal redistribution path used in Figure 11 and is selected by the algorithm from Section 4. In stark contrast to Path 1 is Path 0, which is the original all-to-one redistribution path that is known to scale poorly. The remaining paths highlight the flexibility in the agglomeration sequence.
Table 3
Example redistribution paths using a 64 × 32 processor grid and a 568 × 71 local problem size. Path 0 is the original all-to-one redistribution (64 × 32 → 1 × 1); Paths 1 through 8 redistribute the processor grid incrementally, through sequences of two to six redistribution steps.
The bar graph in Figure 16 compares the run times and predicted times of the multigrid solve phase for the various redistribution paths shown in Table 3. The predicted times are computed using the performance model from Section 3, and are in good agreement with the actual run times. Comparing the run times of Path 0 and Path 1, this graph shows a 40 times speedup of the new redistribution strategy over the original all-to-one strategy. In addition, Path 1 has the lowest run time of several plausible alternatives to the all-to-one strategy. Thus, Figure 16 demonstrates the utility of using a global search guided by a performance model as a predictive tool for choosing optimal redistribution paths. Moreover, using this strategy to select the redistribution path delivered effective weak parallel scaling out to 500K+ cores on Mira, and out to 132K+ cores on Blue Waters, while the original code was essentially unusable beyond 4K cores.

Fig. 16. Model and run times on Blue Waters of the various redistribution paths, using a 64 × 32 processor grid and a 568 × 71 local problem size.
6. Conclusions.
Emerging architectures place an increasing importance on data locality and minimizing data movement. In these environments, structured approaches benefit from predictable memory access patterns and from avoiding indirect addressing. This motivates the development of methods that exploit local structure to avoid incurring the performance consequences of full algebraic generality. A robust structured single-block multigrid implementation that scales well on modern architectures is an important step toward this goal.

In this paper, a new optimized recursive agglomeration algorithm for redistributing coarse-grid work in structured multilevel solvers is introduced. This algorithm combines a predictive performance model with a structure-exploiting recursive enumeration of coarse processor grids to enable a global search for the optimal agglomeration strategy. This approach significantly improves the weak parallel scalability of robust, structured multigrid solvers such as BoxMG. In this study using BoxMG operators, the new algorithm delivers very good weak scaling up to 524,288 cores on an IBM BG/Q (Mira), and reasonable weak scaling up to 131,072 cores on a Cray XE (Blue Waters). The speedup over the previous all-to-one agglomeration approach is significant even at modest core counts: 40 times speedup on just 2048 cores and 144 times speedup on 8192 cores. At larger core counts, the all-to-one agglomeration approach became infeasible as the increased memory requirements for the coarse-grid problem exceeded available memory.

In addition, the strong scaling of multigrid solves using this redistribution algorithm is demonstrated on Blue Waters. Overall, the scaling limit is observed to be approximately 10K dofs per core, which is similar to results obtained in other studies and is dominated by the communication cost of the smoother. Future work on additive variants of multigrid and improvements to the performance model to more accurately capture deep memory hierarchies are needed to reduce this limit.

To focus on coarse-grid redistribution, only pointwise Gauss-Seidel relaxation is considered in this paper. In the future, smoothers that may enhance performance, such as hybrid or polynomial smoothers, or enhance robustness, such as line and plane smoothers [2], may be considered and their impact on optimal coarse-grid redistribution explored.
Acknowledgments.
This work was carried out under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy, at the University of Illinois at Urbana-Champaign under Award Number DE-NA0002374, and at Los Alamos National Laboratory under Contract Number DE-AC52-06NA25396, and was partially supported by the Advanced Simulation and Computing / Advanced Technology Development and Mitigation Program. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
REFERENCES

[1] R. E. Alcouffe, A. Brandt, J. E. Dendy, and J. W. Painter, The multi-grid method for the diffusion equation with strongly discontinuous coefficients, SIAM J. Sci. Stat. Comput., 2 (1981), pp. 430–454.
[2] T. M. Austin, M. Berndt, and J. D. Moulton, A memory efficient parallel tridiagonal solver, Tech. Report LA-UR 03-4149, Mathematical Modeling and Analysis Group, Los Alamos National Laboratory, Los Alamos, NM, 2004.
[3] A. H. Baker, R. D. Falgout, H. Gahvari, T. Gamblin, W. Gropp, T. V. Kolev, K. E. Jordan, M. Schulz, and U. M. Yang, Preparing algebraic multigrid for exascale, Tech. Report LLNL-TR-533076, Lawrence Livermore National Laboratory, Lawrence Livermore, CA, 2012.
[4] A. Bar-Noy and S. Kipnis, Designing broadcasting algorithms in the postal model for message-passing systems, in Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '92, New York, NY, USA, 1992, ACM, pp. 13–22, https://doi.org/10.1145/140901.140903.
[5] A. Bienz, R. D. Falgout, W. Gropp, L. N. Olson, and J. B. Schroder, Reducing parallel communication in algebraic multigrid through sparsification, SIAM J. Sci. Comput., 38 (2016), pp. S332–S357, https://doi.org/10.1137/15M1026341.
[6] A. Bienz, W. D. Gropp, and L. Olson, Topology-aware performance modeling of parallel SpMVs, http://meetings.siam.org/sess/dsp_talk.cfm?p=75934.
[7] A. Brandt, S. F. McCormick, and J. W. Ruge, Algebraic multigrid (AMG) for sparse matrix equations, Sparsity and Its Applications, (1984).
[8] W. L. Briggs, V. E. Henson, and S. F. McCormick, A Multigrid Tutorial: Second Edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[9] G. L. Delzanno, L. Chacón, J. M. Finn, Y. Chung, and G. Lapenta, An optimal robust equidistribution method for two-dimensional grid adaptation based on Monge-Kantorovich optimization.
[10] J. E. Dendy, Black box multigrid, J. Comput. Phys., 48 (1982), pp. 366–386.
[11] J. E. Dendy, Black box multigrid for nonsymmetric problems, Appl. Math. Comput., 13 (1983), pp. 261–283.
[12] J. E. Dendy and J. D. Moulton, Black box multigrid with coarsening by a factor of three, Numer. Lin. Alg. Appl., 17 (2010), pp. 577–598, https://doi.org/10.1002/nla.705.
[13] H. Gahvari, W. Gropp, K. E. Jordan, M. Schulz, and U. M. Yang, Systematic reduction of data movement in algebraic multigrid solvers, in Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW '13, Washington, DC, USA, 2013, IEEE Computer Society, pp. 1675–1682.
[14] W. Gropp, Parallel computing and domain decomposition, in Domain Decomposition Methods for Partial Differential Equations, Norfolk, VA, USA, 1992, Society for Industrial and Applied Mathematics, pp. 349–361.
[15] W. Gropp, L. N. Olson, and P. Samfass, Modeling MPI communication performance on SMP nodes: Is it time to retire the ping pong test, in Proceedings of the 23rd European MPI Users' Group Meeting, EuroMPI 2016, New York, NY, USA, 2016, ACM, pp. 41–50, https://doi.org/10.1145/2966884.2966919.
[16] G. E. Hammond, P. C. Lichtner, C. Lu, and R. T. Mills, PFLOTRAN: Reactive flow and transport code for use on laptops to leadership-class supercomputers, in Groundwater Reactive Transport Models, F. Zhang, G. Yeh, and J. C. Parker, eds., Bentham Science Publishers, Sharjah, UAE, 2012, pp. 141–159, https://doi.org/10.2174/97816080530631120101.
[17] W. Joubert and J. Cullum, Scalable algebraic multigrid on 3500 processors, Electron. T. Numer. Ana., 23 (2006), pp. 105–128.
[18] D. J. Kerbyson, K. J. Barker, A. Vishnu, and A. Hoisie, Comparing the performance of Blue Gene/Q with leading Cray XE6 and InfiniBand systems, Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS, (2012), pp. 556–563, https://doi.org/10.1109/ICPADS.2012.81.
[19] S. P. MacLachlan and J. D. Moulton, Multilevel upscaling through variational coarsening, Water Resour. Res., 42 (2006), W02418, https://doi.org/10.1029/2005WR003940.
[20] S. P. MacLachlan, J. M. Tang, and C. Vuik, Fast and robust solvers for pressure-correction in bubbly flow problems, J. Comput. Phys., 227 (2008), pp. 9742–9761.
[21] D. Moulton, L. N. Olson, and A. Reisner, Cedar framework, 2017, https://github.com/cedar-framework/cedar. Version 0.1, release paperwork in process at LANL.
[22] J. D. Moulton, J. E. Dendy, and J. M. Hyman, The black box multigrid numerical homogenization algorithm, J. Comput. Phys., 141 (1998), pp. 1–29.
[23] K. Nakajima, Optimization of serial and parallel communications for parallel geometric multigrid method, in 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Dec 2014, pp. 25–32.
[24] R. Rabenseifner, Effective bandwidth (b_eff) benchmark.
[25] H. Sundar, G. Biros, C. Burstedde, J. Rudi, O. Ghattas, and G. Stadler, Parallel geometric-algebraic multigrid on unstructured forests of octrees, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, IEEE Computer Society Press, 2012, p. 43.
[26] R. Thakur, R. Rabenseifner, and W. Gropp, Optimization of collective communication operations in MPICH, International Journal of High Performance Computing Applications, 19 (2005), pp. 49–66.
[27] U. Trottenberg and A. Schuller, Multigrid, Academic Press, Inc., Orlando, FL, USA, 2001.
[28] D. E. Womble and B. C. Young, A model and implementation of multigrid for massively parallel computers, International Journal of High Speed Computing, 2 (1990), pp. 239–255.
[29]