A highly scalable approach to solving linear systems using two-stage multisplitting
Nick Brown∗, J. Mark Bull, and Iain Bethune
Edinburgh Parallel Computing Centre, James Clerk Maxwell Building, Kings Buildings, Edinburgh

∗ Corresponding author: +44 (0) 131 650 6420, [email protected]
Abstract
Iterative methods for solving large sparse systems of linear equations are widely used in many HPC applications. Extreme scaling of these methods can be difficult, however, since global communication to form dot products is typically required at every iteration.

To try to overcome this limitation we propose a hybrid approach, where the matrix is partitioned into blocks. Within each block, we use a highly optimised (parallel) conventional solver, but we then couple the blocks together using block Jacobi or some other multisplitting technique that can be implemented in either a synchronous or an asynchronous fashion. This allows us to limit the block size to the point where the conventional iterative methods no longer scale, and to avoid global communication (and possibly synchronisation) across all processes.

Our block framework has been built to use PETSc, a popular scientific suite for solving sparse linear systems, as the synchronous intra-block solver, and we demonstrate results on up to 32768 cores of a Cray XE6 system. At this scale, the conventional solvers are still more efficient, though trends suggest that the hybrid approach may be beneficial at higher core counts.
Keywords.
Asynchronous Jacobi, multisplitting, asynchronous block Jacobi, linear solvers, PETSc, MPI
Introduction

As systems approach Exaflop performance, the core counts which are typically used to solve problems will increase dramatically. To exploit Exascale resources effectively, existing algorithms and implementations must be re-examined to break the tight coupling that often exists between parallel processes. We concentrate on algorithms for the solution of linear equations
$Ax = b$, where $A$ is an $n \times n$ matrix and both $x$ and $b$ are vectors. Solving such large, sparse linear systems is critical in many fields of scientific computing and so solution methods which can efficiently run at Exascale are crucial.

In Bethune et al. [1] we have shown that for iterative solution methods, instead of relying on synchronous communication for exchanging updated vector information, one can gain an improvement in performance and resilience at large core counts by using asynchronous communication, where processes need not wait for their neighbours at each iteration and the calculation is therefore not globally synchronised. However, our experiments only considered a point Jacobi iterative solution method which typically converges very slowly.

In this paper we extend our work by applying Jacobi's algorithm, and other multisplitting techniques, at the block level using asynchronous communication. Within each block we use much more computationally efficient Krylov Subspace (KS) methods in an attempt to combine the advantages of scalability and resilience from the asynchronous methods with the computational efficiency of KS methods. We investigate the performance and scalability of these different approaches on up to 32768 cores on a Cray XE6, and consider the effects on performance of tuning various parameters in the hybrid algorithm.

The rest of the paper is organised as follows: Section 2 reviews the existing algorithms, and in Section 3 we present our hybrid algorithm. Section 4 reports experimental results, and Section 5 draws some conclusions and considers future work.

Existing algorithms

Jacobi's algorithm is the simplest iterative solution method. Whilst the convergence rate of this algorithm is inferior to other, more complex methods, the fact that the only global communication required is in computing the residual makes it highly scalable. This aspect of the algorithm makes it of interest for Exascale, where problems must be decomposed over very many cores.

For a linear system, $Ax = b$, one starts with a trial solution $x^{(0)}$ and generates new solutions, iteratively, according to

$$x_i^{(k)} = \frac{1}{a_{ii}} \Big( b_i - \sum_{j \neq i} a_{ij} x_j^{(k-1)} \Big)$$

where $k$ is the iteration number. One stops iterating when the norm of the global residual is smaller than a specific tolerance, $\|Ax - b\| < tol$.

If $A$ represents a discretised local operator, resulting from the application of a small stencil over a domain, then the parallel implementation is very simple. In this paper we solve the 3D Laplace equation, $\nabla^2 u = 0$, using a seven point stencil, and the Jacobi algorithm can be written in pseudo code as:

    for all grid points (i, j, k):
        unew(i,j,k) = 1/6 * ( u(i+1,j,k) + u(i-1,j,k)
                            + u(i,j+1,k) + u(i,j-1,k)
                            + u(i,j,k+1) + u(i,j,k-1) )
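As a concrete illustration of this update and the stopping test $\|Ax - b\| < tol$, the following is a minimal serial sketch of the stencil sweep above (it is not the paper's parallel MPI implementation; the grid size, boundary values and tolerance are illustrative assumptions only):

    # Serial sketch of Jacobi sweeps for the 3D Laplace equation with a
    # seven point stencil; Dirichlet boundary values are held in the outer
    # layer of the array.  Grid size and tolerance are illustrative only.
    import numpy as np

    n, tol = 20, 1e-6                      # interior points per dimension, tolerance
    u = np.zeros((n + 2, n + 2, n + 2))    # solution including the boundary layer
    u[0, :, :] = 1.0                       # an arbitrary non-zero Dirichlet face

    for it in range(10000):
        # Sum of the six neighbours of every interior point
        nbr = (u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +
               u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +
               u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2])
        unew = nbr / 6.0                   # Jacobi update x_i = (1/a_ii)(b_i - sum a_ij x_j)
        # ||Ax - b||, with the boundary contributions folded into the neighbour sum
        residual = np.linalg.norm(nbr - 6.0 * u[1:-1, 1:-1, 1:-1])
        u[1:-1, 1:-1, 1:-1] = unew
        if residual < tol:
            break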
Krylov subspace methods are more efficient algorithms for solving sparse linear systems. A large number of iterative algorithms exist, based upon Krylov subspaces, with common ones being Conjugate Gradient (CG) [2] and the Generalized Minimal RESidual method (GMRES) [3]. It is very common for scientific codes to use one of these solver kernels and some excellent toolkits exist to facilitate this. One such library, the Portable, Extensible Toolkit for Scientific Computation (PETSc) [9], is a suite of data structures and routines developed by Argonne National Laboratory for use in finding parallel solutions to scientific problems modelled by partial differential equations. Toolkits such as PETSc support a variety of Krylov subspace methods and allow scientific programmers to utilise these more advanced, performant, iterative solvers. Additionally, low level aspects of parallelism such as process synchronisation and communication are abstracted away by the library.
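As a small illustration of delegating a solve to a library Krylov solver, the sketch below uses SciPy's GMRES purely as a stand-in for a PETSc KSP solve (SciPy is chosen only to keep the example self-contained, and the matrix is an arbitrary diagonally dominant test matrix, not the Laplace operator used in the paper):

    # Handing a sparse system to a library Krylov solver.  SciPy's GMRES
    # stands in for a PETSc KSP solve; the matrix is illustrative only.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 200
    A = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
                 [-1, 0, 1], format="csr")
    b = np.ones(n)

    x, info = spla.gmres(A, b, maxiter=200)   # info == 0 indicates convergence
    print(info, np.linalg.norm(b - A @ x))    # residual norm of the returned solution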
Figure 1 shows the relative residual against time when using GMRES from PETSc for solving the 3D Laplace equation with Dirichlet boundary conditions on a Cray XE6 with different numbers of cores, using one MPI process per core.

Figure 1: GMRES relative residual against time for different core counts (1024, 4096 and 16384 cores)
Since in each case the local domain is 50 × 50 × 50 points per core, this is a weak scaling test. It can be seen that as the core count increases the performance drops. When run over 4096 cores it takes 2.88 times as long to solve the system as it did over 1024 cores (although the system is larger due to weak scaling). This performance decrease gets worse: on 16384 cores the runtime is 6.34 times that of 1024 cores to find a solution. Some, but not all, of the increase in execution time is due to a rise in the number of iterations required. The average time per iteration over 1024 cores is 0.05 seconds; when run on 16384 cores this has increased to 0.1 seconds. This suggests that existing, conventional, iterative solution methods will not scale to the core counts that will likely be required at Exascale. We were unable to run this test on 32768 cores as the code failed with an out of memory error.

    Cores    Iterations to solution
    1024     772
    4096     1404
    16384    1767

Table 1: GMRES iterations to a relative residual of 10−

Table 1 illustrates the number of iterations required to reach a relative residual of 10− for this problem when solved using GMRES. It can be seen that, as one increases the number of cores, the problem becomes more difficult to solve as more iterations are needed. The reason for this is that, as we weak scale and the problem size becomes larger, the spectral radius of the matrix also increases, which requires more iterations to convergence.

Often a matrix $A$ has a natural splitting into $A = M - N$ where linear systems involving $M$ can be solved easily. A splitting of $A$ in this manner results in a linear system of the form $M x^{(k+1)} = N x^{(k)} + b$ which starts with an initial guess $x^{(0)}$, and the system converges if the spectral radius of the iteration matrix $M^{-1} N$ is less than 1. A number of common splittings of the matrix $A$ have been developed and many of these have been studied in [4]. If a matrix can be split in multiple ways, i.e. $A = M_i - N_i$ for $i = 1, \ldots, k$, then the multisplitting of $A$ is defined by $B = \sum_{i=1}^{k} D_i M_i^{-1} N_i$. This is an attractive approach because the work for each individual splitting can be assigned to distinct processes.

Two stage algorithms further split the matrices $M_i$ into two components and perform a number of inner iterations on these. A variety of work has been done investigating factors such as whether to fix the number of inner iterations, or to vary it as the number of outer iterations progresses [5]. Another choice is that of communication, and [6] considers making the outer iterations asynchronous, where processors are allowed to start the computation of the next inner iteration without waiting for the completion of the same outer iteration on other processes. In this case, the previous iteration's data is not guaranteed to be available to all processes: the latest received, but not necessarily most up to date, values will be used.

Two stage multisplitting such as [7] concentrates on using a simple iterative method such as Jacobi to solve the inner stage. However, the performance of these approaches is severely limited and so a two stage Krylov multisplitting algorithm [8] has been developed. Note that much of the literature in this area is of a theoretical nature, typically focusing on proving convergence criteria, and does not often address how the splittings, communication modes, or the number of inner iterations affect real world performance or scaling.
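To make the multisplitting definition concrete, the sketch below builds two convergent splittings of a small, strictly diagonally dominant matrix and applies the weighted iteration $x^{(k+1)} = \sum_i D_i M_i^{-1} (N_i x^{(k)} + b)$ with equal weights. The matrix, weights and iteration counts are illustrative assumptions, not values from the paper:

    # Illustrative multisplitting iteration: A = M1 - N1 (forward sweep
    # style) and A = M2 - N2 (backward sweep style), combined with equal
    # diagonal weights D1 + D2 = I.
    import numpy as np

    n = 50
    A = np.diag(4.0 * np.ones(n)) \
        + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    b = np.ones(n)

    M1 = np.tril(A); N1 = M1 - A        # so that A = M1 - N1
    M2 = np.triu(A); N2 = M2 - A        # so that A = M2 - N2
    D1 = D2 = 0.5 * np.eye(n)           # weighting matrices

    x = np.zeros(n)
    for k in range(200):
        x = D1 @ np.linalg.solve(M1, N1 @ x + b) \
          + D2 @ np.linalg.solve(M2, N2 @ x + b)
        if np.linalg.norm(b - A @ x) < 1e-8:
            break
    print("iterations:", k + 1, "residual:", np.linalg.norm(b - A @ x))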
The hybrid algorithm

In [1] we showed that as we increased the core count, using asynchronous communication in the point Jacobi algorithm becomes beneficial. However, point Jacobi is of little practical interest: the benefits gained from using asynchronous communication are far outweighed by the slow convergence of the algorithm.

It is possible to write a linear system in block form, where each block $X_i$ is made up of a number of individual elements $x_i$, and the matrix $A$ is split as shown in Figure 2. The Jacobi iterative algorithm can then be rewritten in terms of blocks as

$$X_i^{(k)} = A_{ii}^{-1} \Big( B_i - \sum_{j \neq i} A_{ij} X_j^{(k-1)} \Big).$$

Figure 2: Linear system rewritten as blocks

For each block we now need to solve the linear system $A_{ii} X_i^{(k)} = B_i - \sum_{j \neq i} A_{ij} X_j^{(k-1)}$. One can see that if the rows of $A$ are distributed across processors in the same way as $x$, then the only communication required to assemble the right hand side is collecting the necessary blocks $X_j$. Assuming domain decomposition and local operators, this requires only halo swaps between neighbouring domains, with the inner solve of the block systems being fully local.

Our hybrid approach is a two stage multisplitting algorithm where the inner solve of each block can be done by a group of processes using some efficient algorithm such as CG or GMRES. The inter- and intra-block concerns are separated and, whilst the intra-block solve is done synchronously, the halo swap between block groups after each block Jacobi iteration can be done asynchronously. Listing 1 illustrates a pseudo code example of this block scheme. This inter-block communication is implemented in our framework on a processor to processor basis rather than funnelling all communications via a master, and uses the same techniques as in [1]. If new data is available after an asynchronous halo swap then this will be used in the next inner block solve; if not, the inner solve will continue using existing data from a previous iteration. The asynchronous halo swap maintains $R$ buffers ($R = 100$ in the experiments for this paper) which hold sent and received halo data for each neighbour, along with the MPI asynchronous communication requests. At all stages there are $R$ outstanding receives registered for each neighbour, and at each halo swap old sends are cleaned up and their buffer space returned to the pool. If there is buffer space available then an additional asynchronous halo send to the neighbour is issued, and a check is then made as to how many pending receives have completed since the last halo swap. If any have completed then data from the latest of these receives is copied into the actual solution halo and the pending receives are then reissued.
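The buffer-pool mechanism described above can be sketched as follows. This is a simplified illustration written with mpi4py (the framework itself is not written in Python): it keeps a fixed pool of posted receives per neighbour, reuses send buffers once their requests complete, and copies completed receives into the halo. The pool size, halo length and tag are illustrative assumptions, and the 3D halo handling of the real framework is not modelled.

    # Simplified sketch of an asynchronous halo swap with a pool of R
    # buffers per neighbour (illustrative only).
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    R = 4                      # pool size (the paper uses R = 100)
    HALO_LEN = 2500            # e.g. one 50 x 50 face
    TAG = 7

    class NeighbourChannel:
        def __init__(self, neighbour_rank):
            self.rank = neighbour_rank
            self.recv_bufs = [np.empty(HALO_LEN) for _ in range(R)]
            # Keep R outstanding receives registered for this neighbour
            self.recv_reqs = [comm.Irecv(buf, source=neighbour_rank, tag=TAG)
                              for buf in self.recv_bufs]
            self.sends = []    # in-flight (request, buffer) pairs

        def halo_swap(self, outgoing, halo):
            # Clean up completed sends, returning their buffer space to the pool
            self.sends = [(req, buf) for (req, buf) in self.sends if not req.Test()]
            # Issue a new asynchronous send if there is buffer space available
            if len(self.sends) < R:
                buf = np.array(outgoing, dtype=np.float64)
                self.sends.append((comm.Isend(buf, dest=self.rank, tag=TAG), buf))
            # Copy any completed receives into the halo and repost them
            refreshed = False
            for i, req in enumerate(self.recv_reqs):
                if req.Test():
                    halo[:] = self.recv_bufs[i]
                    self.recv_reqs[i] = comm.Irecv(self.recv_bufs[i],
                                                   source=self.rank, tag=TAG)
                    refreshed = True
            return refreshed   # True if the halo was updated this swap

A process would call halo_swap once per outer iteration for each neighbour, passing its current face data and the (numpy) halo array to be refreshed; if no receive has completed, the previous halo contents are simply reused.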
The relative residual is obtained for each block, and from this the global relative residual is the square root of the sum of the squares of the block local relative residuals. Communication of the block local residuals can be done using a tree-based asynchronous reduction, as described in [1], which means that processes have an estimate of the global residual, but it might be some number of iterations old. It should be noted that the inbuilt MPI synchronous allreduce outperforms our own asynchronous allreduce; this is because MPI collective communications are heavily tuned and often optimised in hardware. The new MPI 3.0 standard contains non-blocking collective communications which could be used instead of our asynchronous reduction, and would likely be more efficient.

    loop while global residual is larger than threshold
        perform inner KS block solve
        halo swap at block level (asynchronous communication)
        recompute global residual (asynchronous communication)

Listing 1: Pseudo code of the block scheme

As discussed in Section 2, point Jacobi scales well, but is slow to converge. More performant methods such as CG and GMRES, as seen in Section 3, might perform well on medium core counts but do not scale to the level required for Exascale. Our proposed asynchronous two stage multisplitting algorithm aims to achieve the best of both worlds: the scalability of Jacobi (at the outer stage) and the good convergence of KS methods (at the inner stage).
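As a concrete, serial and synchronous sketch of the scheme in Listing 1, the code below partitions a small system into two blocks, performs a bounded number of inner GMRES iterations per block, assembles the off-block contributions from the previous iterate, and combines the block residuals into a global residual as the square root of the sum of squares. SciPy's GMRES stands in for the PETSc inner solver; the matrix, block count and iteration limits are illustrative assumptions, and the asynchronous communication of the real framework is not modelled.

    # Serial sketch of two-stage block Jacobi: an inner Krylov solve per
    # block (SciPy GMRES standing in for PETSc) inside an outer loop.
    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n, nblocks, inner_its, tol = 200, 2, 10, 1e-8
    A = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
                 [-1, 0, 1], format="csr")
    b = np.ones(n)
    x = np.zeros(n)

    # Contiguous row partition defining the blocks
    bounds = np.linspace(0, n, nblocks + 1, dtype=int)
    blocks = [slice(bounds[i], bounds[i + 1]) for i in range(nblocks)]

    for outer in range(1000):
        xold = x.copy()
        block_res = []
        for blk in blocks:
            # Right hand side B_i - sum_{j != i} A_ij X_j using the previous iterate
            rhs = b[blk] - (A[blk, :] @ xold - A[blk, blk] @ xold[blk])
            # Bounded inner Krylov solve on the diagonal block A_ii
            x[blk], _ = spla.gmres(A[blk, blk], rhs, x0=xold[blk], maxiter=inner_its)
            block_res.append(np.linalg.norm(rhs - A[blk, blk] @ x[blk]))
        # Global residual: square root of the sum of squared block residuals
        if np.sqrt(sum(r * r for r in block_res)) < tol:
            break
    print("outer iterations:", outer + 1)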
We have built a framework which uses PETSc for the inner solver and allows us to control many parameters in order to investigate these two stage multisplitting algorithms. Using PETSc we can easily select between different inner solvers which scale reasonably well.

Whilst the framework itself supports a variety of pluggable problem definitions, in this paper we concentrate on the 3D Laplace equation. A variety of parameters must be selected before the run: the number and dimensions of groups, the number of inner iterations and the size of the block overlap. In our framework, communication involving overlapping solution values is combined with that of normal block halo swapping. Therefore, asynchronous halo swapping also results in asynchronous resolution of overlapping solution values.
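For illustration, the snippet below shows the kind of inner-solver selection PETSc makes possible, using the petsc4py Python bindings on a tiny sequential matrix. The real framework drives PETSc through its C API on distributed data, so this is only a hedged sketch, and the matrix, tolerances and iteration cap are made-up values.

    # Hedged sketch of selecting and capping a PETSc inner solver via
    # petsc4py (the framework itself uses PETSc's C API).
    import sys
    import petsc4py
    petsc4py.init(sys.argv)
    from petsc4py import PETSc

    n = 100
    A = PETSc.Mat().createAIJ([n, n], nnz=3)
    for i in range(n):                        # assemble a simple tridiagonal matrix
        A.setValue(i, i, 4.0)
        if i > 0:
            A.setValue(i, i - 1, -1.0)
        if i < n - 1:
            A.setValue(i, i + 1, -1.0)
    A.assemble()

    x, b = A.createVecs()                     # vectors compatible with A
    b.set(1.0)
    x.set(0.0)

    ksp = PETSc.KSP().create()
    ksp.setOperators(A)
    ksp.setType(PETSc.KSP.Type.GMRES)         # could equally be CG, BiCGStab, ...
    ksp.setTolerances(rtol=1e-8, max_it=15)   # max_it plays the role of the inner iteration cap
    ksp.solve(b, x)
    print("inner iterations used:", ksp.getIterationNumber())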
By separating the concerns of the inter- and intra-block levels we can configure each independently for optimised performance, scalability and resilience. The first, and most obvious, choice is the size of the inner blocks and how the global domain is decomposed. Figure 3 illustrates the relative residual against time for different block configurations over 4096 cores with a problem size of 50 × 50 × 50 per core
running on a Cray XE6 system. In this figure three configurations are plotted: 2 blocks in the x-dimension (gx2), 16 blocks in the x-dimension (gx16) and 16 blocks distributed more evenly between the x-, y- and z-dimensions (gx4gy2gz2). It can be seen that the optimal number of blocks is two, and distributing blocks over all three dimensions is preferable to placing all blocks in one single dimension. The fact that the optimal number of blocks is two with this core count and problem size supports the intuition that, when the limit of conventional solvers is reached, the solution space can then be partitioned into blocks for increased scalability, and at this test configuration the scalability limit of the inner solver has not yet been reached. Solving to a relative residual of 10−, gx2 takes 1106 outer iterations, gx16 requires 2984 outer iterations and gx4gy2gz2 results in 2298 outer iterations. It can be seen that, as we increase the number of groups, the number of outer iterations required to solve the problem increases. This supports the intuition that, as one splits the matrix for these groups, the spectral radius of the iteration matrix increases, and hence more iterations are required to reach convergence. It is an interesting result that distributing the blocks over the dimensions (gx4gy2gz2) rather than in one dimension (gx16) is preferable, both in terms of runtime and number of outer iterations. This is likely due to the fact that distribution in one dimension only means that at each halo swap blocks receive data from at most two neighbours, and progress made in the rightmost block takes a number of iterations to reach the leftmost block. Contrasted against distributing the blocks amongst all dimensions, halo swaps involve more neighbours and so updates can percolate throughout the system much faster.
Figure 3: Relative residual against time for different group configurations, inner iterations=10 and no overlap

Another important inter-block factor is deciding what the inner solve threshold should be, i.e. when the inner solve is terminated and a halo swap performed before the next block Jacobi iteration. This is a trade-off between the frequency of halo swapping between blocks and the progress made at each inner solve. Figure 4 illustrates the relative residual against time for different numbers of maximum inner GMRES iterations for a problem size of 50 × 50 × 50 per core
running on 4096 cores. It can be seen that the best choice depends heavily on the progression of the solution with time. Both 5 and 50 maximum inner iterations perform favourably initially, but when the relative residual reaches a certain point then performance decreases considerably, and the choice of 15 maximum inner iterations is better for this problem when solving to higher precision. The optimal choice for the inner block solver convergence criterion is problem dependent; currently we allow the number of inner iterations to be fixed with one specific value for the entire solve, and this value must be tuned manually. The number of outer iterations is heavily linked to the number of inner iterations: with a smaller number of inner iterations the number of outer iterations is large, and a larger number of inner iterations results in a smaller number of outer iterations. For instance, solving to a relative residual of 7 × 10− results in 3000 outer iterations when the number of inner iterations is set to 5, 1035 outer iterations with 15 inner iterations and 360 outer iterations with 50 inner iterations.
Figure 4: Relative residual against time for different numbers of maximum inner iterations, gx=2, gy=gz=1 and no overlap

At the inter-block level, the ability to overlap the solution space at the block level can help accelerate block convergence. Multisplitting was first introduced in [4] where it was shown that overlapping blocks are still guaranteed to converge, and may do so faster. A scheme is needed for determining the overall value of an element by weighting contributions from multiple overlapping blocks, and when considering a complex data decomposition, several groups can overlap the same element. Figure 5 illustrates the relative residual against time for different overlaps over 4096 cores. It can be seen that some overlapping of the solution space does improve performance, although the optimal choice is difficult to determine a priori. The performance when overlapping the solution spaces by ten elements, which initially is very good, starts to degrade as we go to more accurate solutions. Generally it has been found that overlapping by between one and five elements in each dimension is a reasonable choice. The amount of overlap has an impact upon the number of outer iterations required to solve the problem. Solving to a relative residual of 2 × 10−, with no overlap we see 2341 outer iterations, with an overlap of 1 element this decreases to 1771 outer iterations, an overlap of 5 elements to 1330 outer iterations and 10 elements to 1920. This helps explain the results in Figure 5: it can be seen that block convergence accelerates, both in terms of runtime and outer iterations, as the width of the overlap is increased up to a point; however, after reaching this point additional overlap starts to slow block convergence. This can be explained by the fact that, regardless of the overlap, the solution size remains the same, so increasing the amount of overlap increases the number of unknowns on each process. The number of additional unknowns per process in 3D, assuming neighbours on all sides, is given by $o(2xy + 2yz + 2zx)$, where $o$ is the size of the overlap and $x$, $y$ and $z$ are the number of unknowns in each dimension. In our experiments $x, y, z = 50$ (125000 unknowns per process). Therefore with an overlap of 1 there are 15000 additional unknowns per process, an overlap of 5 results in 75000 additional unknowns per process and an overlap of 10 requires 150000 additional unknowns.
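For reference, the additional-unknown counts quoted above follow directly from the formula; a two-line check using the values from the text:

    # Additional unknowns per process for overlap o with a 50^3 local domain
    x = y = z = 50
    for o in (1, 5, 10):
        print(o, o * (2*x*y + 2*y*z + 2*z*x))   # 15000, 75000, 150000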
Figure 5: Relative residual against time when overlapping solution space, gx=2, gy=gz=1 and inner iterations=10
Results
We evaluated the block Jacobi approach for the Laplace equation on a Cray XE6. For each core count we have evaluated a conventional GMRES solve against block Jacobi using synchronous and asynchronous halo swapping. There are two blocks in the x-dimension and the local problem size is 50 × 50 × 50 per core
using weak scaling. All other parameters have been tuned to the optimal settings for this problem at that core count.
Figure 6: Relative residual against time for 1024 cores

Figure 6 illustrates the performance results over 1024 cores. It can be seen that at this core count solving the entire system conventionally outperforms the block Jacobi approach by quite some margin. This is no real surprise because, as seen in Figure 1, GMRES is still scaling well at this core count. It can also be seen that asynchronous block Jacobi communication is slightly slower than synchronous communication, with the synchronous version requiring 1395 outer iterations until convergence (10−) and the asynchronous version 1567 outer iterations. Due to the small number of cores, the cost of data being out of date in the asynchronous version, and hence requiring more outer iterations to solution, outweighs the benefits gained by not having to wait for communications to complete in a block halo swap.
Figure 7: Relative residual against time for 16384 cores

The performance results for 16384 cores are illustrated in Figure 7. One can see that the performance difference between solving the entire system conventionally using a single KS method and splitting it into blocks is smaller than on 1024 cores. The synchronous block version requires 680 outer iterations to solution and the asynchronous version 810 outer iterations to convergence. It should be noted that the actual number of iterations is smaller for the 16384 core run compared to the 1024 core run; this is simply because we have tuned the run parameters to be most efficient at the specific scale, and for this larger run we are using 25 inner iterations compared to 10, which results in fewer outer iterations. Nevertheless, the results at this core count indicate that the cost of using out of date data and the increased number of outer iterations still outweighs the benefits gained from asynchronous communications.

Table 2 lists the ratio of execution time for each version compared to its run time on 1024 cores. It can be seen that the versions employing the block Jacobi approach scale better with the number of cores than the conventional solver version. Table 3 illustrates the iteration rate (the number of inner iterations times the number of outer iterations, per second) for each version. It can be seen that not only does the block Jacobi algorithm, regardless of communication method, sustain a higher iteration rate but, comparing the iteration rate between 1024 and 16384 cores, we can see that when using only GMRES the rate decreases by 47.5%, whereas for block Jacobi using synchronous communication it only drops by 25.6% and for asynchronous 26.2%.

    Cores    Conventional    Group Sync    Group Async
    1024     1               1             1
    4096     2.88            1.47          1.57
    16384    6.24            2.39          2.55

Table 2: Ratio of execution time to 1024 cores for each version

    Cores    Conventional    Group Sync    Group Async
    1024     19.20           26.11         26.21
    4096     16.57           20.72         20.78
    16384    10.05           19.41         19.34

Table 3: Iteration rate (iterations per second)
It has been mentioned that a conventional solve using GMRES would not run over 32768 cores with a local problem size of 50 × 50 ×
50 elements due to memory constraints, so we have run the same problem with a local problem size of 20 × 20 ×
20 using 32768 cores. The synchronous version required 350 outer iterations and the asynchronous version 392. It can be seen from Figure 8 that, whilst over this much smaller problem size the version using only GMRES performs much better, it is interesting to see that the performance of synchronous and asynchronous block Jacobi is much more comparable. This result is not surprising, as it is often seen that with large numbers of cores asynchronous algorithms scale better than their synchronous counterparts. With greater levels of parallelism, more cores are waiting for each other in the synchronous halo swap and factors such as network latency can become an issue. As explained in Section 3, each block level halo swap involves each process on a face communicating with its neighbouring process on the face of the corresponding block, rather than funnelling all data through a master process for each block, which would not be scalable. This means that synchronous block level halo swaps may only proceed when all of these individual processes have completed communications. Instead, when each process on the block face communicates in an asynchronous fashion, the halo data used is the latest to be received but potentially not the most up to date in the system. Nevertheless, even though the data might not be entirely up to date, this becomes more favourable compared to synchronously waiting for all block level halo communications to complete. It is our expectation that experiments carried out on larger numbers of cores will start to see the asynchronous block level communication outperform the synchronous communication.
Figure 8: Relative residual against time for 32768 cores
Conclusions and future work

We have implemented a framework for solving large, sparse linear systems via a block Jacobi method with asynchronous or synchronous communication. From our investigations the most important tunable parameter has been the method of communication, and we have seen that this has an impact upon the overall performance. For smaller core counts, using GMRES without block Jacobi outperformed our approach quite considerably; however, as the number of cores has been increased this performance gap has closed because the block Jacobi approach scales better. With core counts up to and including 16384, synchronous block Jacobi communication outperforms asynchronous communication. However, as we move to 32768 cores, albeit with a smaller local problem size, both perform equally well.

Additional advantages of asynchronous block Jacobi at Exascale include resilience to performance faults such as slow cores, slow nodes or slow network links. Traditional synchronous solution methods are limited to the iteration rate of the slowest process, but those based on asynchronous communication can continue to progress even if a group of processors is not contributing for some period of time.

From our results we have seen that as we increase the core count both the block Jacobi method and asynchronous communication are starting to become competitive. It is our intent to investigate further with much larger core counts and, as traditional solution methods reach their limits, to partition the solution space into larger numbers of blocks to enable further scaling. We will investigate the performance for more realistic problems, and try to reduce the number and sensitivity of the tunable parameters. Another possibility is to investigate auto-tuning of these parameters at runtime. Not only will this avoid the difficulty of estimating these values a priori, it will also allow the chosen parameters to be modified during the solve: as we saw in Figure 4, certain choices perform well initially but then worsen as the solution progresses. Selecting appropriate values at different stages of the calculation should result in an improvement in overall performance.
References

[1] I. Bethune, M. Bull, N. Dingle, N. Higham. "Performance analysis of asynchronous Jacobi's method implemented in MPI, SHMEM and OpenMP", International Journal of HPC Applications, to appear, 2013.

[2] M. Hestenes, E. Stiefel. "Methods of Conjugate Gradients for Solving Linear Systems", Journal of Research of the National Bureau of Standards, 49, 1952.

[3] Y. Saad, M. Schultz. "GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems", SIAM J. Sci. Stat. Comput., 7:856-869, 1986.

[4] D. O'Leary, R. White. "Multi-splittings of matrices and parallel solution of linear systems", SIAM J. Alg. Disc. Meth., 6:630-640, 1985.

[5] A. Frommer, D. Szyld. "H-Splittings and two-stage iterative methods", Numerische Mathematik, Volume 63, Number 1, Page 34, 1992.

[6] A. Frommer, D. Szyld. "Asynchronous Two-Stage Iterative Methods", 1994.

[7] D. Daoud, B. Wade. "A two stage multisplitting method for non-overlapping domain decomposition for parabolic equations", 12th International Conference on Domain Decomposition Methods, 2001.

[8] C. Huang. "A Krylov multisplitting algorithm for solving linear systems of equations", Linear Algebra and its Applications, Volume 194, 15 November 1993, Pages 9-29.

[9] S. Balay, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik, M. Knepley, L. Curfman, B. Smith, H. Zhang. "PETSc Users Manual", ANL-95/11 Revision 2.3.2, Argonne National Laboratory, 2006.

[10] D. Chazan, W. Miranker. "Chaotic relaxation", Linear Algebra and Its Applications, 2:199-222, 1969.