A Decentralized Parallelization-in-Time Approach with Parareal
Technical report
Martin Schreiber
University of Exeter, CEMPS, United Kingdom, [email protected]
Adam Peddle
University of Exeter, CEMPS, United Kingdom, [email protected]
Terry Haut
Los Alamos National Laboratory, CNLS, MS B258, Los Alamos, [email protected]
Beth Wingate
University of Exeter, CEMPS, United Kingdom, [email protected]
Abstract: With steadily increasing parallelism in high-performance architectures, simulations requiring good strong scalability are prone to hit a scalability limit with standard spatial-decomposition strategies at a certain number of parallel processors. This can be a show-stopper if the simulation results have to be computed under wallclock-time restrictions (e.g. for weather forecasts) or as fast as possible (e.g. for urgent computing). Here, the time dimension is the only one left for parallelization, and we focus on Parareal as one particular parallelization-in-time method. We discuss a software approach for making the Parareal parallelization transparent to application developers, hence allowing fast prototyping for Parareal. Further, we introduce a decentralized Parareal which results in autonomous simulation instances that only require communication with the previous and next simulation instances, hence with strong locality of communication. This concept is evaluated with a prototypical solver for the rotational shallow-water equations which we use as a representative black-box solver.
Keywords: high-performance computing, parallelization in time, Parareal, decentralized
I. INTRODUCTION
Over the last decade, improvements in the performance of simulations executed on supercomputers have been accomplished by increasing the number of parallel data processing pipelines at the core as well as at the instruction level. This is in contrast to previous decades, where performance was mainly improved by increasing the CPU clock rate, which has nowadays almost stagnated (see [1]). This recent type of architectural development has a significant impact on strong scaling problems: the spatial decomposition is at some point dominated by the communication latencies at a fixed number of processors, and no further improvement in performance can be achieved through the utilization of more computing cores. In combination with MPI-related restrictions, using more cores can even lead to less performance. Since the trend of increasing supercomputer performance through more data parallelism is likely to continue, this will have a significant impact on the future of HPC applications, in particular for problems with strong scaling requirements. In this paper, we focus on simulations with run-time requirements such as sub-realtime execution (e.g. for weather and climate). With the aforementioned tendency to increase the number of parallel data processing pipelines, this requires exploiting new ways of parallelisation, including those gained by using insights from novel mathematical formulations.

Our work is based on the parallel-in-time iterative method called 'Parareal' [2]. Here, the simulation time is divided into coarse time intervals. Two different propagators are used: a fine propagator (also the default for standard space-parallelisation methods) and a coarse propagator, which has to be of lower complexity than the fine propagator over the coarse time interval. The coarse propagator is used to compute an approximation of the solution at the start of each coarse time interval. The Parareal parallel-in-time method then uses the coarse propagator to estimate solutions at the end of the coarse time intervals. This is followed by a combination of fine and coarse propagators in each coarse time interval, which is used as an iterative method to improve the approximated solution. This can be executed massively in parallel since the computations on all intervals are independent. Such an approach can be implemented event-based with a dynamic task scheduling library [3] or with a centralized manager-worker task distribution [4]. However, the potential of the locality properties of the data flow in the Parareal method was so far not considered in these works. In this work we use a black-box solver which represents an experimental implementation of the rotating shallow water equations (RSWE).

II. PARAREAL
Here, we give an overview of the Parareal algorithm which is employed in this work for the parallelization-in-time. This algorithm was initially presented in [2], but has its roots in earlier works by [5]. Particular attention has been paid to the parallel implementation of the Parareal method by [6]. For a more detailed review of the history of the Parareal method, the reader is referred to [7]. A sketch of the algorithm for ODEs is given in Fig. 2.

Figure 1. Water surface height of the initial condition given by a Gaussian distribution (top image) and selected solutions of the rotational shallow water equations (RSWE) which we solve in this work (bottom image).

Figure 2. Sketch of the Parareal algorithm for the second iteration with the focus on the third coarse time interval: After computing the coarse time step (1) and the fine time stepping (2) as part of the first iteration, the difference is computed and buffered (3). Then, the next coarse time step is executed based on updated initial values for this coarse time interval (4). The result of this coarse time step is then corrected with the previously computed difference (5).
Consider some general system of partial differential equations of the form

dU/dt = f(U),   U(0) = U_0,   t \in [0, T],   (1)

where f(U) is some differential operator which is not necessarily linear. The Parareal algorithm is defined by two propagation operators over the time interval [t_n, t_{n+1}]: G(t_n, U_n), termed the coarse propagator, and F(t_n, U_n), termed the fine propagator. For the first time step, the coarse propagator provides a coarse approximation to the solution with the initial condition U(t_0) = U_0, while the fine operator provides U(t) with the desired accuracy. We begin with some initial approximation, U^0_n, for n = {0, 1, 2, ..., N} corresponding to times t_n. This approximation is found by the application, in series, of the coarse propagator, i.e.:

U^0_{n+1} = G(t_n, U^0_n),   U^0_0 = U_0.   (2)

We then apply the correction iteration for k = 0, 1, 2, ...:

U^{k+1}_{n+1} = G(t_n, U^{k+1}_n) + F(t_n, U^k_n) - G(t_n, U^k_n).   (3)

We shall term equation (3) the Parareal algorithm. We note that as k \to \infty, it converges to

U_N = \hat{F}(t_N, t_0, U_0),   (4)

with \hat{F} computing the fine time steps from t_0 to t_N. That is to say, the Parareal algorithm converges to the accuracy of the fine propagator. It has been proposed that the use of the Parareal algorithm permits this level of accuracy to be achieved more quickly in terms of wallclock time. Only the first step (equation (2)) must be performed sequentially in time. For equation (3), there is no requirement of serial time, and so processors which would otherwise be unused may now be used to refine the approximation.

It is worth noting that the fine propagator must solve the governing equation fully and to the desired accuracy. Several different approaches have been taken to the coarse propagator, however. G may in practice derive from a coarser timestep (e.g. [2]), a coarser space discretisation (e.g. [8]), a simpler physical model (e.g. [9]) or an exponential integrator (e.g. [10]), on which we focus further.
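To make the structure of the iteration concrete, the following C++ sketch applies the coarse sweep (2) and the correction iteration (3) to the scalar model problem dU/dt = lambda*U. This is a minimal, serial illustration only: the model problem, the Euler-based propagators and the fixed number of iterations are assumptions and are unrelated to the RSWE solver used later.

    #include <cmath>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Minimal Parareal sketch for the scalar model problem dU/dt = lambda*U.
    // "State" and the propagators below are illustrative stand-ins only.
    using State = double;
    using Propagator = std::function<State(double t0, double t1, State u)>;

    int main() {
        const double lambda = -1.0, T = 1.0, U0 = 1.0;
        const int N = 10;                // number of coarse time intervals
        const double dT = T / N;         // coarse interval length

        // Fine propagator F: many small explicit Euler steps (stand-in for the
        // expensive, accurate solver). Coarse propagator G: a single Euler step.
        Propagator F = [&](double t0, double t1, State u) {
            const int m = 100;
            const double dt = (t1 - t0) / m;
            for (int i = 0; i < m; ++i) u += dt * lambda * u;
            return u;
        };
        Propagator G = [&](double t0, double t1, State u) {
            return u + (t1 - t0) * lambda * u;
        };

        // Equation (2): serial coarse sweep provides the initial guess U^0_n.
        std::vector<State> U(N + 1), U_prev(N + 1), G_prev(N + 1);
        U[0] = U0;
        for (int n = 0; n < N; ++n) U[n + 1] = G(n * dT, (n + 1) * dT, U[n]);

        // Equation (3): Parareal correction iterations. The fine propagations use
        // only data of the previous iteration and are independent over n, hence
        // they could be executed in parallel (here they run serially).
        for (int k = 0; k < 5; ++k) {
            U_prev = U;
            for (int n = 0; n < N; ++n) G_prev[n] = G(n * dT, (n + 1) * dT, U_prev[n]);
            for (int n = 0; n < N; ++n) {
                const State Fu = F(n * dT, (n + 1) * dT, U_prev[n]);
                U[n + 1] = G(n * dT, (n + 1) * dT, U[n]) + Fu - G_prev[n];
            }
            std::printf("k=%d: U(T)=%.8f (exact %.8f)\n", k, U[N], U0 * std::exp(lambda * T));
        }
        return 0;
    }

In the decentralized implementation described next, F and G are not function objects but the fine and coarse timestepping methods of a black-box simulation instance, and the loop over the coarse intervals is distributed over MPI ranks.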
III. DECENTRALIZED PARAREAL

We introduce a software approach which allows reuse of the software for implementations of different kinds of solvers, and we present the required interfaces in Section III-A. The Parareal controller implements the logic behind the decentralized Parareal implementation and is presented in Section III-B.

A. Simulation layer
Each rank executes one instance of the simulation for a given coarse time interval. With the Parareal algorithm given in its generic form (see [2] and Section II), the MPI parallelization can be hidden from the simulation developer. In this section, we describe the required interfaces with a focus on making the parallelization-in-time via MPI transparent to the simulation developer. Please note that for the sake of clarity, we skip the description of debugging and plotting features. We group the interfaces into three different types: setup, timestepping, and Parareal difference/correction. Several buffers are used; they are denoted by u_{S,F,C,D,O} (Start, Fine, Coarse, Difference, Output).
1) Setup:
The setup routines either depend on the initial conditions at t := 0 or on simulation data forwarded by a previous coarse time frame.
• constructor(): Constructor method for one-time-only initialization of the simulation instance.
• setSimulationTimeframe(t_start, t_end): Set the time frame for the coarse time interval.
• setupInitialValues(): Set up the initial values at t = 0.
• setSimulationData(data): Set simulation data u_S := data.
The constructor initializes the simulation only once for each rank. This allows an efficient sliding window: only the new simulation time frame has to be set via setSimulationTimeframe and the new initial values via setSimulationData, without requiring reinitialization of e.g. FFT computations.
2) Timestepping:
The timestepping interfaces are required to execute the fine and coarse timesteps. The results of these timestepping methods are then made available via getters.
• runTimestepFine(): Compute the solution at t_end with the fine timestepping method: u_F := F(t_end, t_start, u_S).
• runTimestepCoarse(): Compute the solution at t_end with the coarse timestepping method: u_C := G(t_end, t_start, u_S).
• getDataTimestepFine(): Return the solution u_F.
• getDataTimestepCoarse(): Return the solution u_C.
3) Parareal difference/correction:
Finally, the solutionsof the different timestep methods have to be merged together(see Eq. (3)) in a certain way without race conditions whichcan be accomplished by the following interfaces: • computeDifference() Compute the difference between the fine and coarsetime stepping u D := u F − u C . fi rst in slidingwindow › fi ne time step setup › setup initial values at t=0› run timestep coarse› send coarse timestep data last converged › recv data idle › recv data› set simulation data› run timestep coarse› send coarse timestep data follower insliding window › fi ne time step› compute error› recv coarse data› run timestep coarse› correction of coarse data› convergence check› if converged: status = idle› send output data exit c o n v e r g e d a nd l a s t t i m e i n t e r v . last coarsetime intervalsimulation outputdata available andnot fi rst in slidingwindow convergedand not lasttimesliceno convergence c o n v e r g e d m o r e c o a r s e t i m e i n t e r v a l s l e f t exit signalreceived s i m u l a t i o n d a t a a v a il a b l e a n d fi r s t i n s li d i n g w i n . Figure 3. Overview of the different states of the Parareal controllersimulation instances. Each box represents one of the six states. A shortdescription of the most important operations executed for each state is givenin below each state box. The transitions depend on the convergence or statebehaviour. The receive/send operations are done from/to the previous/nextranks only. • computeOutputData() Compute the data to be forwarded to the next timestepby applying a correction to the solution from thecoarse timestep: u O := u C + u D . Note, that u C isbased on the coarse timestep executed after calling computeDifference . • getOutputData() Return the reference to the data u O to be forwarded tothe next coarse timestep interval. • getErrorEstimation() Return a scalar value as an error estimation. This istypically based on a norm of the computed solutionand is required for the convergence test.These interfaces contain no information on the adjacentMPI ranks and hide the connectivity from the simulationdeveloper. Next, we discuss the logic which triggers theexecution of these interfaces and which orchestrates severalcoarse timestep intervals.
Next, we discuss the logic which triggers the execution of these interfaces and which orchestrates several coarse timestep intervals.

B. Parareal controller
The Parareal controller implements the entire logic behind our decentralized Parareal approach. It is mainly based on a state machine with the transitions depending on (a) the current state, (b) the state information forwarded by the previous rank and (c) the convergence test. After initialization, the first rank is set to the [setup] state and all other ranks to the [idle] state. All possible states are discussed in more detail in the following list and an overview is given in Fig. 3.

Figure 3. Overview of the different states of the Parareal controller simulation instances. Each box represents one of the six states. A short description of the most important operations executed for each state is given below each state box. The transitions depend on the convergence or state behaviour. The receive/send operations are done from/to the previous/next ranks only.
• [setup]
The first rank i = 0 is set to the [setup] state. This triggers the setup of the initial values at t = 0. Then, a coarse timestep is computed and the data is forwarded to rank 1. After this setup, the state changes to [first in sliding window].
• [first in sliding window]
A fine timestep is executed. Since this is the first coarse time interval in the sliding window, further Parareal iterations would not yield an improvement in the solution. Therefore, the solution of the just computed fine timestep is forwarded to the next rank and the state is set to [idle].
• [follower in sliding window]
A follower in the sliding window first executes the fine timesteps (runTimestepFine). Then, the difference between the solutions of the fine and coarse timesteps is computed (computeDifference). This is followed by waiting for new simulation data at t_start from the previous rank, which is then used for executing a coarse timestep (runTimestepCoarse). The solution of the coarse timestep is then corrected by the previously computed difference and the result is forwarded to the next rank (computeOutputData). The state change depends on the previous coarse time interval: In case of no convergence of the previous coarse time interval, the state is unchanged. With a convergence in the previous coarse time interval and the current one, the state is changed to [idle]. Otherwise, the state of the coarse time interval becomes [first in sliding window].
• [idle]
An idling state checks for messages from previous ranks. Due to our asynchronous and decentralized approach, it is possible that more than one simulation data message is already enqueued in the receive buffer. Therefore, we probe for such additional messages and, in case new simulation data is already available, we drop the previous one and read the next message. Depending on the state of the previous rank, the state is changed to [follower in sliding window] or [first in sliding window]. In case of receiving a converged state from the previous rank, the state is changed to [last converged].
• [last converged]
This state can only be reached if the last coarse time interval in the entire simulation time frame has been reached. A transition to this state is either triggered via the first/follower states in case of a convergence of the last coarse time interval, or by receiving this state from the previous coarse time interval (see transition from [idle]). During this state, messages from previous ranks are still received to assure that no network congestion occurs. After the transition to this state, the [exit] state is sent to the next rank, which can receive this message only if all other simulation data messages have been read.
• [exit]
With the algorithm presented for [last converged], a transition to [exit] is done on receiving the [exit] signal. After assuring that all messages in the sending queue have been sent, this instance of the coarse time interval of the Parareal simulation exits.
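The following C++ sketch summarizes the six states and the transition rules described above; the enum and function names and the boolean status flags are assumptions for illustration, and the MPI message exchange which drives these transitions is omitted.

    // Sketch of the controller states and of the state transitions described above.
    enum class State {
        Setup,                   // rank 0 only: initial values at t = 0, first coarse step
        FirstInSlidingWindow,    // head of the sliding window: fine timestep only
        FollowerInSlidingWindow, // regular Parareal correction, Eq. (3)
        Idle,                    // waiting for simulation data from the previous rank
        LastConverged,           // last coarse interval converged; drain remaining messages
        Exit                     // leave the controller loop
    };

    // Transition of a follower after it has forwarded its corrected output data.
    State followerTransition(bool prev_converged, bool this_converged, bool last_interval) {
        if (!prev_converged)
            return State::FollowerInSlidingWindow;          // keep iterating
        if (this_converged)
            return last_interval ? State::LastConverged     // end of the simulation time frame
                                 : State::Idle;             // wait for a new coarse interval
        return State::FirstInSlidingWindow;                 // become the new head of the window
    }

    // Transition of an idle rank once a message from the previous rank has arrived;
    // one plausible mapping consistent with the description above.
    State idleTransition(bool prev_reported_converged, bool prev_was_first) {
        if (prev_reported_converged)
            return State::LastConverged;
        return prev_was_first ? State::FirstInSlidingWindow
                              : State::FollowerInSlidingWindow;
    }

Since every transition depends only on the local convergence test and on the status received from the directly preceding rank, the communication pattern keeps the strong locality mentioned in the abstract: each rank exchanges messages with its previous and next rank only.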
IV. RESULTS
We conducted several studies based on an experimental implementation of the rotational shallow water equations in combination with our decentralized Parareal parallelisation (Sec. III). These studies were focused on a particular set of parameters which are set as follows: The benchmarks were conducted with different resolutions r ∈ {8, 16, 32, 64, 128} for the simulation domain. We use a fine time step size which is sufficiently small for stability for all values of r regarding the CFL condition, using an exponential integrator for the linear part. For the Parareal method, we use a coarse time step size 100 times larger, hence we execute 100 fine time steps within a single coarse time step. We use a Gaussian distribution exp(−((x − π)^2 + (y − π)^2)) for the initial values. The simulation is executed over 40 seconds of wave propagation. The convergence test is based on the data which is forwarded to the next coarse time interval. Here, we use the minimum of the L and L_max norm and compare it against a fixed convergence threshold. For the sake of reproducibility, the source code for the Parareal framework is available for download [11]. The black-box RSWE solver can be requested from the second author.

We conducted scalability benchmarks for up to 128 cores with the results given in Fig. 4. Regarding a parallelization-in-space, we expect that there will be no scalability possible across several MPI compute nodes for the considered problem sizes. Indeed, scalability even on shared-memory many-core systems, which would not suffer from MPI communication overheads, is hardly feasible due to a two-dimensional problem size of at most 128 × 128. Here, the non-parallelizable parts in the parallel execution (threading overheads, cache synchronisation, NUMA effects, bandwidth limitation, etc.) would dominate and significantly restrict the scalability already on such shared-memory systems. Therefore, we focus in the following on a parallelization-in-time only.

The runtime was restricted to 30 minutes to account for real-time requirements, which was the original motivation of the Parareal approach. For the single-core performance, we only used the fine timestepping method without any communication and Parareal overheads. This performance is used as the baseline in the following. We can see an increase of wallclock time for executing the simulation on two cores. We account for that by (a) the additional time required for the coarse time stepping, (b) the communication overheads of sending the simulation data to the next MPI rank and (c) the convergence test which requires at least two iterations. In particular because of issue (c), there cannot be any performance increase of the Parareal method with only two coarse time steps when using a convergence test. By using four coarse time intervals in the sliding window, we already gain a robust speedup for all considered resolutions. Utilizing 128 cores for the parallelization-in-time, we obtain speedups for all considered resolutions (see Fig. 4).
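As an illustration of the convergence test, a minimal sketch is shown below. It assumes that the error estimate is obtained from the change of the forwarded data between two successive Parareal iterations; the function name, the concrete pair of norms (root-mean-square and maximum norm) and the default threshold are assumptions and do not reproduce the exact values used in the benchmarks.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch of a convergence test: the minimum of two norms of the change between
    // two successive iterations of the forwarded data is compared to a threshold.
    bool hasConverged(const std::vector<double> &u_prev,
                      const std::vector<double> &u_curr,
                      double threshold = 1e-8)   // illustrative threshold
    {
        double sum_sq = 0.0, max_abs = 0.0;
        for (std::size_t i = 0; i < u_curr.size(); ++i) {
            const double d = u_curr[i] - u_prev[i];
            sum_sq += d * d;
            max_abs = std::max(max_abs, std::fabs(d));
        }
        const double rms = std::sqrt(sum_sq / u_curr.size());
        return std::min(rms, max_abs) < threshold;
    }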
Figure 4. Wallclock time for solving the RSWE with different resolutions (res = 8x8, 16x16, 32x32, 64x64, 128x128). Since we focus on problems with a limitation on the time-to-solution, we plot the results with the wallclock time on the y-axis for a better comparison. The wallclock time for a single core was computed with the fine timestepping only and the overall runtime was restricted to 30 minutes. We see an increase of runtime with two cores, which is due to the additional computational costs of the coarse time step and the communication overheads. For a higher number of cores, up to 128 cores, there is a robust decrease of computation time.

REFERENCES

[1] P. M. Kogge and T. J. Dysart, "Using the top500 to trace and project technology and architecture trends," in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 28.
[2] J.-L. Lions, Y. Maday, and G. Turinici, "Résolution d'EDP par un schéma en temps pararéel," Comptes Rendus de l'Académie des Sciences, Series I, Mathematics, vol. 332, no. 7, pp. 661–668, 2001.
[3] W. R. Elwasif, S. S. Foley, D. E. Bernholdt, L. A. Berry, D. Samaddar, D. E. Newman, and R. Sanchez, "A dependency-driven formulation of parareal: parallel-in-time solution of PDEs as a many-task application," in Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers. ACM, 2011, pp. 15–24.
[4] E. Aubanel, "Scheduling of tasks in the parareal algorithm," Parallel Computing.
[5] J. Nievergelt, "Parallel methods for integrating ordinary differential equations," Commun. ACM, vol. 7, no. 12, pp. 731–733, Dec. 1964.
[6] Y. Maday and G. Turinici, "The parareal in time iterative solver: a further direction to parallel implementation," in Domain Decomposition Methods in Science and Engineering. Springer Berlin Heidelberg, 2005, pp. 441–448.
[7] M. J. Gander, "50 years of time parallel time integration," in Multiple Shooting and Time Domain Decomposition, T. Carraro, M. Geiger, S. Körkel, and R. Rannacher, Eds. Springer-Verlag, 2015.
[8] P. F. Fischer, F. Hecht, and Y. Maday, "A parareal in time semi-implicit approximation of the Navier-Stokes equations," in Domain Decomposition Methods in Science and Engineering, ser. Lecture Notes in Computational Science and Engineering, T. J. Barth, M. Griebel, D. E. Keyes, R. M. Nieminen, D. Roose, T. Schlick, R. Kornhuber, R. Hoppe, J. Périaux, O. Pironneau, O. Widlund, and J. Xu, Eds. Springer Berlin Heidelberg, 2005, vol. 40, pp. 433–440.
[9] Y. Maday and G. Turinici, "Parallel in time algorithms for quantum control: Parareal time discretization scheme," International Journal of Quantum Chemistry, vol. 93, no. 3, pp. 223–228, 2003.
[10] M. J. Gander and S. Güttel, "Paraexp: A parallel integrator for linear initial-value problems," SIAM Journal on Scientific Computing, vol. 35, no. 2, pp. C123–C142, 2013.
[11] M. Schreiber et al.