Twelve Ways To Fool The Masses When Giving Parallel-In-Time Results
Sebastian Goetschel, Michael Minion, Daniel Ruprecht, Robert Speck
Sebastian Götschel (1), Michael Minion (2), Daniel Ruprecht (1), and Robert Speck (3)

(1) Hamburg University of Technology, Institute of Mathematics, Chair Computational Mathematics, Am Schwarzenberg-Campus 3, D-21073 Hamburg, Germany, {ruprecht, sebastian.goetschel}@tuhh.de
(2) Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA, [email protected]
(3) Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre, 52425 Jülich, Germany, [email protected]
Abstract.
Getting good speedup—let alone high parallel efficiency—for parallel-in-time (PinT) integration examples can be frustratingly difficult. The high complexity and large number of parameters in PinT methods can easily (and unintentionally) lead to numerical experiments that overestimate the algorithm's performance. In the tradition of Bailey's article "Twelve ways to fool the masses when giving performance results on parallel computers", we discuss and demonstrate pitfalls to avoid when evaluating the performance of PinT methods. Despite being written in a light-hearted tone, this paper is intended to raise awareness that there are many ways to unintentionally fool yourself and others, and that by avoiding these fallacies more meaningful PinT performance results can be obtained.
1 Introduction

The trend towards extreme parallelism in high-performance computing requires novel numerical algorithms to translate the raw computing power of hardware into application performance [5]. Methods for the approximation of time-dependent partial differential equations, which are used in models in a very wide range of disciplines from engineering to physics, biology or even sociology, pose a particular challenge in this respect. Parallelization of algorithms discretizing the spatial dimension via a form of domain decomposition is quite natural and has been an active research topic for decades. Exploiting parallelization in the time direction is less intuitive, as time has a clear direction of information transport. Traditional algorithms for temporal integration employ a step-by-step procedure that is difficult to parallelize. In many applications, this sequential treatment of temporal integration has become a bottleneck in massively parallel simulations.

Parallel-in-time (PinT) methods, i.e., methods that offer at least some degree of concurrency, are advertised as a possible solution to this temporal bottleneck. The concept was pioneered by Nievergelt in 1964 [15], but has only really gained traction in the last two decades [7]. By now, the effectiveness of PinT has been well established for examples ranging from the linear heat equation in one dimension to more complex, highly diffusive problems in more than one dimension. More importantly, there is now ample evidence that different PinT methods can deliver measurable reductions in solution times on real-life HPC systems for a wide variety of problems. Ong and Schroder [16] and Gander [7] provide overviews of the literature, and a good resource for further reading is also given by the community website https://parallel-in-time.org/.

PinT methods differ from space-parallel algorithms or parallel methods for operations like the FFT in that they do not simply parallelize a serial algorithm to reduce its run time.¹ Instead, serial time-stepping is usually replaced with a computationally more costly and typically iterative procedure that is amenable to parallelization. Such a procedure will run much slower in serial, but can overtake serial time-stepping in speed if sufficiently many processors are employed. This makes a fair assessment of performance much harder, since there is no clear baseline to compare against. Together with the large number of parameters and inherent complexities in PinT methods and PDEs themselves, there are thus many, sometimes subtle, ways to fool oneself (and the masses) when assessing performance. We will demonstrate various ways to produce results that seem to demonstrate speedup but are essentially meaningless. The paper is written in a similar spirit as other "ways to fool the masses" papers, first introduced in [3], which inspired a series of similarly helpful papers in related areas [10,11,14,9,18,4,17]. One departure from the canon here is that we provide actual examples to demonstrate the Ways as we present them. Despite the light-hearted, sometimes even sarcastic tone of the discussion, the numerical examples are similar to experiments one could do for evaluating the performance of PinT methods.

¹ One exception are so-called "parallel-across-the-method" PinT methods, in the terminology of Gear [8], which can deliver smaller-scale parallelism.

Some of the Ways we present are specific to PinT, while others, although formulated in "PinT language", correspond to broader concepts from parallel computing. This illustrates another important fact about PinT: while the algorithms often dig deeply into the applied mathematics toolkit, their relevance is exclusively due to the architectural specifics of modern high-performance computing systems. This inherent cross-disciplinarity is another complicating factor when trying to do fair performance assessments. Lastly, we note that this paper was first presented in a shorter form as a conference talk at the 9th Parallel-in-Time Workshop held (virtually) in June 2020. Hence some of the Ways are more relevant to live presentations, although all should be considered in both written and live scenarios. In the next section we present the 12 Ways with a series of numerical examples, before concluding with some more serious comments in Section 3.
2 The Twelve Ways

If you really want to impress the masses with your PinT results, you will want to show as big a parallel speedup as possible; hence you will want to use a lot of processors in the time direction. If you are using, for example, Parareal [13], a theoretical model for speedup is given by the expression

\[ S_{\text{theory}} = \frac{N_P}{N_P\,\alpha + K\,(1+\alpha)}, \tag{1} \]

where N_P is the number of processors, α is the ratio of the cost of the coarse propagator G to that of the fine propagator F, and K is the total number of iterations needed for convergence. Hence, to get a large speedup that will impress the masses, we need to choose N_P to be large, α to be small, and hope that K is small as well. A common choice for Parareal is to have G be one step of some method and F be N_F steps of the same method, so that α = 1/N_F is small. But note that this already means that the total number of time steps corresponding to the serial method is now N_P N_F. Hence we want to choose an equation and problem parameters for which very many time steps can be employed, while still showing good speedup without raising any suspicions that the problem is too "easy". The first example suggests some Ways to pull off this perilous balancing act.

In this example, we use the following nonlinear advection-diffusion-reaction equation

\[ u_t = v\,u_x + \gamma\,u u_x + \nu\,u_{xx} + \beta\,u(a-u)(b-u), \]

where the constants v, γ, ν, β, a, and b determine the strength of each term. In order to squeeze in the massive number of time steps we need for good speedup, we choose a long time interval over which to integrate, t ∈ [0, T_F], with T_F = 30. The initial condition is given on the interval [0, π] by

\[ u(x,0) = 1 - d\,\bigl(1 - e^{-(x-\pi)^2/\sigma}\bigr). \]

If you are presenting this example in front of an audience, try to get all the equations with parameters on one slide and then move on before defining them.
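If you want to play along at home, the right-hand side is easy to set up. Below is a minimal NumPy sketch of a periodic pseudo-spectral evaluation; the function name and signature are ours, and, true to form, the coefficient values are left for the caller to define:

```python
import numpy as np

def adr_rhs(u, dx, v, gamma, nu, beta, a, b):
    """Evaluate u_t = v*u_x + gamma*u*u_x + nu*u_xx + beta*u*(a-u)*(b-u)
    pseudo-spectrally on a periodic grid with spacing dx."""
    k = 2 * np.pi * np.fft.fftfreq(u.size, d=dx)  # angular wave numbers
    u_hat = np.fft.fft(u)
    u_x = np.fft.ifft(1j * k * u_hat).real        # first derivative
    u_xx = np.fft.ifft(-(k ** 2) * u_hat).real    # second derivative
    return v * u_x + gamma * u * u_x + nu * u_xx + beta * u * (a - u) * (b - u)
```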
Way 1.
Choose a seemingly complicated problem with lots of parameters which you define later (or not at all).
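Eq. (1) is simple enough that there is no excuse for getting the theoretical numbers wrong. A short script (function name ours) reproduces the speedup figures quoted throughout this section:

```python
def parareal_speedup(n_proc, alpha, k_iter):
    """Theoretical Parareal speedup, Eq. (1)."""
    return n_proc / (n_proc * alpha + k_iter * (1 + alpha))

# N_P = 200 processors and alpha = 1/64, as in the first test below:
print(parareal_speedup(200, 1 / 64, 3))    # K = 3  -> ~32.4
print(parareal_speedup(200, 1 / 64, 10))   # K = 10 -> ~15.05
print(parareal_speedup(200, 1 / 64, 5))    # K = 5  -> ~24.38
```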
For the first numerical test, we choose N_P = 200 processors, and use a fourth-order IMEX Runge-Kutta method and a pseudo-spectral discretization in space using 128 grid points, where the linear advection and diffusion terms are treated implicitly. We use one time step for G and N_F = 64 steps for F. Since the method is spectrally accurate in space, it gives us cover to use a lot of time steps (more on that later). We set the stopping criterion for Parareal to be when the increment between iterations drops below a tight tolerance, and set v = −0.…, γ = 0.…, ν = 0.…, β = −…, a = 1, b = 0, and d = 0.55 (see also Appendix 1). For these values, Parareal converges on the entire time interval in 3 iterations. The theoretical speedup given by Eq. (1) is 32.4. Not bad!

If we explore no further, we might have fooled the masses. How did we manage? Consider a plot of the initial condition and the solution at the final time for this problem, shown in Fig. 1 as the blue and orange lines, respectively. The lesson here is
Way 2.
Quietly use an initial condition and/or problem parameters for which the solution tends to a steady state. But do not show the actual solution.
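A one-line defense against this Way is to check whether the computed final state is in fact (near) steady; a sketch, reusing the adr_rhs helper from above:

```python
import numpy as np

def looks_steady(u_final, rhs, tol=1e-8):
    """True if u_t is numerically zero at the final state, i.e. the run
    has quietly converged to a steady state. `rhs` is the spatial
    right-hand side with coefficients already bound, e.g. via
    functools.partial(adr_rhs, dx=dx, v=v, ...)."""
    return np.max(np.abs(rhs(u_final))) < tol
```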
Fig. 1: Initial condition and solutions at t = 30 for the advection-diffusion-reaction problem, demonstrating the significant effect that parameter selection can have on the dynamics and subsequent PinT speedup discussed in Ways 2–4.

If we instead choose b = 0.5, the number of Parareal iterations needed for convergence jumps to K = 10, for a less impressive theoretical speedup of 15.05. In this case the solution quickly evolves not to a constant state, but to a steady bump moving at constant speed (the green line in Fig. 1). This raises another important point to fool the masses:
Way 3.
Do not show the sensitivity of your results to problem parameter changes. Find the best one and let the audience think the behavior is generic.
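The antidote is a parameter sweep. Assuming a hypothetical driver run_parareal that returns the iteration count K for given problem parameters, a table like the one produced below (reusing parareal_speedup from above) makes the sensitivity obvious:

```python
# `run_parareal` is a placeholder for whatever driver produced the runs
# above; it is assumed to return the number of iterations K to converge.
for b in [0.0, 0.1, 0.25, 0.5]:
    k_iter = run_parareal(b=b)
    s = parareal_speedup(200, 1 / 64, k_iter)
    print(f"b = {b:4.2f}: K = {k_iter:2d}, theoretical speedup = {s:5.2f}")
```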
Sometimes you might be faced with a situation like the second case above and not know how to get a better speedup. One suggestion is to add more diffusion. Using the same parameters except increasing the diffusion coefficient to ν = 0.… reduces the number of iterations to K = 5, with a theoretical speedup of 24.38. The solution of this third example is shown by the red line in Fig. 1. If you can't add diffusion directly, using a diffusive discretization for advection, like first-order upwind finite differences, can sometimes do the trick while avoiding the need to explicitly admit to the audience that you needed to increase the amount of "diffusion".
Way 4.
If you are not completely thrilled about the speedup because the number of iterations K is too high, try adding more diffusion. You might have to experiment a little to find just the right amount.
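For reference, the structure of the Parareal iteration being tuned throughout these examples fits in a few lines. The sketch below emulates it serially (a real implementation distributes the fine propagations over the processors); coarse and fine are callables advancing a state across one time slice, and all names are ours:

```python
import numpy as np

def parareal(u0, n_proc, coarse, fine, k_max, tol):
    """Minimal serial emulation of the Parareal iteration (a sketch,
    not any particular production code)."""
    # Initial guess: serial coarse sweep over all time slices.
    u = [np.asarray(u0, dtype=float)]
    for n in range(n_proc):
        u.append(coarse(u[n]))
    for k in range(1, k_max + 1):
        u_old = [un.copy() for un in u]
        # Fine propagations use the previous iterate; this is the part
        # that would run concurrently on n_proc processors.
        f_old = [fine(u_old[n]) for n in range(n_proc)]
        for n in range(n_proc):  # sequential coarse correction
            u[n + 1] = coarse(u[n]) + f_old[n] - coarse(u_old[n])
        increment = max(np.max(np.abs(u[n + 1] - u_old[n + 1]))
                        for n in range(n_proc))
        if increment < tol:
            return u, k
    return u, k_max
```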
After carefully choosing your test problem, there are ample additional opportunities to boost the parallel performance of your numerical results. The next set of Ways considers the effect of spatial and temporal resolution. We consider the 1D nonlinear Schrödinger equation

\[ u_t = i\,\Delta u + 2i\,|u|^2 u \tag{2} \]

with periodic boundary conditions in [0, π] and the exact solution as given by Aktosun et al. [1], which we also use for the initial condition at time t = 0.

This is a notoriously hard problem for out-of-the-box PinT methods, but we are optimistic and give Parareal a try. We use a second-order IMEX Runge-Kutta method by Ascher et al. [2] with N_F = 1 024 fine steps and N_G = 32 coarse steps distributed over the 32 processors. In space, we again use a pseudo-spectral method with the linear part treated implicitly and N_x = 32 degrees-of-freedom. The estimated speedup can be found in Fig. 2a. Using K = 5 iterations, we obtain a solution about 6.24 times faster when running with 32 instead of 1 processor. All runs achieve the same accuracy with respect to the exact solution, and it looks like speedup in time can be easily achieved after all.

Yet, although the accuracy compared to the exact solution is the same for all runs, the temporal resolution is way too high for this problem, masking the effect of coarsening in time. The spatial error is dominating, and far fewer than the 32 × 32 = 1 024 fine time steps are actually needed to balance spatial and temporal error. Therefore, the coarse level already solves the problem quite well: speedup only comes from over-resolving in time.

If we instead choose N_F = 32 time steps on the fine level with the same coarsening factor of 32 (N_G = 1), we get no speedup at all – see the red curve/diamond markers in Fig. 2b. Using a less drastic coarsening factor of 4 leads to a maximum speedup of 1.78 with 32 processors (blue curve/square markers), which is underwhelming and frustrating and not what we would prefer to present in public. Lesson learned:
Way 5.
Make ∆t so small that the coarse integrator is already accurate. Never check if a larger ∆t might give you the same solution.

The astute reader may have noticed that we also used this trick, to a lesser extent, in the advection-diffusion-reaction example above.

Fig. 2: Estimated speedup for Parareal runs of the nonlinear Schrödinger example (2), demonstrating the effect of over-resolution in time (Way 5). (a) Deceivingly good speedup (N_F = 1024, α = 1/32). (b) Not so good speedup (N_F = 32, α = 1/32 and α = 1/4).
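Detecting this Way is simple: redo the fine run with fewer time steps and watch the error. A sketch, with solve and u_exact standing in for the fine propagator and the reference solution used above:

```python
import numpy as np

# If halving the number of time steps leaves the measured error
# unchanged, the error is dominated by space and the run is
# over-resolved in time; speedup from temporal coarsening is then
# largely illusory.
for n_steps in [1024, 512, 256, 128, 64, 32]:
    err = np.max(np.abs(solve(n_steps) - u_exact))  # placeholders
    print(f"N_F = {n_steps:5d}: error = {err:.2e}")
```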
The same game can be played in space, for which we switch from Parareal to PFASST [6] on the same example. We do not coarsen in time, but – impressing everybody with how resilient PFASST is to spatial coarsening – go from 512 degrees-of-freedom on the fine level to 32 on the coarse level. We are rewarded with the impressive speedup shown in Fig. 3a: using 8 processors, we are more than 5 times faster than the serial run. However, far fewer degrees-of-freedom are actually needed to resolve the solution in space, so using 512 degrees-of-freedom on the fine level heavily over-resolves the problem. Using only the required 32 degrees-of-freedom on the fine level with a similar coarsening factor of 4 only gives a speedup of 2.7, see Fig. 3b (red curve/diamond markers). While we could probably sneak this into an article, the parallel efficiency of 34% will hardly impress anybody outside of the PinT community.

It is worth noting that better resolution in space on the coarse level does not help (blue curve/square markers). This is because the coarse level no longer contributes anything to the convergence of the method. Turning it off completely would even increase the theoretical speedup to about 3.5. Hence, for maximum effect:
Way 6.
When coarsening in space, make ∆x on the fine level so small that even after coarsening, the coarse integrator is accurate. Avoid the temptation to explore a more reasonable resolution.

Fig. 3: Estimated speedup for PFASST runs of the nonlinear Schrödinger example (2), demonstrating the effect of over-resolution in space (Way 6). (a) Deceivingly good speedup (N_x^F = 512, α = 1/4). (b) Not so good speedup (N_x^F = 32, α = 1/4 and α = 1/2).
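Spatial over-resolution is just as easy to detect for spectral discretizations: look at how much energy the coarsening actually throws away. A self-contained sketch (the sample function is a smooth stand-in for a computed state, not the NLS solution):

```python
import numpy as np

def discarded_energy_fraction(u, n_coarse):
    """Fraction of spectral energy lost when restricting a periodic grid
    function to n_coarse Fourier modes; a tiny value means the fine grid
    is over-resolved and spatial coarsening is essentially free."""
    u_hat = np.fft.fft(u)
    modes = np.fft.fftfreq(u.size) * u.size        # integer mode numbers
    discarded = np.abs(modes) >= n_coarse // 2
    return np.sum(np.abs(u_hat[discarded]) ** 2) / np.sum(np.abs(u_hat) ** 2)

x = np.linspace(0, 2 * np.pi, 512, endpoint=False)
u = 1.0 / (2.0 + np.cos(x))                        # smooth stand-in state
print(discarded_energy_fraction(u, 32))            # ~1e-18: 512 points over-resolve u
```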
If the audience catches on about your ∆t/∆x over-resolution issues, there is a more subtle way to over-resolve and fool the masses. Since methods like Parareal and PFASST are iterative, one must decide when to stop iterating – use this to your advantage! The standard approach is to check the increment between two iterations or some sort of residual (if you can, use the latter: it sounds fancier and people will ask fewer questions). In the runs shown above, Parareal is stopped when the difference between two iterates is below a very tight threshold and PFASST is stopped when the residual of the local collocation problems is below an even tighter one.

These are good choices, as they give you good speedup: for the PFASST example, a considerably looser threshold would have been sufficient to reach the accuracy of the serial method. While this leads to fewer PFASST iterations (good!), unfortunately it also makes the serial SDC baseline much faster (bad!). Therefore, with the higher tolerance, speedup looks much less attractive, even in the over-resolved case, see Fig. 4a. Similarly, when using more reasonable tolerances, the speedup of the well-resolved examples decreases, as shown in Fig. 4b. This leads to our next Way, which has a long and proud tradition, and for which we can therefore quote Pakin [17] directly:
Way 7.
"Hence, to demonstrate good [...] performance, always run far more iterations than are typical, necessary, practical, or even meaningful for real-world usage, numerics be damned!"

Fig. 4: Estimated speedup for PFASST runs of the nonlinear Schrödinger example (2) with different resolutions in space, demonstrating how using a sensible iteration tolerance can reduce speedup (Way 7). (a) Now also not so good speedup (N_x^F = 512, α = 1/4). (b) Still not so good speedup (N_x^F = 32, α = 1/4 and α = 1/2).
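A defensible stopping rule ties the iteration tolerance to the estimated discretization error instead of an impressively tiny fixed number. A generic sketch (step, residual, and the error estimate are placeholders):

```python
def iterate_sensibly(step, residual, est_disc_error, k_max=50, safety=0.1):
    """Iterate until the residual is safely below the estimated
    discretization error; iterating further burns core hours without
    improving the accuracy that is actually delivered."""
    tol = safety * est_disc_error
    for k in range(1, k_max + 1):
        step()
        if residual() < tol:
            return k
    return k_max
```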
Way 8.
Not only use too many outer iterations, but try to maximize the amount of work done by iterative spatial solvers (if you have one, and you always should).
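An honest report therefore counts the inner work too. With SciPy's Krylov solvers this takes one extra line via the callback argument (shown here for CG on a 1D Laplacian; note the tolerance keyword is rtol in recent SciPy versions and tol in older ones):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 64
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

count = {"inner": 0}

def tally(xk):                      # called once per CG iteration
    count["inner"] += 1

x, info = cg(A, b, rtol=1e-12, callback=tally)
print(count["inner"], "inner CG iterations")  # report this, not just outer sweeps
```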
Note that for all the examples presented so far, we did not report any actual speedups measured on parallel computers. Parallel programming is tedious and difficult, as everybody understands, and what do we have performance models for, anyway? It is easier to just plug your parameters into a theoretical model. Realizing this on an actual system can be rightfully considered Somebody Else's Problem (SEP) or a task for your dear future self. But for completeness, the next example will address this directly.
Because solving PDEs only once can bore an audience, we will now talk about optimal control of the heat equation, the "hello world" example in optimization with time-dependent PDEs. This problem has the additional advantage that even more parameters are available to tune. Our specific problem is as follows. Given some desired state u_d on a space-time domain Ω × (0, T), Ω ⊂ R^d, we aim to find a control c to minimize the objective functional

\[ J(u,c) = \frac{1}{2}\int_0^T \|u - u_d\|^2_{L^2(\Omega)}\,\mathrm{d}t + \lambda \int_0^T \|c\|^2_{L^2(\Omega)}\,\mathrm{d}t \]

subject to

\[ u_t - \nabla^2 u = c + f(u) \]

with periodic boundary conditions (allowing us to use the FFT to evaluate the Laplacian and perform the implicit linear solves) and, for now, f(u) ≡ 0, so that the state equation is linear. Computing the gradient of J requires solving for the state u and the adjoint p:

\[ u_t - \nabla^2 u = c + f(u), \qquad u(\cdot, 0) = 0, \]
\[ -p_t - \nabla^2 p - f'(u)\,p = u - u_d, \qquad p(\cdot, T) = 0. \]

To parallelize in time we use, for illustration, the most simple approach: given a control c, the state equation is solved parallel-in-time for u, followed by solving the adjoint equation parallel-in-time for p with PFASST using N_P = 20 processors. For discretization, we use 20 time steps and three levels with 2/3/5 Lobatto IIIA nodes in time as well as 16/32/64 degrees of freedom in space. As a sequential reference we use MLSDC on the same discretization. We let PFASST/MLSDC iterate until the residual is below a moderate threshold instead of iterating to high precision, so we can openly boast how we avoid Way 7. Starting the optimization with c ≡ 0 (and thus u ≡ 0), we estimate the speedup from iteration counts as

\[ S = \frac{\text{total MLSDC iterations, state + adjoint}}{\text{total PFASST iterations, state + adjoint, on the last processor}}. \]

We get S ≈ 7, for a nice parallel efficiency of 35%. Before we publish this, we might consider actual timings from a real computer. Unfortunately, using wall clock times instead of iterations (the serial run takes about 44 s) gives S ≈ 2.3, and thus only roughly a third of the theoretical speedup. To avoid this embarrassment:

Way 9.
Only show theoretical/projected speedup numbers (but mention this only in passing or not at all). If you include the cost of communication in the theoretical model, assume it is small enough not to affect your speedup.
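The remedy is mundane: time the actual runs. A minimal sketch (run_serial and run_parallel are placeholders for the real drivers; on a real machine, time the full job, not just the part that scales well):

```python
import time

def timed(run):
    """Wall-clock a solver run."""
    t0 = time.perf_counter()
    run()
    return time.perf_counter() - t0

t_serial = timed(run_serial)        # placeholder drivers
t_parallel = timed(run_parallel)
print("measured speedup:", t_serial / t_parallel)
```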
Why is the theoretical model poor here? One cause is the overhead of the optimization—after all, there is the evaluation of the objective functional, and the construction of the gradient. Ignoring parts of your code to report better results is another proud tradition of parallel computing, see Bailey's paper [3]. However, most of the tasks listed do trivially parallelize in time. The real problem is that communication on an actual HPC system is aggravatingly not really instantaneous.

Fig. 5: Wall clock times of the different algorithmic steps for the linear heat equation example with T = 1. Left: total times. Right: times per level (1 is the coarsest level, 3 the finest). Note the "receive" times are not negligible, as discussed in Way 9.

Looking at detailed timings for PFASST, Fig. 5 shows that the issue truly is in communication costs, which clearly cannot be neglected. In fact, more time is spent on blocking coarse grid communication than on fine sweeps. Note also that, due to the coupled forward-backward solves, each processor requires similar computation and communication times. The performance model

\[ S = \frac{N_P}{N_P\,\frac{\alpha}{K_S} + \frac{K_P}{K_S}\,(1 + \alpha + \beta)}, \]

with K_S and K_P the serial and parallel iteration counts, accounts for overheads in the β term. Matching the measured speedups requires setting β = 3, i.e., three times the cost of one sweep on the fine level! This is neither small nor negligible by any measure.

Technically, parallel speedup should be defined as the ratio of the run time of your parallel code to the run time required by the best available serial method. But who has the time or energy for such a careful comparison? Instead, it is convenient to choose a baseline that yields as much speedup as possible. In the example above, MLSDC was used as a baseline since it is essentially the sequential version of PFASST, which allows for a straightforward comparison and the use of a theoretical speedup model. However, MLSDC might not be the fastest serial method to solve the state and adjoint equations to some prescribed tolerance. For illustration, we consider solving an optimal control problem for a nonlinear heat equation with a nonlinear reaction term f(u). Here, a simple low-order serial time stepper needs considerably less wall clock time, clearly beating MLSDC (about 169 s) despite using significantly more time steps. A fourth-order additive Runge-Kutta (ARK-4) method required about 183 s, as the non-symmetric stage values slow down the forward-backward solves due to the required dense output. With PFASST on 32 CPUs requiring 32 s, the speedup of about 5.3 over MLSDC reduces considerably when measured against the fastest serial alternative.

Way 10.
If you report speedup based on actual timings, compare your code to the method run on one processor and never against a different and possibly more efficient serial method.
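Conversely, measured timings can be used to calibrate the extended model above: solving it for β quantifies how much overhead a purely theoretical number quietly ignored. A sketch of that fit:

```python
def beta_from_measurement(n_proc, alpha, k_serial, k_parallel, s_measured):
    """Invert S = N_P / (N_P*alpha/K_S + (K_P/K_S)*(1+alpha+beta))
    for the overhead term beta, given a measured speedup."""
    return (n_proc / s_measured - n_proc * alpha / k_serial) \
        * k_serial / k_parallel - (1 + alpha)
```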
A low-order temporal method is a convenient choice for PinT methods because it is easier to implement and allows one to take many time steps without falling prey to Way 5, especially when you want to show how the speedup increases as you take ever more time steps for a problem on a fixed time interval. After all, it is the parallel scaling that is exhilarating, not necessarily how quickly one can compute a solution to a given accuracy.

For this example we will again use Parareal, applied to the Kuramoto-Sivashinsky equation. The K-S equation is a good choice to impress an audience because it gives rise to chaotic temporal dynamics (avoiding Way 2):

\[ u_t = -u u_x - u_{xx} - u_{xxxx}, \]

which we solve with a first-order integrator on the spatial interval x ∈ [0, π] and a fixed temporal interval, varying N_F and hence the total number of time steps. The theoretical speedups (ignoring Way 9) are displayed in the left panel of Fig. 6. One can see that the Parareal method provides speedup at all temporal resolutions, up to a maximum of about 5.85 at the finest resolution (where α is the smallest). So we have achieved meaningful speedup with a respectable efficiency for a problem with complex dynamics. Best to stop the presentation here.

If we are a little more ambitious, we might replace our first-order integrator with the 4th-order exponential Runge-Kutta (ERK) method from [12]. Now we need to be more careful about Way 5: the fourth-order method reaches the same accuracy with far fewer time steps, and, as the right panel of Fig. 6 shows, the serial fourth-order integrator is faster than the parallel first-order method at any given accuracy. Hence:
Way 11.
It is best to show speedup for first-order time integrators, since they are a bit easier to inflate. If you want to show speedup for higher-order methods as well, make it impossible to compare cost versus accuracy between first-order and higher-order methods.
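The underlying effect is elementary and easy to reproduce on a toy problem. The sketch below (our example, not the paper's K-S setup) compares forward Euler with classical RK4 on u' = −u, u(0) = 1, over [0, 1]: the first-order method needs orders of magnitude more steps for the same accuracy, which is exactly what makes its "parallel speedup" easy to inflate:

```python
import math

def euler(n):            # first order
    u, dt = 1.0, 1.0 / n
    for _ in range(n):
        u += dt * (-u)
    return u

def rk4(n):              # classical fourth order
    u, dt = 1.0, 1.0 / n
    for _ in range(n):
        k1 = -u
        k2 = -(u + 0.5 * dt * k1)
        k3 = -(u + 0.5 * dt * k2)
        k4 = -(u + dt * k3)
        u += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return u

exact = math.exp(-1.0)
for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}: Euler err = {abs(euler(n) - exact):.1e}, "
          f"RK4 err = {abs(rk4(n) - exact):.1e}")
```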
The careful reader may have noticed that in all the examples above, a single PinT method is used for each Way. This brings us finally to

Fig. 6: Comparison of serial and Parareal execution times for the K-S example using first- and fourth-order ERK integrators. Note that the serial fourth-order integrator is always faster for a given accuracy than the parallel first-order method (Way 11).
Way 12.
Never compare your own PinT method to a different PinT method.
3 Conclusions

The problem, as we have seen, is that assessing performance for a single PinT method is already not straightforward. Comparing the performance of two or more different methods makes matters even more difficult. Although it has been often discussed within the PinT community, efforts to establish a set of benchmark test examples have, to date, made little headway. The performance of methods like PFASST and Parareal considered here is highly sensitive to the type of equation being solved, the type of spatial discretization being used, the accuracy desired, and the choice of problem and method parameters. In this study we purposely chose examples that lead to inflated reported speedups, and doing this required us to use our understanding of the methods and the equations chosen. Conversely, in most instances, a simple change in the experiment leads to much worse reported speedups. Different PinT approaches have strengths and weaknesses for different benchmark scenarios, hence establishing a set of benchmarks that the community would find fair is a very non-trivial problem.

Roughly, the Ways we present can be grouped into three categories: "choose your problem" (Ways 1–4), "over-resolve" (Ways 5–8) and "choose your performance measure" (Ways 9–11). This classification is not perfect, as some of the Ways overlap. Some of the dubious tricks presented here are intentionally obvious to detect, while others are more subtle. As in the original "twelve ways" article, and those it inspired, the examples are meant to be light-hearted. However, many of the Ways have been (unintentionally) used when reporting numerical results, and the authors are not without guilt in this respect. Admitting that, we hope this article will be read the way we intended: as a demonstration of some of the many pitfalls one faces when assessing PinT performance and a reminder that considerable care is required to obtain truly meaningful results.
Acknowledgements
The work of Minion was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract number DE-AC02-05CH11231. Part of the simulations were performed using resources of the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
References
1. Aktosun, T., Demontis, F., van der Mee, C.: Exact solutions to the focusing nonlinear Schrödinger equation. Inverse Problems 23(5), 2171–2195 (2007)
2. Ascher, U.M., Ruuth, S.J., Spiteri, R.J.: Implicit-explicit Runge-Kutta methods for time-dependent partial differential equations. Appl. Numer. Math. 25(2-3), 151–167 (1997)
3. Bailey, D.H.: Twelve ways to fool the masses when giving performance results on parallel computers. Supercomputing Review 4(8) (1991)
4. Chawner, J.: Revisiting "twelve ways to fool the masses when describing mesh generation performance". https://blog.pointwise.com/2011/05/23/revisiting-%e2%80%9ctwelve-ways-to-fool-the-masses-when-describing-mesh-generation-performance%e2%80%9d/ (2011). Accessed: 2020-4-28
5. Dongarra, J., et al.: Applied Mathematics Research for Exascale Computing. Tech. Rep. LLNL-TR-651000, Lawrence Livermore National Laboratory (2014). URL http://science.energy.gov/~/media/ascr/pdf/research/am/docs/EMWGreport.pdf
6. Emmett, M., Minion, M.L.: Toward an efficient parallel in time method for partial differential equations. Communications in Applied Mathematics and Computational Science 7(1), 105–132 (2012)
7. Gander, M.J.: 50 years of time parallel time integration. In: Multiple Shooting and Time Domain Decomposition Methods. Springer (2015)
8. Gear, C.W.: Parallel methods for ordinary differential equations. CALCOLO 25(1-2), 1–20 (1988)
9. Globus, A., Raible, E.: Fourteen ways to say nothing with scientific visualization. Computer 27(7), 86–88 (1994)
10. Gustafson, J.L.: Twelve ways to fool the masses when giving performance results on traditional vector computers. http://www.johngustafson.net/fun/fool.html (1991). Accessed: 2020-4-28
11. Hoefler, T.: Twelve ways to fool the masses when reporting performance of deep learning workloads! (not to be taken too seriously) (2018). arXiv:1802.09941
12. Krogstad, S.: Generalized integrating factor methods for stiff PDEs. J. Comput. Phys. 203(1), 72–88 (2005)
13. Lions, J.L., Maday, Y., Turinici, G.: A "parareal" in time discretization of PDE's. Comptes Rendus de l'Académie des Sciences - Series I - Mathematics 332, 661–668 (2001)
14. Minhas, F., Asif, A., Ben-Hur, A.: Ten ways to fool the masses with machine learning (2019). arXiv:1901.01686
15. Nievergelt, J.: Parallel methods for integrating ordinary differential equations. Commun. ACM 7(12), 731–733 (1964)
16. Ong, B.W., Schroder, J.B.: Applications of time parallelization. Computing and Visualization in Science (2019)
17. Pakin, S.: Ten ways to fool the masses when giving performance results on GPUs. HPCwire, December (2011)
18. Tautges, T.J., White, D.R., Leland, R.W.: Twelve ways to fool the masses when describing mesh generation performance. IMR/PINRO Joint Rep. Ser. pp. 181–190 (2004)

Appendix 1
The value of σ …