Computational performance of a parallelized high-order spectral and mortar element toolbox
Roland Bouffanais, Vincent Keller, Ralf Gruber, Michel O. Deville
Laboratory of Computational Engineering, École Polytechnique Fédérale de Lausanne, STI – ISE – LIN, Station 9, CH–1015 Lausanne, Switzerland
Abstract
In this paper, a comprehensive performance review of an MPI-based high-order spectral and mortar element method C++ toolbox is presented. The focus is put on the performance evaluation of several aspects, with a particular emphasis on the parallel efficiency. The performance evaluation is analyzed and compared to predictions given by a heuristic model, the so-called Γ model. A tailor-made CFD computation benchmark case is introduced and used to carry out this review, stressing the particular interest for commodity clusters. Conclusions are drawn from this extensive series of analyses and modeling, leading to specific recommendations concerning such toolbox development and parallel implementation.
Key words: Spectral and mortar element method, C++ toolbox, MPI, scalability, commodity clusters.

∗ Corresponding author.
Email addresses: [email protected] (Roland Bouffanais), [email protected] (Vincent Keller), [email protected] (Ralf Gruber), [email protected] (Michel O. Deville).
Supported by a Swiss National Science Foundation Grant No. 200020–101707.
1 Introduction

This paper provides a detailed performance evaluation of the C++ toolbox named Speculoos (for Spectral Unstructured Elements Object-Oriented System). Speculoos is a spectral and mortar element analysis toolbox for the numerical solution of partial differential equations and more particularly for solving incompressible unsteady fluid flow problems [1]. The main architecture choices and the parallel implementation were elaborated and implemented by Van Kemenade and Dubois-Pèlerin [2, 3]. Subsequently, Speculoos' C++ code has been growing with additional layers enabling it to tackle and simulate more specific and arduous CFD problems: viscoelastic flows by Fiétier and Deville [4–6], fluid-structure interaction problems by Bodard and Deville [7], large-eddy simulations of confined turbulent flows by Bouffanais et al. [8, 9] and free-surface flows by Bouffanais and Deville [10].

It is well known that spectral element methods lend themselves easily to parallelization, as they intrinsically provide a natural way of decomposing a geometrical domain; see [11] and Chap. 8 of [12].

The numerous references given above and the ongoing simulations based on Speculoos highlight the versatility and flexibility achieved by this C++ toolbox. Nevertheless, ten years have passed between the first version of Speculoos' code and now, and tremendous changes have occurred at both hardware and software levels: fast dual DDR memory, RISC architectures, 64-bit memory addressing, compiler improvements, library optimization, library parallelization, increase in interconnecting switch performance, etc.

Back in 1995, Speculoos was commonly compiled and run on HP and Silicon Graphics workstations and also on the Swiss-Tx machine, a commodity-technology based computer with enhanced interconnect links between processors [13]. Currently most of the simulations based on Speculoos are compiled and run on commodity clusters. The workstation world experienced a technical revolution with the advent of 'cheap' RISC processors, leading to the ongoing impressive development of parallel architectures such as massively parallel clusters and commodity clusters. As a matter of fact, Speculoos benefited from this fast technical evolution as it was originally developed to run in single-program, multiple-data (SPMD) mode on a distributed-memory computer. The performance evaluations presented here demonstrate the correlation between the good performances measured with Speculoos and the adequacy of this code structure to the current hardware and software evolutions in parallel computing.

This paper is organized as follows. In Section 2 we introduce the numerical context in which Speculoos was initiated, the software aspects related to its implementation and the variable-size benchmark test case used for the performance evaluation presented in the subsequent sections. Section 3 is devoted to the parallel performance analysis achieved on RISC-based commodity clusters. Finally, in Section 4 we draw some conclusions on the results obtained.

2 Speculoos numerical and software context
In this section, we gather the necessary background information regarding the numerical method—namely the spectral and mortar element method—the object-oriented concept and the parallel paradigm, which are the essential roots embodied in Speculoos. The final Section 2.3 introduces the simulation used throughout this study as the benchmark evaluation test case.
The spectral element method (SEM) is a high-order spatial discretization method for the approximate Galerkin solution of partial differential equations expressed in weak form. The SEM relies on expansions on Lagrangian interpolant bases used in conjunction with particular Gauss–Lobatto and Gauss–Lobatto–Jacobi quadrature rules [14, 15]. As with high-order finite element techniques, the SEM can deal with arbitrarily complex geometries: h-refinement is achieved by increasing the number of spectral elements and p-refinement by increasing the Lagrangian polynomial order within the elements. From a high-order precision viewpoint, the SEM is comparable to spectral methods, as an exponential rate of convergence is observed when smooth solutions to regular problems are sought.

C^0-continuity across element interfaces requires exactly the same interpolation in each and every spectral element sharing a common interface. The caveat associated with such conforming configurations is the over-refined meshing generated in low-gradient zones. The adopted remedy to this nuisance is a technique developed by Bernardi et al. [16], referred to as the mortar element method. Mortars can be viewed as variational patches of the discontinuous field along the element interfaces. They relax the C^0-continuity condition while preserving the exponential rate of convergence, and thus allow polynomial nonconformities along element interfaces.

The complexity and the size of the large three-dimensional problems tackled by numericists in their simulations require top computational performance, accessible through highly parallelized algorithms running on parallel architectures. As mentioned in [2], the implementation of concurrency in Speculoos was based on the view that concurrency is a painful implementation constraint going against the high-level object-oriented programming concepts. As a consequence, Speculoos' parallelization was kept very low-level. In most higher-level operations, parallelism does not even show up.

From a computational viewpoint, systems discretized with a high-order spectral element method rely mainly on optimized tensor-product operations taking place at the spectral element level. The natural data distribution for high-order spectral element methods is based on an elemental decomposition in which the spectral elements are distributed to the processors available for the run. It is worth noting that for very large computations, the number of spectral elements can become relatively large compared to the number of processors available for the computation. The design of Speculoos makes it possible to have several elements sitting on a single processor. Nodal values on subdomain interface boundaries are stored redundantly on each processor owning one of the spectral elements having this interface in common. Moreover, this approach is consistent with the element-based storage scheme, which minimizes the inter-processor communications. Inter-processor communication is performed by MPI instructions [17].
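To make the element-based storage and the low-level MPI parallelization more concrete, the following minimal sketch (hypothetical code, not Speculoos' actual classes or routines) distributes the spectral elements in blocks over the MPI ranks, performs a purely element-local operation, and issues the single collective reduction an iterative solver would need; the redundant exchange and weighting of interface values is omitted for brevity.

```cpp
// Hypothetical sketch: element-based data layout with a block distribution of
// the spectral elements over the MPI ranks. Interface values shared by elements
// sitting on different ranks would be stored redundantly on each of them.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nElemGlobal = 512;        // e.g. E_x*E_y*E_z = 8*8*8 as in the benchmark
    const int nPerElem    = 9 * 9 * 9;  // (N+1)^3 GLL points per element for N = 8

    // Contiguous block of elements owned by this rank.
    const int e0 = rank * nElemGlobal / size;
    const int e1 = (rank + 1) * nElemGlobal / size;
    std::vector<std::vector<double>> u(e1 - e0, std::vector<double>(nPerElem, 1.0));

    // Element-local work (the tensor-product operator evaluations would live here);
    // no communication is required at this stage.
    for (auto& elem : u)
        for (double& v : elem) v *= 2.0;

    // The only global operation an iterative solver needs: a reduced dot product.
    // (A real code also weights the redundant interface copies; omitted here.)
    double local = 0.0, global = 0.0;
    for (const auto& elem : u)
        for (double v : elem) local += v * v;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("elements on rank 0: %d, ||u||^2 = %g\n", e1 - e0, global);
    MPI_Finalize();
    return 0;
}
```

In such a layout the only data that ever leave a processor are the interface values and the scalars of the global reductions, which is why the element-based storage scheme keeps inter-processor communication small.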
As a common practice in performance evaluation, it is important to build a tailor-made benchmark based on a numerical simulation corresponding to a concrete situation. Before proceeding to the first step of our performance evaluation, we have short-listed some key parameters that have the most significant impact on the performance of our toolbox: single-processor optimization on the three computer architectures described in Table 1, single-processor profiling analysis, parallel implementation and scalability (including speedup, efficiency, communication times), and parallel implementation and processor dispatching.

A test case has been developed for this benchmark and for the parallel benchmarking, see Sec. 3. This test case belongs to the field of CFD and consists in solving the Navier–Stokes equations for a viscous Newtonian incompressible fluid. Based on the problem at hand, it is always physically rewarding to non-dimensionalize the governing Navier–Stokes equations, which take the following general form

\frac{\partial \mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\nabla p + \frac{1}{\mathrm{Re}}\Delta\mathbf{u} + \mathbf{f}, \qquad \forall (\mathbf{x},t)\in\Omega\times I, \quad (1)

\nabla\cdot\mathbf{u} = 0, \qquad \forall (\mathbf{x},t)\in\Omega\times I, \quad (2)

where u is the velocity field, p the reduced pressure (normalized by the constant fluid density), f the body force per unit mass and Re the Reynolds number expressed as

\mathrm{Re} = \frac{U L}{\nu}, \quad (3)

in terms of the characteristic length L, the characteristic velocity U and the constant kinematic viscosity ν. The system evolution is studied in the time interval I = [t_0, T]. From the physical viewpoint, Eqs. (1)–(2) derive from the conservation of momentum and the conservation of mass, respectively. For incompressible viscous fluids, the conservation of mass, also called the continuity equation, enforces a divergence-free velocity field as expressed by Eq. (2). For particular flows, the governing Navier–Stokes equations (1)–(2) are supplemented with appropriate boundary conditions for the fluid velocity u and/or for the local stress at the boundary. For time-dependent problems, a given divergence-free velocity field is required as initial condition in the internal fluid domain.

All our computations were carried out using two time integrators: the implicit backward-differentiation formula (BDF) of order 2 for the treatment of the Stokes operator, and an extrapolation scheme (EX) [18, 19] of the same order for the nonlinear convective term. One type of pressure decomposition mode, based on a fractional-step method using pressure correction, namely BP1-PC [20–22], is used.
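As an illustration, a semi-discrete sketch of this BDF2/EX2 combination applied to Eqs. (1)–(2) reads (the fractional-step splitting and the BP1-PC pressure correction actually used are not shown):

\frac{3\mathbf{u}^{n+1} - 4\mathbf{u}^{n} + \mathbf{u}^{n-1}}{2\Delta t} + 2\,(\mathbf{u}\cdot\nabla\mathbf{u})^{n} - (\mathbf{u}\cdot\nabla\mathbf{u})^{n-1} = -\nabla p^{n+1} + \frac{1}{\mathrm{Re}}\Delta\mathbf{u}^{n+1} + \mathbf{f}^{n+1}, \qquad \nabla\cdot\mathbf{u}^{n+1} = 0,

so the Stokes operator is treated implicitly while the convective term only requires data from the two previous time levels.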
Speculoos uses a Legendre SEM [12, 14, 15] for the spatial discretization of the Navier–Stokes equations. For the sake of simplicity the same polynomial order has been chosen in the different spatial directions (N_x = N_y = N_z = N). Moreover, to prevent any spurious oscillations in our Navier–Stokes computations, a staggered P_N – P_{N−2} interpolation method has been chosen for the velocity and pressure, respectively [12, 23]. As a consequence of this choice of a staggered grid, the inner-element grid for the x-, y- and z-components of the velocity field is a Gauss–Lobatto–Legendre grid made up of (N + 1)^3 quadrature (nodal) points, and the grid for the pressure is a Gauss–Legendre grid made up of [(N − 2) + 1]^3 quadrature (nodal) points, in each spectral element.

The test case corresponds to the fully three-dimensional simulation of the flow enclosed in a lid-driven cubical cavity at the Reynolds number of 12 000, placing us in the locally-turbulent regime. It corresponds to the case denoted under-resolved DNS (UDNS) in Bouffanais et al. [8, 9]. The reader is referred to Bouffanais et al. [8, 9] for full details on the numerical method and on the parameters used throughout the present paper.

3 Parallel implementation
In the sequel, we will assume that the reader is familiar with the basics of parameterization on a parallel machine. For a complete introduction to these notions we refer the reader to references [24, 25].

The speedup S of an application on a given parallel machine can be described as

S = \frac{\text{computing time on one processor}}{\text{CPU plus communication times on } P \text{ processors}} = \frac{T}{T_P + T_C}. \quad (4)

If we suppose that the computing effort strictly scales with P, then T = P\,T_P and the speedup can be written as

S = \frac{T}{T_P + T_C} = \frac{P\,T_P}{T_P + T_C} = \frac{P}{1 + \gamma_m/\gamma_a} = \frac{P}{1 + 1/\Gamma}, \quad (5)

where

\gamma_a = \frac{\text{number of operations [MFlop]}}{\text{amount of data to transfer [MWord]}} \quad (6)

is related to the application,

\gamma_m = \frac{\text{effective processor performance [MFlop/s]}}{\text{effective communication bandwidth per processor [MWord/s]}} \quad (7)

to the machine, and Γ = γ_a/γ_m. The reader is referred to [24] for full details on such a parameterization to tailor commodity clusters to applications. The efficiency E of a parallel machine is defined by

E = \frac{S}{P} = \frac{1}{1 + 1/\Gamma}. \quad (8)

Speculoos uses a small amount of main memory. Parallelization is made in order to reduce the high overall computing time. The numbers of elements and the polynomial degrees in the three space directions are denoted by E_x, E_y, E_z and N_x, N_y, N_z, respectively. The total number of independent variables per element is therefore n_v × (N_x + 1) × (N_y + 1) × (N_z + 1), where n_v is the number of vector components per Gauss–Lobatto–Legendre (GLL) quadrature point. In addition, there are E_x × E_y × E_z elements.
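As a quick numerical illustration of Eqs. (5)–(8), with hypothetical values of Γ rather than measured ones:

\Gamma = 3:\quad E = \frac{1}{1 + 1/3} = 0.75,\quad S = E\,P = 0.75\,P; \qquad \Gamma = 1:\quad E = 0.5,\quad S = 0.5\,P,

showing that a high parallel efficiency requires the application/machine pair to deliver a Γ well above one.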
3.2 Hardware and software used

To perform the Speculoos code benchmark, the machines presented in Table 1 have been used.

Name        Manufacturer   CPU type     Nodes   Cores   Interconnect
Gele        Cray           Opteron DC   16      32      SeaStar
Pleiades    Logics         Pentium 4    132     132     FE
Pleiades2   Dell           Xeon         120     120     GbE
Pleiades2+  Dell           Xeon 5150    99      396     GbE

Table 1
Characteristics of the machines used for the benchmark. DC = Dual-Core. FE = Fast Ethernet. GbE = Gigabit Ethernet.
As mentioned previously, the Speculoos code is written in C++, uses BLAS operations and implements the Message Passing Interface (MPI).

The PAPI (Performance API) library [26], available on the Cray XT3 machine, was used to measure the number O of operations (in GFlops) and the MFlops rate of Speculoos. The VAMOS service available on the three Pleiades clusters [27] maps the hardware-related data from the Ganglia monitoring tool [28] onto the application- and user-related data (from the cluster Resource Management System and Scheduler). We used the most aggressive optimization flag on all machines (-O3).
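For illustration, a flop count such as O can be obtained with the low-level PAPI interface along the following lines (a minimal hypothetical sketch; the actual instrumentation of Speculoos is not described in this paper):

```cpp
// Count floating-point operations around one section of code with PAPI.
#include <papi.h>
#include <cstdio>

int main() {
    int eventset = PAPI_NULL;
    long long fp_ops = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);      // floating-point operation counter

    PAPI_start(eventset);
    double s = 0.0;                             // dummy work standing in for one time-step
    for (int i = 1; i <= 1000000; ++i) s += 1.0 / i;
    PAPI_stop(eventset, &fp_ops);

    std::printf("floating-point operations: %lld (dummy result %g)\n", fp_ops, s);
    PAPI_shutdown();
    return 0;
}
```

Dividing such a count by the measured runtime gives the MFlops rates reported below.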
The first measurements are done on Pleiades2 with a fixed problem size, E_x = E_y = E_z = 8, N_x = N_y = N_z = 8, O ≈ 155 GFlops, and a number P of processing elements ranging from 1 to 32. The evolution of the runtime (for one time-step), the associated MFlops rate, and the efficiency E are given in Table 2. The speedup S as a function of the number of processors is plotted in Fig. 1. One observes that with 8 processors a speedup of 7 can be reached, and a speedup of 30 with 45 processors.

Table 2
GFlops rate, runtime for one time-step and efficiency E as functions of the number of processors. E: Efficiency.

Fig. 1. Speedup of the Speculoos code on the Pleiades2 (Xeon CPU) cluster.
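Inverting Eq. (8), these measured speedups translate into rough estimates of Γ on Pleiades2 (a back-of-the-envelope check, not values taken from the tables):

P = 8,\ S = 7:\quad E = \frac{7}{8} = 0.875,\quad \Gamma = \frac{E}{1-E} = 7; \qquad P = 45,\ S = 30:\quad E \approx 0.67,\quad \Gamma \approx 2,

consistent with the expected degradation of Γ as more processors share a fixed-size problem.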
In this section, the number of processors on a Cray XT3 is kept fixed at the value P = 4. Then, we modify the polynomial degree and measure the MFlops rate. The MFlops rate for each processing element is shown in Table 3. It increases as the problem size increases. As expected, one deduces that there is a limit on the number of processors that should be used in parallel.

N_x = N_y = N_z    MFlops    Walltime
10                 4150      146.978
11                 4390      257.36

Table 3
Evolution of the MFlops rate and of the runtime for one time-step on 4 Cray XT3 dual-CPU nodes as a function of the polynomial degree.

Scaling P with problem size

A more common way to measure scalability, and to overcome Amdahl's law, is to fix the problem size per processor and to increase the number of processors with the overall problem size. In other words, one tries to fix Γ, which measures the ratio between processor needs and communication needs. We show in Table 4 the scalability of Speculoos on the Pleiades2+ cluster. It was compiled using MPICH2 and the icc C++ compiler version 9.1e.

Table 4
Scalability of Speculoos. Same polynomial degree, same number of elements on each computing node, on the Pleiades2+ (Woodcrest) cluster. (A): with 4 MPI threads per node. (B): with npernode = 2, two MPI threads per node. Columns: E_x–E_y–E_z, N_x–N_y–N_z, Nodes–Cores, Elem/Core, Walltime.
Table 4 (A) shows the results obtained when all 4 cores of each node are active.
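In terms of the parameters of Eqs. (5)–(8), the rationale of this fixed-size-per-processor scaling can be summarized as follows (a model argument, not a measurement):

\gamma_a = \frac{\text{operations per processor}}{\text{words transferred per processor}} \simeq \mathrm{const.} \;\Longrightarrow\; \Gamma = \frac{\gamma_a}{\gamma_m} \simeq \mathrm{const.} \;\Longrightarrow\; E = \frac{1}{1+1/\Gamma} \simeq \mathrm{const.},

so the walltime per time-step should remain roughly constant as nodes are added, which is precisely what Table 4 is meant to probe.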
Γ model

CPU usage has been monitored by the VAMOS monitoring service [27] available on the Pleiades clusters. It provides information on the application's behavior: the higher the CPU usage, the better the machine fits the application. To perform that monitoring we took the same problem size (E_x = E_y = E_z = 8 and N_x = N_y = N_z = 8) and the same computing duration (10 hours = 36'000 seconds). The application is run for 10 hours and the number of time-steps performed during this time is counted. With such a methodology, we ensure that each sample can perform a maximum of calculations in a given amount of time. It is equivalent to setting the same number of iterations for each sample and measuring the walltime.

Figure 2 shows the different behaviors of Speculoos on the three different Pleiades architectures. The Γ value—introduced in Eq. (5), which reflects the "fitness" of a given application on a given machine [24]—is also computed. Results are reported in Table 5.

Fig. 2. CPU usage of Speculoos on the different machines: Pleiades cluster (average CPU usage 51.05%), Pleiades2 cluster (79.24%) and Pleiades2+ cluster (61.6%).

Table 5
T [s], Γ, b [MB/s], W [words], T_P [s], T_C [s] and T_L [s] for the Pleiades, Pleiades2 and Pleiades2+ clusters.

Using the notations introduced earlier, T, T_P, T_C, and T_L denote the total walltime, the CPU time for P processing elements, the time to communicate, and the latency time per iteration step, respectively. Then,

T = T_P + T_C + T_L, \quad (9)

and the parameter Γ is easily expressed as

\Gamma = \frac{T_P}{T_C + T_L}. \quad (10)

It is possible to measure the total time T by means of an interpretation of the CPU usage plots (see Fig. 2). Indeed, the middleware Ganglia determines for every time interval of 20 seconds the average CPU usage (or efficiency E) for each processing element. This information has to be put into relation with the Speculoos application, which is done via the middleware VAMOS. In the plots of Fig. 2, all the values of E that lie between x and x + 0.01 are added up, where x is the percentile represented on the abscissae of the plots.
The efficiency E is related to Γ through

\Gamma = \frac{E}{1 - E}. \quad (11)

What can also be estimated are the network bandwidth of the GbE switch (between 90 and 100 MB/s per link), the network bandwidth of the Fast Ethernet switch (between 10 and 12 MB/s per link) and the latency (L = 60 µs for both networks). First, a consistency test of those quantities is performed. We assume that the Fast Ethernet switch has a fixed bandwidth of b = 12 MB/s and that the GbE switch has a bandwidth of αb, with α unknown. Another unknown is the number of words W that is sent per node to the other nodes, with T_C = W/b for the corresponding network. Based on the previous assumptions, the three Γ values for the three machines and the two networks can be expressed as

\Gamma_1 = \frac{T_{P,1}}{W/b + T_L}, \quad (12)

\Gamma_2 = \frac{T_{P,2}}{W/(\alpha b) + T_L}, \quad (13)

\Gamma_3 = \frac{T_{P,3}}{W/(\alpha b) + T_L}, \quad (14)

where the indices 1, 2 and 3 refer to the Pleiades, Pleiades2 and Pleiades2+ clusters, respectively. These constitute a set of three equations for three unknown variables, namely W, α, and T_L. Solving for these variables leads to T_L = 1, W = 180 MWords, and α = 8.43.
The resulting GbE bandwidth αb = 101 MB/s corresponds precisely to the one measured. This means that the model is well applicable.

To study whether Speculoos is dominated by inter-node communications, Figure 3 shows the result of two runs of the same problem size (E_x = E_y = E_z = 8 and N_x = N_y = N_z = 8) made on 8 and 4 Woodcrest nodes, respectively, during the same period of time (1 hour = 3'600 seconds), counting the number of iteration steps. The first sample was launched forcing 2 MPI threads on each node and the second with 4 MPI threads on each node.

Fig. 3. CPU usage on the 5100-series SMP nodes of the Pleiades2+ cluster. 16 processing elements were required: 8 nodes/2 cores with 2 MPI threads per node in the upper case, 4 nodes/4 cores with 4 MPI threads per node in the lower case.
Note that the CPU usage (system+user+nice) monitored by Ganglia is averaged over all the processing elements of a node. For instance, on a node with two processors of which only one is busy, Ganglia measures a CPU usage of 50% although that processor runs at 100%. In Figure 3, when 2 MPI threads are pinned per node, we get a CPU usage of 51.13% while 157 iteration loops are performed during one hour; when 4 MPI threads run on each node, we get a CPU usage of 87.25% while only 117 iteration loops are performed during the same hour. Rescaled to the two active cores, the real CPU usage for the sample with 2 MPI threads per node is thus above 100% (51.13% × 4/2 ≈ 102%), the two remaining cores being unused.
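The post-processing implied by this interpretation and by Eq. (11) is simple enough to sketch; the snippet below (hypothetical code, not part of VAMOS or Ganglia) averages the 20-second CPU-usage samples of a node, rescales them by the number of cores actually running MPI threads, and deduces Γ:

```cpp
// Estimate E and Gamma from Ganglia-style CPU-usage samples of one node.
#include <vector>
#include <numeric>
#include <cstdio>

int main() {
    std::vector<double> samples = {0.50, 0.52, 0.51, 0.53, 0.50};  // assumed readings in [0,1]
    const int coresPerNode = 4;   // e.g. a Woodcrest node of Pleiades2+
    const int mpiPerNode   = 2;   // MPI threads actually bound to the node

    double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
    double E = mean * coresPerNode / mpiPerNode;          // usage per active core
    if (E > 1.0) E = 1.0;                                 // clip: fully compute-bound
    double Gamma = (E < 1.0) ? E / (1.0 - E) : -1.0;      // -1 flags an unbounded Gamma

    std::printf("E = %.3f, Gamma = %.3f\n", E, Gamma);
    return 0;
}
```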
4 Conclusions

The extensive performance review presented in this paper for the high-order spectral and mortar element method C++ toolbox Speculoos has shown that good performances can be achieved even with relatively common internode communication systems and with readily available software and hardware resources—small commodity clusters with non-proprietary compilers installed on them.

We can conclude that the main implementation choices made a decade ago reveal their promise. Even though those choices could have been questionable ten years ago, they are now in line with the current trend in computer architecture developments with the generalization of commodity and massively parallel clusters. The parallel implementation of Speculoos based on MPI has proved to be efficient. Reasonable scalability and efficiency can be achieved on commodity clusters. The results support the original choices made in Speculoos' parallel implementation by keeping it at a very low level.

One of the goals of this study was to estimate whether Speculoos could run on a massively parallel computer architecture comprising thousands of computational units, specifically on the IBM Blue Gene machine at EPFL with 4'096 dual-processor units. The performance of one of its processors corresponds to approximately half of the performance of one processor on the Pleiades commodity cluster. Each Blue Gene node has 512 MB of main memory. A block with 4 × 4 × 4 spectral elements and N = 8 in each space direction takes 200 MB of main memory. In a first step, one block will run on one node. Later, Speculoos will be modified to accommodate one block per processor, i.e. two blocks per node. A 4'096-block Speculoos case would offer the opportunity to run very accurate simulations of turbulent flows with more than half a billion unknowns. Such a case would scale well on the IBM Blue Gene solution. In fact, the point-to-point operations per node do not change with the number of nodes. The Gigabit-Ethernet network can handle the corresponding communications well. The all-reduce operations scale logarithmically with the number of computational units. A special efficient Fat Tree network takes care of all multicast communications. As a consequence, large Speculoos cases will perfectly scale on EPFL's Blue Gene machine.

Acknowledgements
This research is being partially funded by a Swiss National Science Foundation Grant (No. 200020–101707) and by the Swiss National Supercomputing Center CSCS, whose support is gratefully acknowledged. The results were obtained on supercomputing facilities at the Swiss National Supercomputing Center CSCS and on the Pleiades clusters at EPFL–ISE.

References

[1] V. Van Kemenade, Incompressible fluid flow simulation by the spectral element method, Tech. rep., "Annexe technique projet FN 21-40'512.94", IMHEF–DGM, Swiss Federal Institute of Technology, Lausanne (1996).
[2] Y. Dubois-Pèlerin, V. Van Kemenade, M. Deville, An object-oriented toolbox for spectral element analysis, J. Sci. Comput. 14 (1999) 1–29.
[3] Y. Dubois-Pèlerin, Speculoos: an object-oriented toolbox for the numerical simulation of partial differential equations by spectral and mortar element method, Tech. Rep. T-98-5, EPFL–LMF (1998).
[4] N. Fiétier, Detecting instabilities in flows of viscoelastic fluids, Int. J. Numer. Methods Fluids 42 (2003) 1345–1361.
[5] N. Fiétier, M. O. Deville, Linear stability analysis of time-dependent algorithms with spectral element methods for the simulation of viscoelastic flows, J. Non-Newtonian Fluid Mech. 115 (2003) 157–190.
[6] N. Fiétier, M. O. Deville, Time-dependent algorithms for the simulation of viscoelastic flows with spectral element methods: applications and stability, J. Comput. Phys. 186 (2003) 93–121.
[7] N. Bodard, M. O. Deville, Fluid-structure interaction by the spectral element method, J. Sci. Comput. 27 (2006) 123–136.
[8] R. Bouffanais, M. O. Deville, P. F. Fischer, E. Leriche, D. Weill, Large-eddy simulation of the lid-driven cubic cavity flow by the spectral element method, J. Sci. Comput. 27 (2006) 151–162.
[9] R. Bouffanais, M. O. Deville, E. Leriche, Large-eddy simulation of the flow in a lid-driven cubical cavity, Phys. Fluids 19 (2007) Art. 055108.
[10] R. Bouffanais, M. O. Deville, Mesh update techniques for free-surface flow solvers using spectral element method, J. Sci. Comput. 27 (2006) 137–149.
[11] P. F. Fischer, A. T. Patera, Parallel spectral element solution of the Stokes problem, J. Comput. Phys. 92 (1991) 380–421.
[12] M. O. Deville, P. F. Fischer, E. H. Mund, High-Order Methods for Incompressible Fluid Flow, Cambridge University Press, Cambridge, 2002.
[13] R. Gruber, A. Gunzinger, The Swiss-Tx supercomputer project, EPFL Supercomputing Review 9 (1997) 21–23.
[14] Y. Maday, A. T. Patera, Spectral element methods for the incompressible Navier–Stokes equations, in: A. K. Noor, J. T. Oden (Eds.), State-of-the-Art Surveys on Computational Mechanics, ASME, New York, 1989, pp. 71–142.
[15] A. T. Patera, A spectral element method for fluid dynamics: laminar flow in a channel expansion, J. Comput. Phys. 54 (1984) 468–488.
[16] C. Bernardi, Y. Maday, A. T. Patera, A new nonconforming approach to domain decomposition: the mortar element method, in: Nonlinear Partial Differential Equations and Their Applications, Collège de France Seminar, XI (Paris, 1989–1991), Vol. 299 of Pitman Res. Notes Math. Ser., Longman Sci. Tech., Harlow, 1994, pp. 13–51.
[17] W. D. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, Cambridge, Massachusetts, 1999.
[18] W. Couzy, Spectral element discretization of the unsteady Navier–Stokes equations and its iterative solution on parallel computers, Ph.D. thesis, No. 1380, Swiss Federal Institute of Technology, Lausanne (1995).
[19] G. E. Karniadakis, M. Israeli, S. A. Orszag, High-order splitting methods for the incompressible Navier–Stokes equations, J. Comput. Phys. 97 (1991) 414–443.
[20] W. Couzy, M. O. Deville, Spectral-element preconditioners for the Uzawa pressure operator applied to incompressible flows, J. Sci. Comput. 9 (1994) 107–112.
[21] J. B. Perot, An analysis of the fractional step method, J. Comput. Phys. 108 (1993) 51–58.
[22] J. B. Perot, Comments on the fractional step method, J. Comput. Phys. 121 (1995) 190–191.
[23] Y. Maday, A. T. Patera, E. M. Rønquist, The P_N × P_{N−2} method for the approximation of the Stokes problem, Tech. Rep. 92009, Department of Mechanical Engineering, MIT, Cambridge, MA (1992).
[24] R. Gruber, P. Volgers, A. DeVita, M. Stengel, T.-M. Tran, Parametrisation to tailor commodity clusters to applications, Future Generation Computer Systems 19 (2003) 111–120.
[25] R. Gruber, T.-M. Tran, Scalability aspects of commodity clusters, EPFL Supercomputing Review 14 (2004) 12–17.
[26] Performance API (PAPI), website, http://icl.cs.utk.edu/papi/index.html (2007).
[27] VAMOS (Veritable Application MOnitoring Service), website, http://pleiades.epfl.ch/~vkeller/VAMOS (2006).
[28] The Ganglia Monitoring Tool, website, http://ganglia.sourceforge.net (2007).