Simulation-based Optimization and Sensibility Analysis of MPI Applications: Variability Matters
Tom Cornebize, Arnaud Legrand
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
fi[email protected]
Abstract
Finely tuning MPI applications and understanding the influence of key parameters (number of processes, granularity, collective operation algorithms, virtual topology, and process placement) is critical to obtain good performance on supercomputers. Given the resources consumed by running applications at scale, doing so solely to optimize their performance is particularly costly. Having inexpensive but faithful predictions of expected performance could be a great help for researchers and system administrators. The methodology we propose decouples the complexity of the platform, which is captured through statistical models of the performance of its main components (MPI communications, BLAS operations), from the complexity of adaptive applications by emulating the application and skipping regular non-MPI parts of the code. We demonstrate the capability of our method with High-Performance Linpack (HPL), the benchmark used to rank supercomputers in the TOP500, which requires careful tuning. We briefly present (1) how the open-source version of HPL can be slightly modified to allow a fast emulation on a single commodity server at the scale of a supercomputer. Then we present (2) an extensive (in)validation study that compares simulation with real experiments and demonstrates our ability to consistently predict the performance of HPL within a few percent. This study allows us to identify the main modeling pitfalls (e.g., spatial and temporal node variability or network heterogeneity and irregular behavior) that need to be considered. Last, we show (3) how our "surrogate" allows studying several subtle HPL parameter optimization problems while accounting for uncertainty on the platform.
1. Introduction

Today, supercomputers with 100,000 cores and more are common, and several machines beyond the 1,000,000-core mark are already in production. These compute resources are interconnected through complex non-uniform memory hierarchies and network infrastructures. This complexity requires careful optimization of application parameters, such as granularity, process organization, or algorithm choice, as these have an enormous impact on load distribution and network usage. Scientific application developers and users often spend a substantial amount of time and effort running their applications at different scales solely to tune parameters for optimizing their performance. Whenever actual performance does not match expectations, it can be challenging to understand whether the mismatch originates from a misunderstanding of the application or from a misconfiguration of the machine. Similar difficulties are encountered when (co-)designing supercomputers for specific applications. A large part of this tuning work could be simplified if a generic and faithful performance prediction tool were available. This article presents a decisive step in this direction.

Several techniques have been proposed to predict the performance of a given application on a supercomputer. A first approach consists in building a mathematical performance model (i.e., an analytic formula) accounting for both platform and application key characteristics. However, it is rarely accurate, except for elementary applications on highly regular and well-provisioned platforms, and can thus merely be used to predict broad trends. A more precise approach consists in capturing a trace of the application at scale and replaying it using a simulator. This is an effective approach for capacity planning, but since the application trace is specific to a given set of parameters (and even specific to a given run for dynamic applications that exhibit non-deterministic behaviors due to, e.g., the use of asynchronous collective operations), it cannot be used to study how application parameters should be set for optimizing performance. The main difficulty resides in capturing and modeling the interplay between the application and the platform while faithfully accounting for their respective complexity. A promising approach recently pioneered in several tools [1, 2, 3] consists in emulating the application in a controlled way so that a platform simulator governs its execution. Although this approach's scalability is a primary concern that has already received lots of attention, the accuracy of the simulation is even more challenging. It is still an open research question, since Engelmann and Naughton [4] report, for example, an error ranging from 20% to 40% for NPB LU when using 128 ranks.

In a previous publication [5], we presented how an application like HPL can be emulated at a reasonable cost on a single commodity server to study scenarios similar to qualification runs of supercomputers for the TOP500 ranking [6]. We also showed how to predict the performance of HPL for a specific set of parameters on a recent cluster (running a thousand MPI ranks) within a few percent of reality. HPL is particularly challenging to study because it implements several custom non-trivial MPI collective communication algorithms to efficiently overlap communications with computations. It is also particularly sensitive to platform variability (both spatial and temporal).
In this article, we conduct an extensive validation study. We show that our approach allows us to consistently predict the real-life performance of HPL within a few percent regardless of its input parameters, thereby showing that application parameters can be tuned fully in simulation. Throughout this validation, which spanned over two years, we also highlight key issues that may arise when modeling the platform and should be carefully addressed to obtain reliable predictions. Last, given the sensibility of applications to the variability of computing and communication resources, we showcase how to conduct what-if performance analyses of HPC applications in a capacity planning context.

This article is organized as follows: Section 2 presents the main characteristics of the HPL application and provides information on how the runs are conducted on modern supercomputers. In Section 3, we briefly present the simulator we used for this work, SimGrid/SMPI, the modifications of HPL that were required to obtain a scalable simulation, and some initial validation results presented in an earlier work [5]. These results highlight the importance of modeling both spatial and temporal variability. In Section 4, we compare simulation results with real experiments through two typical HPL performance studies that cover a wider range of application parameters. Section 5 presents how our HPL surrogate can be used to study and possibly optimize the performance of HPL in the presence of uncertainty on the platform. Section 6 discusses related work and explains how our approach compares with other approaches. Section 7 concludes by discussing future work.
2. Background on High-Performance Linpack
HPL implements a matrix factorization based on a right-looking variant of the LU factorization with row partial pivoting and allows for multiple look-ahead depths. In this work, we use the freely-available reference implementation of HPL [7], which relies on MPI, and from which most vendor-specific implementations (e.g., from Intel or ATOS) have been derived. Figure 1 illustrates the principle of the factorization, which consists of a series of panel factorizations, each followed by an update of the trailing sub-matrix. HPL uses a two-dimensional block-cyclic data distribution of A, which allows for a smooth load-balancing of the work across iterations.

[Figure 1: Overview of High-Performance Linpack: allocate and initialize A; then, for each panel of NB columns: allocate the panel, factor the panel, broadcast the panel, and update the trailing sub-matrix.]

The sequential computational complexity of this factorization is $\mathrm{flop}(N) = \frac{2}{3}N^3 + 2N^2 + O(N)$, where N is the order of the matrix to factorize. The time complexity on a P × Q processor grid can thus be approximated by

$$T(N) \approx \frac{\frac{2}{3}N^3 + 2N^2}{P \cdot Q \cdot w} + \Theta((P+Q) \cdot N^2),$$

where w is the flop rate of a single node and the second term corresponds to the communication overhead, which is influenced by the network capacity and many configuration parameters of HPL. Indeed, HPL implements several custom MPI collective communication algorithms to efficiently overlap communications with computations. The main parameters of HPL are thus:

• N is the order of the square matrix A.

• NB is the "blocking factor", i.e., the granularity at which HPL operates when panels are distributed or worked on. This parameter influences the efficiency of the dgemm BLAS kernel, which is the kernel used in the sub-matrix updates, but also the efficiency of MPI communications.

• P and Q denote the number of process rows and process columns. For this algorithm, the total amount of data transfers is proportional to (P + Q) · N², which generally favors virtual topologies where P and Q are approximately equal.

• RFACT determines the panel factorization algorithm. Possible values are Crout, left-, or right-looking.

• SWAP specifies the swapping algorithm used while pivoting. Two algorithms are available: one is based on a binary exchange (along a virtual tree topology) and the other one is based on a spread-and-roll (with a higher number of parallel communications). HPL also provides a panel-size threshold triggering a switch from one variant to the other.
• BCAST sets the algorithm used to broadcast a panel of columns over the process columns. Legacy versions of the MPI standard only supported non-blocking point-to-point communications, which is why HPL ships with in total six self-implemented variants to overlap the time spent waiting for an incoming panel with updates to the trailing matrix: ring, ring-modified, 2-ring, 2-ring-modified, long, and long-modified. The modified versions guarantee that the process right after the root (i.e., the process that will become the root in the next iteration) receives data first and does not further participate in the broadcast. This process can thereby start working on the panel as soon as possible. The ring and 2-ring versions each broadcast along the corresponding virtual topologies, while the long version is a spread-and-roll algorithm where messages are chopped into Q pieces. This generally leads to better bandwidth exploitation. The ring and 2-ring variants rely on MPI_Iprobe, meaning they return control if no message has been fully received yet, hence facilitating partial overlap of communication with computations. In HPL 2.1 and 2.2, this capability has been deactivated for the long and long-modified algorithms. A comment in the source code states that some machines apparently get stuck when there are too many ongoing messages.

• DEPTH controls how many iterations of the outer loop can overlap with each other. As indicated in the HPL documentation, a depth equal to 1 often gives better results than a depth equal to 0 for large problem sizes, but a look-ahead of depth equal to 3 and larger is not expected to bring any improvement.

All the previously listed parameters interact in unique ways with the interconnection network capability and the MPI library to influence the overall performance of HPL, which makes it very difficult to predict precisely. To illustrate the diversity of real-life configurations, we report in Table 1 a few ones used for the TOP500 ranking that some colleagues agreed to share with us.

Table 1: Typical HPL configurations.

               Stampede@TACC     Theta@ANL
  Rpeak        –                 –
  N            –                 –
  NB           –                 –
  P × Q        77 × 78           32 × –
  RFACT        Crout             Left
  SWAP         Binary-exch.      Binary-exch.
  BCAST        Long modified     2-ring modified
  DEPTH        –                 –
  Rmax         –                 –
  Duration     2 hours           28 hours
  Memory       120 TB            559 TB
  MPI ranks    1/node            1/node

The performance typically achieved by supercomputers (Rmax) needs to be compared to the much larger peak performance (Rpeak). The difference can be attributed to the node usage, to the MPI library, to the network topology that may be unable to deal with the intense communication workload, to load imbalance among nodes (e.g., due to a defect, system noise, ...), to the algorithmic structure of HPL, etc. All these factors make it difficult to know precisely what performance to expect without running the application at scale. Due to the complexity of both HPL and the underlying hardware, simple performance models (analytic expressions based on N, P, Q and estimations of platform characteristics, as presented above) may at best be used to determine broad trends but can by no means accurately predict the performance for each configuration (e.g., consider the exact effect of HPL's six different broadcast algorithms on network contention). Additionally, these expressions do not allow engineers to improve the performance through actively identifying performance bottlenecks. For complex optimizations such as partially non-blocking collective communication algorithms intertwined with computations, a very faithful model of both the application and the platform is required.
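To make this concrete, the following sketch (our illustration, not part of HPL) evaluates the simple analytic model given above; the flop rate w and the lumped communication constant c are hypothetical placeholders that would have to be calibrated on the target machine.

    #include <stdio.h>

    /* Rough analytic estimate of HPL's execution time (in seconds) on a
     * P x Q process grid: the compute term of the formula above plus a
     * communication term proportional to (P + Q) * N^2. */
    static double hpl_time_estimate(double N, int P, int Q,
                                    double w /* flop rate per node   */,
                                    double c /* lumped comm. constant */) {
        double flops = (2.0 / 3.0) * N * N * N + 2.0 * N * N;
        return flops / ((double)P * Q * w) + c * (P + Q) * N * N;
    }

    int main(void) {
        /* Hypothetical calibration values, for illustration only. */
        double w = 5e11;  /* 500 GFlop/s per node */
        double c = 1e-12; /* seconds per (P+Q)*N^2 unit */
        for (double N = 1e5; N <= 5e5; N += 1e5)
            printf("N = %.0f -> T = %.0f s\n",
                   N, hpl_time_estimate(N, 32, 32, w, c));
        return 0;
    }

Such a model reveals broad trends (cubic growth with N, the (P + Q) communication term) but, as stressed above, it cannot capture the effect of the broadcast algorithms or of network contention.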
3. Emulating HPL with SimGrid/SMPI
In this section, we present an overview of SimGrid/SMPI, the modifications of HPL required to obtain a scalable simulation, and a first validation of the simulations. The results of Sections 3.1–3.4 previously appeared in a conference publication [5] and are included here for completeness.
3.1. SimGrid/SMPI

SimGrid [8] is a flexible and open-source simulation framework that was initially designed in 2000 to study scheduling heuristics tailored to heterogeneous grid computing environments but was later extended to study cloud and HPC infrastructures. The main development goal for SimGrid has been to provide validated performance models, particularly for scenarios making heavy use of the network. Such a validation usually consists of comparing simulation predictions with real experiments to confirm or debunk and improve network and application models.

SMPI, a simulator based on SimGrid, has been developed and used to simulate unmodified MPI applications written in C/C++ or FORTRAN [1]. To this end, SMPI maps every MPI rank of the application onto a lightweight simulation thread. These threads are then run in mutual exclusion and controlled by SMPI, which measures the time spent computing between two MPI calls. This duration is injected into the simulator as a simulated delay, scaled up or down depending on the speed difference between the simulated machine and the simulation machine.

The complex optimizations done in real MPI implementations need to be considered when predicting the performance of applications. For instance, the "eager" and "rendezvous" protocols are selected based on the message size, with each protocol having its own synchronization semantics, which strongly impacts performance. Another problematic issue is to model network topologies and contention. SMPI relies on SimGrid's communication models, where each ongoing communication is represented as a single flow (as opposed to a collection of individual packets). Assuming steady state, contention between active communications can then be modeled as a bandwidth sharing problem while accounting for non-trivial phenomena (e.g., cross-traffic interference [9]). The time spent in MPI is thus derived from the SMPI network model that accounts for MPI peculiarities (depending on the message size), the machine topology, and the contention with all other ongoing flows. For more details, we refer the interested reader to [1].
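To give a flavor of such message-size-dependent models, here is a minimal sketch of a piecewise-linear communication time model of the kind SMPI calibrates; the breakpoint and the latency/bandwidth pairs are hypothetical values, not actual SMPI calibration results.

    /* Time (in seconds) to transfer 'size' bytes, with a protocol switch
     * at 64 KiB: below it, an "eager"-like latency/bandwidth pair; above
     * it, a "rendezvous"-like pair. Real calibrations typically fit
     * several such segments from ping-pong measurements. */
    static double msg_time(double size) {
        if (size < 65536.0)
            return 1.0e-6 + size / 5.0e9;  /* 1 us latency, 5 GB/s  */
        else
            return 5.0e-6 + size / 11.0e9; /* 5 us latency, 11 GB/s */
    }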
3.2. Emulating HPL

HPL relies heavily on BLAS kernels such as dgemm (for matrix-matrix multiplication). Since these kernels' output does not influence the control flow, simulation time can be reduced considerably by substituting these function calls with a performance model of the respective kernel. Figure 2 shows an example of this macro-based mechanism that allows us to keep HPL code modifications to an absolute minimum.

    #define HPL_dgemm(layout, TransA, TransB, M, N, K,              \
                      alpha, A, lda, B, ldb, beta, C, ldc) ({       \
        double size = ((double)M)*((double)N)*((double)K);          \
        double expected_time = 1.029e-11*size + 1.981e-12;          \
        smpi_execute_benched(expected_time);                        \
    })

Figure 2: Non-intrusive macro replacement with a very simple performance model.

The 1.029e-11 value represents the inverse of the flop rate for this compute kernel and is obtained by benchmarking the target nodes. The kernel's estimated duration is calculated based on the given parameters and passed on to smpi_execute_benched, which advances the simulated clock of the executing rank by this estimate. Skipping compute kernels makes the content of output variables invalid, but in simulation, only the application's behavior and not the correctness of computation results is of concern. These minor modifications to the original source code (HPL comprises 16K lines of ANSI C over 149 files; our modifications only changed 14 files, with 286 line insertions and 18 deletions) enabled us to simulate the configuration used for the Stampede cluster in 2013 for the TOP500 ranking (see Table 1) in less than 62 hours and using 19 GB on a single node of a commodity cluster (instead of 120 TB of RAM over a 6006-node supercomputer). Further speed-up could probably be obtained by modifying HPL further, but our primary interest in this article is the prediction quality.

Most BLAS kernels have several parameters from which a straightforward model can generally easily be identified (e.g., proportional to the product of the parameters), but refinements including the individual contribution of each parameter as well as the spatial and temporal variability of the operation are also possible. In the following, all the simulations have been done with the following model for the dgemm kernel. For each processor p:

$$\mathrm{dgemm}_p(M, N, K) \sim \mathcal{H}(\mu_p, \sigma_p)$$
$$\mu_p = \alpha_p \cdot MNK + \beta_p \cdot MN + \gamma_p \cdot MK + \delta_p \cdot NK + \epsilon_p \qquad (1)$$
$$\sigma_p = \omega_p \cdot MNK + \psi_p \cdot MN + \phi_p \cdot MK + \tau_p \cdot NK + \rho_p$$

where H(µ, σ) denotes a half-normal random variable with parameters µ and σ accounting for the expectation and the standard deviation. The dependency on p indicates that the coefficients (α_p, β_p, ..., ρ_p) can be specific to each node, i.e., the aforementioned spatial variability. The σ_p parameter allows to account for (short-term) temporal variability, i.e., to model the fact that the durations of two successive calls to dgemm with the same parameters M, N, K are never identical. Modeling this variability is important as it may propagate through the communication pattern of the application (late sends and late receives). Last, the rationale for using a half-normal distribution rather than a normal distribution stems from the natural positive skewness of compute kernel durations. This model is much more complex than the simple deterministic one used in Figure 2 but, as we will explain, this complexity is key to obtaining good performance predictions [5]. However, the durations of all other kernels (which represent a negligible fraction of the overall execution time) have been modeled with a simple deterministic and homogeneous model, such as daxpy(N) = αN + β.
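The following sketch shows one possible implementation of this stochastic model; the construction of H(µ, σ) as a shifted and rescaled half-normal with expectation µ and standard deviation σ is our reading of the notation, and the coefficient values are placeholders for node-specific calibration results.

    #include <math.h>
    #include <stdlib.h>

    /* Shifted/rescaled half-normal draw with mean mu and std dev sigma:
     * |Z| (Z standard normal, via Box-Muller) has mean sqrt(2/pi) and
     * std dev sqrt(1 - 2/pi); we standardize and rescale it. */
    static double half_normal(double mu, double sigma) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double z  = fabs(sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2));
        double m0 = sqrt(2.0 / M_PI), s0 = sqrt(1.0 - 2.0 / M_PI);
        return mu + sigma * (z - m0) / s0;
    }

    /* Node-specific regression coefficients of Eq. (1), calibrated offline. */
    typedef struct {
        double a_mnk, a_mn, a_mk, a_nk, a_0; /* terms of mu_p    */
        double s_mnk, s_mn, s_mk, s_nk, s_0; /* terms of sigma_p */
    } dgemm_model;

    /* Draw one dgemm duration for the node described by 'm'. */
    static double dgemm_duration(const dgemm_model *m,
                                 double M, double N, double K) {
        double mu = m->a_mnk*M*N*K + m->a_mn*M*N + m->a_mk*M*K
                  + m->a_nk*N*K + m->a_0;
        double sg = m->s_mnk*M*N*K + m->s_mn*M*N + m->s_mk*M*K
                  + m->s_nk*N*K + m->s_0;
        return half_normal(mu, sg);
    }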
3.3. Experimental Settings

To evaluate the soundness of our approach, we compare several real executions of HPL with simulations using the previous models. We used the Dahu cluster from the Grid'5000 testbed. It has 32 nodes connected through a single switch by 100 Gbit/s Omni-Path links. Each node has two Intel Xeon Gold 6130 CPUs with 16 cores per CPU, and we disabled hyperthreading. We used HPL version 2.2 compiled with GCC version 6.3.0. We also used the libraries OpenMPI version 2.0.2 and OpenBLAS version 0.3.1. Unless specified otherwise, HPL executions were done using a block size of 128, a matrix of varying size, one single-threaded MPI rank per core, a look-ahead depth of 1, and the increasing-2-ring broadcast with the Crout panel factorization algorithms, as this is the combination that led to the best performance overall. Although this machine is much smaller than top supercomputers, faithfully simulating an HPL execution with such settings is quite challenging:

• We used one rank per core to obtain a higher number (1024) of MPI processes. This configuration is more difficult than simulating one rank per node, as (1) it increases the amount of data transferred through MPI, and (2) the performance is subject to memory interference and network heterogeneity (intra-node communications vs. inter-node communications).

• We used a much smaller block size than what is commonly used, which leads to a higher number of iterations and hence more complex communication patterns.

• We used relatively small input matrices, which reduces the makespan and makes good predictions harder to obtain.
3.4. Initial Validation Results

We now present a first quantitative comparison of the prediction with the reality on a typical scenario. We report in Figure 3 the GFlop/s rate reported by HPL when varying the matrix size N. Real executions are depicted in solid black, and the natural variability of the overall performance is illustrated by reporting eight runs of HPL for each matrix size.

[Figure 3: HPL performance: predictions (dashed lines) vs. reality (solid lines). Panels: (a) homogeneous & deterministic (+33%); (b) heterogeneous & deterministic (+9%); (c) heterogeneous & stochastic (−1%). Axes: matrix rank vs. GFlop/s.]

The dashed line (a), on top, is our first attempt to simulate HPL with the naive model (homogeneous and deterministic for both the kernels and the network) illustrated in Figure 2. This model overestimates HPL performance by more than 30%. Modeling the heterogeneity of dgemm (i.e., introducing the dependency on p for dgemm as done in Eq. (1) but without the temporal variability induced by σ_p) significantly increases the realism of the simulation, as the performance is then overestimated by only 9% (dashed line (b)). Finally, we found that adding the temporal variability is the key ingredient to obtain the last bit of realism. The prediction using the full-fledged model (dashed line (c)) is extremely close to reality, as it slightly underestimates the performance by only a few percent, and even as little as 1% for the larger matrices.

As illustrated in Figure 3 and explained in our previous work [5], accurate predictions require careful modeling of both spatial and temporal variability, as they appear to have a very strong effect on HPL performance. This is somewhat expected since HPL is an iterative program that synchronizes through the broadcast of factorization panels. A single slower or late process will eventually delay all the other ones. In the scenario presented in Figure 3, a large fraction of the overall execution time is spent in MPI communications, but foremost in synchronizations (induced by late sends and late receives) rather than in actual data transfers. As a consequence, careful modeling of computations is essential, while even careless modeling of the network was enough to obtain good predictions here. This article presents an extensive (in)validation study that demonstrates the importance of careful modeling of the whole platform.

3.5. Accounting for Platform Evolution

This validation study has been carried out over several years (from 2018 to 2020). Despite our efforts to keep the experimental setup stable for the sake of reproducibility, the platform has evolved. The Linux kernel had a minor update, from version 4.9.0-6 to version 4.9.0-13, and the BIOS and firmware of the nodes have been upgraded. During this time frame, the cluster has also suffered from hardware issues, like a cooling malfunction on four of its nodes and several faulty memory modules that had to be changed. This malfunction had an enormous impact on the performance of HPL, which significantly complicated our validation study but also makes it more meaningful, as it has been conducted on a particularly challenging setup.

Our simulation approach makes it possible to predict the performance of HPL for a new platform state by merely making a new calibration whenever a significant change is detected.
This ability to reflect a platform change in simulation is illustrated in Figure 4 which, similarly to Figure 3 (acquired in March 2019), showcases the influence of the matrix size on the performance but at different periods. The left plot represents the normal state of the cluster (in September 2020), whereas the right plot has been obtained (in March–April 2019) when 4 of the 32 nodes had a cooling issue which lowered their performance by about 10%. In all cases, we consistently predict performance within a few percent, and performing a new dgemm calibration on these four nodes was all that was needed to reflect this platform change in the simulation.

[Figure 4: HPL performance: predictions vs. reality (effect of the cooling issue on the nodes dahu-{13,14,15,16}). Panels: normal cooling vs. problematic cooling; axes: matrix rank vs. GFlop/s.]

This result illustrates both the faithfulness of our simulations and a potential use case for predictive simulations: a discrepancy between the reality and the predictions can sometimes indicate a real issue on the platform (similar situations have already been reported in [1]).
4. Comprehensive Validation Through HPL Performance Tuning
This section reports a few typical performance studies involving HPL through both real experiments and simulations. The comparison of both approaches allows us (1) to cover a broader range of parameters than solely the matrix size as done in our earlier work, (2) to evaluate how faithful to reality our simulations are even in suboptimal configurations, and (3) to report the main difficulties encountered when conducting such a study.

4.1. Evaluating the Influence of the Geometry

Figure 5(b) illustrates the influence on the performance of the geometry of the virtual topology (P and Q) used in HPL. As expected, geometries that are too distorted lead to degraded performance. All the HPL parameters were fixed (the matrix rank was held constant and the other parameters are the same as in Section 3.4) except for the geometry, as we evaluate all the pairs (P, Q) such that P × Q = 960. We used only 30 nodes instead of 32 to cover a larger number of geometries, as 960 has more divisors than 1024.

[Figure 5: The first (optimistic) network calibration gave poor predictions for very elongated geometries while the improved calibration provides perfect predictions. (a) Effect of the two MPI calibration methods: the optimistic method merely samples small message sizes and extrapolates for larger sizes; unfortunately, for messages larger than 160 MB, the effective bandwidth significantly drops. The more realistic calibration measures MPI communication durations for much larger messages while injecting some computation load in the background. (b) HPL performance: predictions vs. reality (testing all the possible geometries for 960 MPI ranks); axes: P vs. GFlop/s.]

As in all our previous studies, we report both the predicted performance and the one measured in reality. Like the comparisons presented in the previous section, the simulation was done with the dgemm model from Eq. (1) (stochastic, heterogeneous, and polynomial) and the simplest linear models for the other kernels. In our first simulation attempt, which relied on a relatively simple network model (deterministic yet piecewise-linear to account for protocol switches) depicted on the leftmost plot of Figure 5(a), we obtained the unsatisfying orange line on top of Figure 5(b) for the prediction. The simulations with the smallest values of P had relatively large prediction errors, with a systematic over-estimation that reaches up to +50% for the most elongated geometries. A qualitative comparison of the execution traces obtained in reality and in simulation showed that the duration of the broadcast phases was greatly underestimated in simulation. We found out that with such elongated geometries, the message size is significantly larger than what we had used in our calibration, and the performance surprisingly and significantly drops for such sizes (compare with the rightmost plots of Figure 5(a)). This performance drop is explained by a poor optimization of the DMA locking mechanism in the Infiniband network layer [10]. A similar performance drop also happens for intra-node communications that poorly manage the caches above a given size. Furthermore, the communication patterns generated by HPL during the ring broadcast are significantly impacted by the busy waiting of HPL, which intensively calls MPI_Probe and dgemm on small sub-matrices. Our initial procedure for calibrating the network did not capture this phenomenon since we did not inject any additional CPU load.
[Figure 6: Influence of the HPL configuration on the performance (factorial experiment): GFlop/s in reality vs. simulation for each combination of block size (128 or 256), depth (0 or 1), broadcast algorithm (1rg, 1rM, 2rg, 2rM, Lng, LnM), and swap algorithm (bin-exch, long, mix); prediction errors are flagged as 0%–5%, 5%–10%, or >10%. Parameters have been reorganized based on their influence on performance to improve readability. The boxed configuration corresponds to the one boxed in Figure 3.]
We addressed this problem by improving our network calibration procedure: (1) we use a distinct model for local and remote communications, (2) we sample the message sizes in a larger interval, and (3) we add calls to dgemm and MPI_Iprobe between each call to MPI_Send and MPI_Recv. The goal was to make the calibration environment more similar to what happens in HPL; the sketch below illustrates the idea.
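A minimal sketch of such a calibration loop is given below (our illustration of the idea, not the actual calibration code): rank 0 times round-trips to rank 1 while both ranks interleave MPI_Iprobe calls and a small compute kernel, mimicking HPL's busy waiting.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void fake_dgemm(double *buf, int n) { /* background CPU load */
        for (int i = 0; i < n; i++)
            buf[i] = buf[i] * 1.000001 + 1e-9;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int size = 1 << 20;                  /* message size under test */
        char *msg = malloc(size);
        double *load = calloc(4096, sizeof(double));
        for (int iter = 0; iter < 10; iter++) {
            int flag;
            MPI_Status st;
            /* Busy waiting similar to HPL's broadcast implementation. */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
            fake_dgemm(load, 4096);
            if (rank == 0) {
                double t = MPI_Wtime();
                MPI_Send(msg, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
                printf("%d bytes: %g s round-trip\n", size, MPI_Wtime() - t);
            } else if (rank == 1) {
                MPI_Recv(msg, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(msg, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        free(msg); free(load);
        MPI_Finalize();
        return 0;
    }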
The resulting network model is illustrated in the rightmost plots of Figure 5(a). This more realistic network model solved every previous misprediction and allows us to produce very faithful simulations (purple line on Figure 5(b)), which are now within a few percent of the reality regardless of the geometry. This figure also illustrates the influence of the geometry on the overall performance, since there is almost a factor of ten between the worst configuration and the best one. Although it is not surprising that the geometries which are as square as possible lead to better performance, as they minimize the overall amount of data movements, it is interesting to observe the asymmetric role of P and Q in the overall performance (smaller values of P lead to better performance), which can be explained by the structure of the collective operations but requires a close look at the code.

4.2. Optimizing the HPL Configuration Through a Factorial Experiment

Although the geometry is among the most important parameters to tune, six other parameters control the behavior of HPL. In Figure 6, we compare the performance reported by HPL when fixing the matrix rank and varying the following parameters: block size (128 or 256), depth (0 or 1), broadcast (the six available algorithms), and swap (the three available algorithms). The geometry was fixed to
P × Q = 32 × 32 = 1024, as it is optimal (the simpler calibration procedure and the network model depicted on the leftmost plot of Figure 5(a) were thus used). The parameters pfact and rfact (panel factorization) were respectively fixed to Crout and Right, as they had nearly no influence on HPL performance in our early experiments.

Figure 6 depicts the 72 parameter combinations we tested. These parameters account for up to 30% of variability in the performance, which is less important than the geometry but is still quite significant. For 61 of them, the prediction error is lower than 5%. Only two combinations have shown a large error of approximately 15%, obtained with a block size of 256, a depth of 1, and either the long or the mix swap algorithm. This demonstrates the soundness of our approach, as our predictions are reasonably accurate most of the time. This experiment confirms that, although the prediction of HPL performance for a given parameter combination has a systematic bias, the error remains within a few percent most of the time. Therefore, this surrogate is good enough for parameter tuning and should be considered when preparing a large-scale run.

While testing all the parameter combinations is the safest method to discover the combination that provides the highest performance, its cost can be prohibitive due to the high number of combinations. An alternative often used in practice is to explore only a small subset of the parameter space and to use an analysis of variance (ANOVA) to identify the parameters with the most substantial effect on performance and then select the appropriate combination. We applied this procedure on samples of both datasets (the one obtained from real runs and the one obtained in simulation). In both cases, the two parameters with the highest effect were the block size NB and the depth, as shown in Figure 6, followed by bcast and swap. The best combinations selected in both cases were also identical, demonstrating once again the faithfulness of our simulation approach and how it can be used to reduce the experimental cost of parameter tuning.

Accurately predicting the performance of an application is not a trivial task. Discrepancies between reality and simulation can have multiple causes: the platform may have changed (e.g., the cooling issue that affected four nodes in Section 3.5), the model could be inaccurate (e.g., the homogeneous and deterministic dgemm model is too simple, as in [5]) or not correctly calibrated (e.g., the calibration procedure does not cover the appropriate parameter space, or the experimental conditions are too different, as in Section 4.1). As expected in any serious investigation of model validity, our validation study is not a mere collection of positive cases. Instead, it is the result of a thorough (we extensively covered the HPL parameter space) attempt to invalidate our model, as well as explanations of how we did so. By meticulously overcoming each of these issues, we have demonstrated the ability of our approach to produce very faithful predictions of HPL performance on a given platform.
5. Sensibility Analysis in What-if Scenarios
We have shown that many typical HPL case studies can be conducted in simulation. However, their conclusions (optimal geometry and parameters) are specific to the cluster we used, and they require a precise model of several aspects of the target cluster, which may not be available at early experimental stages. In particular, only a few cluster nodes may be available at first, and the whole cluster model should then be constructed from a limited set of observations and carefully extrapolated. This section shows how typical what-if simulation studies can be conducted given such uncertainty. Section 5.1 presents a generative model of node performance that can easily be fit from daily measurements and used to produce a similar platform. This model is used to quantify the impact on the overall performance of the temporal variability of the dgemm kernel in Section 5.2 and of the spatial variability of nodes in Section 5.3. In particular, we show how to study the efficiency of a simple slow node eviction strategy. Finally, we study in Section 5.4 the influence of the physical network topology on the overall performance. Most of these studies would be particularly difficult to conduct through real experiments because of the difficulty of finely controlling the platform.
5.1. A Generative Model of Node Performance

As we have seen in Section 3.2, the performance of nodes exhibits several kinds of variability: (i) a spatial variability (between nodes), (ii) a "short-term" temporal variability (the one experienced within an HPL run), but also (iii) a "long-term" temporal variability (from one day to another). As illustrated in Section 3.4, accounting for the first two kinds of variability is essential, but during our investigation of the simulation validity, which spanned over several months, we also had to deal with the fact that node performance could significantly vary from one day to another, thereby sometimes making our comparisons between a real experiment and a simulation driven by a model obtained from past measurements irrelevant.

This section explains how all sources of variability can be accounted for in a single unified model. From our observations, we assume that on a given node p and a given day d, the duration of the dgemm kernel can be modeled as follows:

$$\forall M, N, K, \quad \mathrm{dgemm}_{p,d}(M, N, K) \sim \mathcal{H}\left(\alpha_{p,d} \cdot MNK + \beta_{p,d},\; \gamma_{p,d} \cdot MNK\right) \qquad (2)$$

Compared to model (1), this model includes the daily variability but drops the complexity of a full-fledged polynomial. Such complexity may be important whenever trying to model a particular platform. However, when performing a sensibility analysis, a simpler model is preferred, especially as not all terms of the polynomial may be statistically significant. In this model, the short-term temporal variability stems from the γ_{p,d} term, while the average performance of the node stems from the α_{p,d} and β_{p,d} terms, which we gather in a single 3-dimensional vector:

$$\mu_{p,d} = (\alpha_{p,d}, \beta_{p,d}, \gamma_{p,d}) \qquad (3)$$

Now, since every machine is unique, it is natural to assume that for each machine:

$$\forall d, \quad \mu_{p,d} \sim \mathcal{N}(\mu_p, \Sigma_T) \qquad (4)$$

In this model, µ_p accounts for the average performance of the machine p, while Σ_T accounts for its day-to-day variability. From our observations, we had no particular reason to assume that this variability was different from one machine to another; hence, Σ_T is not indexed by p but global to all machines. However, the parameters α_{p,d}, β_{p,d}, γ_{p,d} are generally correlated to each other, hence Σ_T is a full covariance matrix to account for interactions. The choice of a normal distribution is natural since it is the simplest distribution that accounts for a specific mean and variance, but we will discuss its relevance later in this section. Finally, we need to account for the spatial variability, which we propose to model as follows:

$$\forall p, \quad \mu_p \sim \mathcal{N}(\mu, \Sigma_S) \qquad (5)$$

Again, in such a model, µ accounts for the machines' average performance while Σ_S accounts for the (weak) heterogeneity. This hierarchical model is depicted in Figure 7.

The relevance of model (2) has already been illustrated in Sections 3.4, 3.5, and 4, but the relevance of models (4) and (5) requires some attention. Figure 8(a) represents the empirical distribution of µ_{p,d} = (α_{p,d}, β_{p,d}, γ_{p,d}) (the result of the linear regression) for the 32 nodes of the Dahu cluster on 40 different days from November 2019 to February 2020. The distribution for each node appears approximately normal and passed a Shapiro-Wilk normality test. Although the distribution of the β_{p,d} appears slightly skewed toward larger values and one of the nodes (the one with the largest α_{p,d}) stands out, there is no good reason for using a more complex distribution than a Gaussian one.
Although the correlation between α, β, and γ is very weak, it appeared to be statistically significant (most ellipses are slightly tilted toward the North-East); hence, a full covariance matrix is needed (at least for Σ_T).

[Figure 7: Generative model of kernel duration accounting for the spatial (Σ_S), long-term (Σ_T), and short-term variability (γ_{p,d}). The shaded node represents observed variables and the diamond node deterministic variables, while non-shaded nodes represent latent variables. The solid node is the variable which is estimated when conducting (in)validation studies, while the dashed ones are useful when conducting sensibility analyses and extrapolating to a hypothetical cluster.]

It is thus easy to estimate µ_p and Σ_T by averaging over the µ_{p,d} of each node, and then to estimate µ and Σ_S by averaging over all the nodes. This moment-matching method is simple and provides very good estimates for µ, Σ_T, and Σ_S because we have enough measurements at our disposal and because it is particularly suited to the Gaussian modeling assumption. Should more complex models (e.g., a mixture to account for "outlier" nodes or a SkewNormal distribution to account for the distribution's skewness) be used, a general Bayesian sampling framework like STAN [11] would be more adapted. Such frameworks make it easy to specify hierarchical generative models like the one presented in Figure 7 and to draw samples from the posterior distribution of µ, Σ_T, and Σ_S, which can be used to generate realistic µ_{p,d} values for a new hypothetical cluster.

Such a process is depicted in Figure 8(b), where hypothetical regression parameters for 16 nodes have been generated.

[Figure 8: Distribution of the regression parameters for around 20 dgemm calibrations made on each of the 32 nodes; each color/ellipse corresponds to a different CPU. (a) Distribution of α, β, and γ (observations from November 2019 to February 2020). (b) Distribution of α, β, and γ (synthetic data for 16 CPUs).]

Comparing such synthetic data with the original measurements calls for two remarks. First, the spread of the α_p seems a bit overestimated (the spread along the x-axis is larger). This can be explained by the fact that one of the nodes seemed to be significantly slower (with a much larger α_p), which artificially increased the spatial variability. Second, as expected from a Gaussian model, the distributions of the β_{p,d} are symmetrical, whereas there was a slight negative skew in the original samples, but this should be of little significance for our study. The distributions of the γ_{p,d}, however, are particularly realistic.

We also illustrate the generality of this model with the data from Figure 9(a). These measurements were obtained from October to November 2019, when the cluster was less stable and some nodes particularly misbehaved.

[Figure 9: Distribution of the regression parameters for around 20 dgemm calibrations made on each of the 32 nodes; 4 of these nodes had a cooling problem, leading to longer and more variable durations. Each color/ellipse corresponds to a different group of CPUs. (a) Distribution of α, β, and γ (observations from October to November 2019). (b) Distribution of α, β, and γ (synthetic data for 16 CPUs).]
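For instance, synthetic clusters can be drawn as in the sketch below. For readability it assumes diagonal covariance matrices (the paper's model estimates full Σ_T and Σ_S to capture the correlation between α, β, and γ), and the numeric inputs are placeholders for the moment-matching estimates.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { double alpha, beta, gamma; } coefs;

    /* One Gaussian draw (Box-Muller). */
    static double gauss(double mean, double sd) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        return mean + sd * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }

    /* Generate daily dgemm coefficients for a synthetic cluster following
     * the hierarchy of Figure 7: node means mu_p drawn around mu (Eq. (5)),
     * then daily coefficients mu_{p,d} drawn around mu_p (Eq. (4)).
     * sd_S and sd_T hold per-component standard deviations only. */
    static void generate_cluster(int nodes, int days,
                                 coefs mu, coefs sd_S, coefs sd_T,
                                 coefs out[nodes][days]) {
        for (int p = 0; p < nodes; p++) {
            coefs mu_p = { gauss(mu.alpha, sd_S.alpha),    /* Eq. (5) */
                           gauss(mu.beta,  sd_S.beta),
                           gauss(mu.gamma, sd_S.gamma) };
            for (int d = 0; d < days; d++) {               /* Eq. (4) */
                out[p][d].alpha = gauss(mu_p.alpha, sd_T.alpha);
                out[p][d].beta  = gauss(mu_p.beta,  sd_T.beta);
                out[p][d].gamma = gauss(mu_p.gamma, sd_T.gamma);
            }
        }
    }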
Three nodes (in orange, hence a total of 6 CPUs) are distinguished from the 28 others (in green) and have lower performance (higher values of α, β, and γ), and one node (in blue) is particularly unstable. Although this last node may be considered too abnormal to represent anything, it would be reasonable to assume that a larger cluster would present at least the two kinds of behaviors (green for stable nodes, and orange for slower nodes). The higher layer of the model in Figure 7 should then be replaced by a mixture of normal distributions (whose weights would then be sampled from a Dirichlet distribution). Again, hypothetical regression parameters for 16 CPUs have been generated with such a process in Figure 9(b); they are very similar, although not identical, to the original measurements.

Overall, this model is therefore of excellent quality and can be used to generate large configurations very easily and to evaluate the influence of different kinds of variability on the performance of HPL.

5.2. Influence of dgemm's Temporal Variability

In Section 3.4, we highlighted the importance of accounting for the temporal variability of the dgemm kernel to obtain faithful HPL predictions. To the best of our knowledge, HPL developers and experts are often aware of this influence (or at least suspect it). However, they have never fully quantified it, since designing and performing real experiments to evaluate this would be quite difficult. Although increasing this variability would not be too hard, reducing it would be particularly complicated. This can, however, easily be done through simulation using the hierarchical model of the previous section. In our experiments, the order of magnitude of the temporal variability with respect to actual performance (i.e., the ratio between γ_{p,d} and α_{p,d} in Equation (2)) was around 3%. This may be a "normal" value or could be considered too high and possibly improved by better controlling thread mapping or Operating System noise. Such a task can be quite tedious, and knowing beforehand how much performance gain can be expected is thus quite useful. In this section, we study the influence of this variability by generating 10 cluster scenarios using the previous model (as in Figure 8), comprising 1,024 nodes each, but constraining γ_{p,d} to be equal to γ·α_{p,d} with γ ∈ [0, 0.1], which represents the coefficient of variation of the dgemm kernel. We evaluate the performance of HPL with one multi-threaded MPI rank per node, a block size of 512, and a look-ahead depth of 1. We used the increasing-2-ring broadcast with the Crout panel factorization algorithms and P × Q = 8 × 128, and we tested matrix sizes ranging from 100,000 to 500,000. Let us denote by T(N, C_i, γ) the performance of HPL when factorizing a matrix of rank N on cluster C_i with a temporal variability of γ. The overhead for this configuration is the ratio

$$O(N, C_i, \gamma) = \frac{E[T(N, C_i, \gamma)]}{T(N, C_i, 0)} - 1.$$

Each bubble in Figure 10 represents one such overhead. For any γ, this overhead appears to be negligible for small matrices and to increase and flatten when N grows large. In most TOP500 qualification runs, the matrix is made as large as possible, and the overhead would thus appear to grow roughly linearly with γ.
On a new cluster, a simple statistical evaluation of the nodes' performance using the model of Section 5.1 would thus be a good first diagnosis of whether trying to decrease temporal variability is a promising tuning target or not.
[Figure 10: Overhead on HPL duration (%) as a function of the dgemm coefficient of variation (2% to 10%), for several matrix ranks. The overhead appears to be linear in the dgemm temporal variability; although it is negligible for small matrices, it severely inflates for larger matrices.]
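For reference, a tiny helper computing this overhead from simulated durations might look as follows (our illustration):

    /* Overhead O(N, C_i, gamma) as defined above: the mean duration over
     * several stochastic runs with variability gamma, divided by the
     * duration of the deterministic run (gamma = 0), minus one. */
    static double overhead(const double *t_gamma, int runs, double t_zero) {
        double mean = 0.0;
        for (int r = 0; r < runs; r++)
            mean += t_gamma[r];
        mean /= runs;
        return mean / t_zero - 1.0;
    }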
5.3. Influence of the Nodes' Spatial Variability

Although we showed in Section 3.4 that temporal variability could account for about 9% of performance, spatial variability was even more important, as it was responsible for 22% of overhead compared to a fully homogeneous cluster. In practice, the replacement of a few nodes may be possible, but such spatial variability is expected and common [12], and a workaround would have to be found. A common approach consists in dropping out a few of the slowest nodes. Indeed, since the matrix is evenly divided between the nodes, the computation inevitably progresses at the speed of the slowest node. However, removing the slowest nodes also decreases the overall processing capability and impacts the virtual topology's geometry (the P and Q parameters of HPL). Such an adjustment is often done by trial and error and is all the more tricky as temporal variability and uncertainty from real experiments come into play. In this section, we show how such a subtle trade-off can be studied in simulation.

Using the model from Section 5.1, we generate 10 mildly heterogeneous 256-node clusters (i.e., where nodes are similar to the ones of our cluster when operating in the normal state, as in Figure 8(a)) and we study the performance obtained when removing 1 to 16 of the slowest nodes. When removing nodes, the geometry should be adjusted depending on how the number of remaining nodes decomposes into prime factors (a sketch of this enumeration is given at the end of this subsection). As observed in Figure 5(b), having P ≈ Q is generally a good idea to reduce the total amount of communication. However, it may be counter-productive for a given broadcast or swap algorithm that serializes communications. Figure 11 shows the average (over the 10 clusters) overhead for a matrix of rank 250,000 compared to the best performance obtained using the whole cluster. We group the different P × Q decompositions and order them by increasing P. Again, we use broadcast and swap algorithms that are among the best configurations according to the study of Section 4.2.

[Figure 11: Influence of the number of removed nodes on the performance of HPL, for all candidate P × Q geometries. The geometry of the virtual topology is particularly influential, and P × Q configurations with a small P perform significantly better than those with a larger P. Each configuration is summarized through the average overhead over the 10 clusters, and error bars represent a 95% confidence interval.]

It appears that an elongated geometry with a small P now achieves the best trade-off between the total amount of communications and how well they overlap with each other. The optimal configuration for each number of removed nodes is boxed in Figure 11. It reveals that there is not much to gain, probably because of the mild spatial heterogeneity of our cluster, but that optimizing the virtual topology is particularly important. Figure 12 investigates how this overhead for the best geometry and node selection also depends on the matrix rank. It appears that in this scenario, except for very small matrices, removing nodes cannot help improving performance.
Figure 12: Influence of node removal on performance while taking into account the matrixrank. Due to the mild heterogeneity of these scenarios, evicting nodes brings no benefit. rank 200,000 appears to behave differently from what happens for other matrixsizes. This surprising effect probably arises from a subtle combination of matrixsize and virtual topology. We could indeed observe on our cluster that suchconfigurations had a weakly but significantly worse performance than the otherconfigurations. Such interaction also explains why designing a faithful analyticalmodel of HPL is so difficult and why a full simulation of the whole applicationis generally required. Although absolute performance should be taken with agrain of salt when studying such subtle effects, they are easily overlooked whenconducting real experiments. In this particular small scale mild heterogeneityscenario, there is thus no gain in removing nodes but, as illustrated in Figure 13where we increased the spatial variability by a factor of 4, this may be a relevantapproach. A multimodal spatial heterogeneity (as in Figure 9) would certainlylead to much more significant gains. This sensibility analysis shows how, for agiven supercomputer, a simple statistical evaluation of the spatial heterogeneityallows evaluating whether spatial variability is a promising tuning target or not.
[Figure 13: Influence of node removal on performance in a stronger heterogeneity scenario (extrapolation of our test cluster when it had a cooling problem on 4 of its nodes). Removing 6 to 12 nodes out of 256 may bring substantial improvement, and such an optimization would therefore be worth investigating.]

5.4. Influence of the Physical Network Topology

Finally, since the virtual topology and communications appear to significantly influence the overall performance, one may wonder how much the physical topology influences the performance. Indeed, several recent articles [13, 14] report that interconnect networks are often oversized compared to the actual need of applications and that turning off some switches could sometimes go completely unnoticed by end-users. In this section, we consider ten 256-node clusters with variable node performance (as in Figure 8) interconnected by a 2-level fat-tree and quantify by how much performance degrades when the top-tier switches are gradually deactivated. More formally, we use a (2;32,8;1,N;1,8) fat-tree with N ∈ {5, 6, 7, 8}. Figure 14 depicts this degradation as a function of the matrix size. As one could expect, the impact is more significant for smaller matrix sizes (where the execution is more network-bound). Although removing one switch leads to absolutely no visible performance loss, removing two or three switches can have a dramatic effect. Again, such degradation depends on the broadcast and swap algorithms and may be slightly mitigated. To the best of our knowledge, it is the first time such a sensibility analysis is conducted faithfully. Generating random node configurations allows avoiding potential bias, in particular against perfectly homogeneous scenarios. We believe such a tool can be quite useful in the earlier steps of a supercomputer design when performing capacity planning to adjust the network capacity to a given cost and power envelope.
[Figure 14: Influence of the physical topology on the overall performance: overhead on HPL duration (%) as a function of the number of top-level switches removed (1 to 3), for several matrix ranks. It is possible to remove up to 2 of the top-level switches without significantly hurting performance for large matrices; beyond this point, communications become the main performance bottleneck.]
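In SimGrid, such a topology can be described directly in the platform file; the sketch below (with hypothetical host speed, bandwidth, and latency values) describes the full (2;32,8;1,8;1,8) fat-tree using SimGrid's FAT_TREE cluster syntax, whose topo_parameters string follows the same notation as above.

    <?xml version='1.0'?>
    <!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
    <platform version="4.1">
      <!-- 256 nodes in a 2-level fat-tree: 32 nodes per level-1 switch
           and 8 level-1 switches. Deactivating top-level switches
           amounts to lowering the "8" of the third field ("1,8"),
           which the text denotes N. -->
      <cluster id="synthetic" prefix="node-" suffix="" radical="0-255"
               speed="1Gf" bw="12.5GBps" lat="2us"
               topology="FAT_TREE" topo_parameters="2;32,8;1,8;1,8"/>
    </platform>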
6. Related Work
A first approach for estimating the performance of applications like HPL consists in statistically modeling the application as a whole [15]. By running the application several times with small and medium problem sizes (or a few iterations of large problem sizes) and using simple linear regressions, it is possible to predict its makespan for larger sizes with an error of only a few percent and at a relatively low cost. Unfortunately, the predictions are limited to the same application configuration, and studying the influence of the number of rows and columns of the virtual grid, or of the broadcast algorithms, requires a new model and new (costly) runs using the whole target machine. Our attempts to build a black-box analytical model (involving polynomials, inverses, and logarithms of P and Q) of HPL from a limited set of observations always failed to provide a faithful model with decent prediction and extrapolation capabilities. Furthermore, this approach does not allow studying what-if scenarios (e.g., to evaluate what would happen if the network bandwidth was increased or if node heterogeneity was decreased) that go beyond parameter tuning.

Simulation provides the details and flexibility missing from such a black-box modeling approach. Performance prediction of MPI applications through simulation has been widely studied over the last decades, but two approaches can be distinguished in the literature: offline and online simulation.

With the most common approach, offline simulation, a trace of the application is first obtained on a real platform. This trace comprises sequences of MPI operations and CPU bursts and is given as an input to a simulator that implements performance models for the CPUs and the network to derive predictions. Researchers interested in finding out how their application reacts to changes to the underlying platform can replay the trace on commodity hardware at will with different platform models. Most HPC simulators available today, notably BigSim [16], Dimemas [17], and CODES [18], rely on this approach. The main limitation of this approach comes from the trace acquisition requirement. Not only is a large machine required, but the compressed trace of a few iterations (out of several thousands) of HPL typically reaches a few hundred MB, making this approach quickly impractical [19]. Worse, tracing an application provides only information about its behavior for a specific run: slight modifications (e.g., to communication patterns) may make the trace inaccurate. The behavior of simple applications (e.g., stencils) can be extrapolated from small-scale traces [20, 21], but this fails if the execution is non-deterministic, e.g., whenever the application relies on non-blocking communication patterns, which is, unfortunately, the case for HPL.

The second approach discussed in the literature is online simulation. Here, the application is executed (emulated) on top of a platform simulator that determines when each process is run. This approach allows researchers to directly study the behavior of MPI applications, but only a few recent simulators such as SST Macro [2], SimGrid/SMPI [8], and the closed-source xSim [3] support it. To the best of our knowledge, only SST Macro and SimGrid/SMPI are mature enough to faithfully emulate HPL. In this work, we decided to rely on SimGrid as its performance models and its emulation capabilities are quite solid, but the work we present would a priori also be possible with SST.
Note that the HPL emulation we describe in Section 3.2 should not be confused with the application skeletonization [22] commonly used with SST and more recently introduced in CODES. Skeletons are code extractions of the most important parts of a complex application, whereas we only modify a few dozen lines of HPL before emulating it with SMPI. Some researchers from Intel, unaware of our recent work, applied the same methodology as the one we proposed in [5] to both Intel HPL and OpenHPL in the closed-source CoFluent simulator [23]. To the best of our knowledge, their work reports two faithful predictions for two large-scale supercomputers, but without investigating at all the impact of variability, heterogeneity, or communications as we do in this article. Finally, it is important to understand that the approach we propose is intended to help studies at the whole machine and application level, not studies of the influence of microarchitectural details as intended by gem5 [24] or MUSA [25].
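As an illustration of the black-box regression approach of [15] mentioned at the beginning of this section, the following minimal sketch (in Python, with purely hypothetical timings) fits a model with the terms suggested by the O(N^3) cost of LU factorization and extrapolates the makespan to a larger problem size. Such a fit is tied to a single application configuration: changing P, Q, or the broadcast algorithm would require new measurements.

import numpy as np

# Matrix orders and measured makespans (in seconds) for small/medium runs.
# These timings are purely illustrative (hypothetical data).
sizes = np.array([5_000, 10_000, 20_000, 40_000])
makespans = np.array([4.1, 28.5, 210.0, 1605.0])

# LU factorization performs ~(2/3)N^3 flops, so regress the makespan on
# N^3, N^2 and N, in the spirit of a partial-execution performance model.
X = np.column_stack([sizes**3, sizes**2, sizes]).astype(float)
coef, *_ = np.linalg.lstsq(X, makespans, rcond=None)

def predict(n):
    # Extrapolated makespan (seconds) for a full run of order n.
    return coef @ np.array([float(n)**3, float(n)**2, float(n)])

print(f"predicted makespan for N=100000: {predict(100_000):.0f} s")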
7. Conclusion
HPC application developers implement many elaborate algorithmic strategies (non-blocking collective operations, iteration look-ahead, etc.) whose impact on performance often depends on both the input workload and the target platform. This structure makes it very difficult to model and accurately forecast the overall application performance, and many HPC application developers and users are often left with no other option but to study and tune their applications at scale, which can be very time- and resource-consuming. We believe that being capable of precisely predicting an application's performance on a given platform is useful for application developers and users (e.g., to evaluate the scalability of the application on a given machine) and will become invaluable in the future as it can, for example, help computing centers decide which of the envisioned technologies for a new machine would work best for a given application, or whether an upgrade of the current machine should be considered.

Simulation is an effective approach in this context and SimGrid/SMPI has previously been successfully validated in several small-scale studies with simple HPC benchmarks [1, 26]. In an earlier work [5], we explained how SMPI could be used to efficiently emulate HPL. The proposed approach only requires minimal code modifications and applies to any application whose behavior does not strongly depend on data-dependent intermediate computation results. Although HPL is not a real application, it is quite optimized from an algorithmic point of view and its behavior can be controlled through 6 different parameters (granularity, geometry of the virtual topology, broadcast/swapping/factorization algorithms, and the number of concurrent iterations). HPL features classical optimization techniques such as heavily relying on MPI_Iprobe to overlap communication with computations, making it particularly challenging both in terms of tuning and simulation.
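To give an idea of the pattern involved, the following minimal sketch (using mpi4py, with hypothetical sizes and tags; this is not HPL's actual code) polls MPI_Iprobe between small chunks of computation so that an incoming panel can be received as soon as it is available:

from mpi4py import MPI
import numpy as np

# Run with, e.g.: mpirun -np 2 python overlap_sketch.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
PANEL_TAG = 7  # hypothetical tag for "panel" messages

if rank == 0:
    panel = np.random.rand(1024)           # the panel to send
    comm.Send(panel, dest=1, tag=PANEL_TAG)
elif rank == 1:
    trailing = np.random.rand(512, 512)    # local trailing-matrix block
    chunks = 0
    # Update the trailing matrix one row ("chunk") at a time; between two
    # chunks, poll for the incoming panel so that communication progresses
    # while useful computation is being performed.
    while not comm.Iprobe(source=0, tag=PANEL_TAG):
        trailing[chunks % 512] *= 0.99
        chunks += 1
    panel = np.empty(1024)
    comm.Recv(panel, source=0, tag=PANEL_TAG)
    print(f"panel received after {chunks} compute chunks")

It is this kind of polling that makes the execution non-deterministic and thus unsuitable for trace-based replay, as discussed in Section 6.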
In this article (Sections 3 and 4), we present an extensive validation study which covers the whole parameter space of HPL. Our study emphasizes the importance of carefully modeling (1) the platform heterogeneity (not all nodes have exactly the same performance), (2) the short-term temporal variability (e.g., system noise) for compute kernels, as it may propagate into communication patterns, and (3) the complexity of MPI (performance often wildly differs between small and large messages and between intra-node and extra-node communications). We show that disregarding any of these aspects may lead to grossly inaccurate predictions, even for an application as regular as HPL. By building on a few well-identified micro-benchmarks of the BLAS and MPI, we show that these aspects can be well modeled, which allows us to systematically predict the overall performance of HPL within a few percent. Our experimental results span over two years and we report situations (in Sections 3.5 and 4.1) where the simulation helped us identify performance regressions or anomalies incurred by the platform when the prediction did not match the real experiments.

We show (in Section 4) how this faithful surrogate can be used to evaluate the significance of application parameters and to tune them accordingly solely through simulations. We also propose a generative model for the compute nodes' performance that can easily be fit from daily measurements and used to produce synthetic platforms similar to the ones at hand (a minimal sketch is given at the end of this section). We show (in Section 5) how this model, which allows us to easily control temporal and spatial variability, can feed our simulations to assess the impact of variability on the performance of the application or of mitigation strategies (e.g., the eviction of the slower nodes). Likewise, the simulation makes it easy to assess the influence of the physical network on the overall performance. Most of these what-if studies would be particularly difficult to conduct through real experiments because of the difficulty of finely controlling the platform. This is, to the best of our knowledge, one of the first sensitivity analyses of a real HPC code accounting for platform uncertainty.

As future work, building on the effort of the SimGrid developers to support the emulation of a wide variety of applications with SMPI [27], we intend to conduct similar studies with other HPC benchmarks (e.g., HPCG [28] or HPGMG [29]), real applications (e.g., BigDFT [30]) and larger infrastructures. As explained in this article, a good model of the compute kernels and of the MPI library is essential. Therefore, the main challenge for systematic use of our simulation technique now lies in the automation of measurements through well-designed experiments and in the automatic detection of when the envisioned models miss essential characteristics of the platform (multi-modal behaviors, heteroscedasticity, discontinuities, etc.). We intend to provide a fully automatic calibration procedure for MPI as well as for every BLAS function, which would allow us to effortlessly predict the performance of many applications by simply linking against a BLAS-replacement library.
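To make the shape of such a generative model concrete, here is a minimal sketch assuming (purely for illustration, not as the exact model fitted in this article) Gaussian node-to-node heterogeneity and multiplicative temporal noise; all numbers are hypothetical:

import numpy as np

rng = np.random.default_rng(42)

N_NODES = 1024
MEAN_GFLOPS = 35.0   # hypothetical fleet-wide mean dgemm rate (GFlop/s)
SPATIAL_SD = 1.2     # node-to-node spread (spatial heterogeneity)
TEMPORAL_CV = 0.01   # per-call coefficient of variation (temporal noise)

# Spatial variability: each node gets its own long-term mean performance,
# as would be fitted from daily dgemm measurements.
node_means = rng.normal(MEAN_GFLOPS, SPATIAL_SD, size=N_NODES)

def sample_rates(node, n_calls):
    # Temporal variability: noisy per-call rates around the node's mean.
    return node_means[node] * rng.normal(1.0, TEMPORAL_CV, size=n_calls)

# Example what-if study: evict the 5% slowest nodes of the synthetic
# platform and compare the resulting mean performance.
threshold = np.quantile(node_means, 0.05)
kept = node_means[node_means > threshold]
print(f"mean rate: {node_means.mean():.2f} GFlop/s before eviction, "
      f"{kept.mean():.2f} GFlop/s after")

Feeding such synthetic platforms to the simulator is what allows the eviction and variability-mitigation studies of Section 5.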
Acknowledgments

References

[1] A. Degomme, A. Legrand, G. Markomanolis, M. Quinson, M. S. Stillwell, F. Suter, Simulating MPI applications: the SMPI approach, IEEE Transactions on Parallel and Distributed Systems 28 (8) (2017) 2387–2400. doi:10.1109/TPDS.2017.2669305.

[2] C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky, J. Mayo, A simulator for large-scale parallel architectures, International Journal of Parallel and Distributed Systems 1 (2) (2010) 57–73. doi:10.4018/jdst.2010040104.

[3] C. Engelmann, Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale, FGCS 30 (2014) 59–65.

[4] C. Engelmann, T. Naughton, A network contention model for the extreme-scale simulator, in: Proceedings of the IASTED International Conference on Modelling, Identification and Control (MIC) 2015, ACTA Press, Innsbruck, Austria, 2015. doi:10.2316/P.2015.826-043.

[5] T. Cornebize, A. Legrand, F. C. Heinrich, Fast and Faithful Performance Prediction of MPI Applications: the HPL Case Study, in: 2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, United States, 2019. doi:10.1109/CLUSTER.2019.8891011. URL https://hal.inria.fr/hal-02096571

[6] H. W. Meuer, E. Strohmaier, J. Dongarra, H. D. Simon, The TOP500: History, Trends, and Future Directions in High Performance Computing, 1st Edition, Chapman & Hall/CRC, 2014.

[7] A. Petitet, C. Whaley, J. Dongarra, A. Cleary, P. Luszczek, HPL - a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers, version 2.2 (February 2016).

[8] H. Casanova, A. Giersch, A. Legrand, M. Quinson, F. Suter, Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms, Journal of Parallel and Distributed Computing 74 (10) (2014) 2899–2917.

[9] P. Velho, L. Schnorr, H. Casanova, A. Legrand, On the Validity of Flow-level TCP Network Models for Grid and Cloud Simulations, ACM Transactions on Modeling and Computer Simulation 23 (4) (2013) 23.

[10] A. Denis, A High-Performance Superpipeline Protocol for InfiniBand, in: E. Jeannot, R. Namyst, J. Roman (Eds.), Euro-Par 2011, Vol. 6853 of Lecture Notes in Computer Science, Springer, Bordeaux, France, 2011, pp. 276–287. URL https://hal.inria.fr/inria-00586015

[11] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, A. Riddell, Stan: A probabilistic programming language, Journal of Statistical Software 76 (1) (2017). doi:10.18637/jss.v076.i01.

[12] Y. Inadomi, T. Patki, K. Inoue, M. Aoyagi, B. Rountree, M. Schulz, D. Lowenthal, Y. Wada, K. Fukazawa, M. Ueda, M. Kondo, I. Miyoshi, Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing, in: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12. doi:10.1145/2807591.2807638.

[13] E. A. León, I. Karlin, A. Bhatele, S. H. Langer, C. Chambreau, L. H. Howell, T. D'Hooge, M. L. Leininger, Characterizing parallel scientific applications on commodity clusters: An empirical study of a tapered fat-tree, in: SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 909–920. doi:10.1109/SC.2016.77.

[14] P. Taffet, S. Rao, E. León, I. Karlin, Testing the limits of tapered fat tree networks, in: 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 47–52. doi:10.1109/PMBS49563.2019.00011.

[15] P. Luszczek, J. Dongarra, Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modeling, in: Parallel Processing and Applied Mathematics, Springer Berlin Heidelberg, 2012, pp. 730–739. doi:10.1007/978-3-642-31464-3_74.

[16] G. Zheng, G. Kakulapati, L. Kale, BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines, in: Proc. of the 18th IPDPS, 2004.

[17] R. M. Badia, J. Labarta, J. Giménez, F. Escalé, Dimemas: Predicting MPI Applications Behaviour in Grid Environments, in: Proc. of the Workshop on Grid Applications and Programming Tools, 2003.

[18] M. Mubarak, C. D. Carothers, R. B. Ross, P. H. Carns, Enabling parallel simulation of large-scale HPC network systems, IEEE Transactions on Parallel and Distributed Systems (2016).

[19] H. Casanova, F. Desprez, G. S. Markomanolis, F. Suter, Simulation of MPI applications with time-independent traces, Concurrency and Computation: Practice and Experience 27 (5) (2015) 24. doi:10.1002/cpe.3278. URL https://hal.inria.fr/hal-01232776

[20] X. Wu, F. Mueller, ScalaExtrap: Trace-based communication extrapolation for SPMD programs, in: Proc. of the 16th ACM Symp. on Principles and Practice of Parallel Programming, 2011, pp. 113–122.

[21] L. Carrington, M. Laurenzano, A. Tiwari, Inferring large-scale computation behavior via trace extrapolation, in: Proc. of the Workshop on Large-Scale Parallel Processing, 2013.

[22] R. F. Bird, S. A. Wright, D. A. Beckingsale, S. A. Jarvis, Performance modelling of magnetohydrodynamics codes, in: M. Tribastone, S. Gilmore (Eds.), Computer Performance Engineering, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 197–209.

[23] G. Xu, H. Ibeid, X. Jiang, V. Svilan, Z. Bian, Simulation-based performance prediction of HPC applications: A case study of HPL, in: 2020 IEEE/ACM International Workshop on HPC User Support Tools (HUST) and Workshop on Programming and Performance Visualization Tools (ProTools), 2020, pp. 81–88. doi:10.1109/HUSTProtools51951.2020.00016.

[24] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj, G. Black, G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, N. Derumigny, S. Diestelhorst, W. Elsasser, C. Escuin, M. Fariborz, A. Farmahini-Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, A. Gutierrez, B. Hanindhito, A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, M. Horsnell, S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. Jones, M. Jung, S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, T. Marinelli, C. Menard, A. Mondelli, M. Moreto, T. Mück, O. Naji, K. Nathella, H. Nguyen, N. Nikoleris, L. E. Olson, M. Orr, B. Pham, P. Prieto, T. Reddy, A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, M. D. Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, I. Vougioukas, W. Wang, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon, Éder F. Zulian, The gem5 simulator: Version 20.0+ (2020). arXiv:2007.03152.

[25] T. Grass, C. Allande, A. Armejach, A. Rico, E. Ayguadé, J. Labarta, M. Valero, M. Casas, M. Moreto, MUSA: A multi-level simulation approach for next-generation HPC machines, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, IEEE Press, Piscataway, NJ, USA, 2016, pp. 45:1–45:12. URL http://dl.acm.org/citation.cfm?id=3014904.3014965

[26] F. C. Heinrich, T. Cornebize, A. Degomme, A. Legrand, A. Carpen-Amarie, S. Hunold, A.-C. Orgerie, M. Quinson, Predicting the Energy Consumption of MPI Applications at Scale Using a Single Node, in: Proc. of the 19th IEEE Cluster Conference, 2017. URL https://hal.inria.fr/hal-01523608

[27] SimGrid Developers, Continuous integration of ECP Proxy Apps, CORAL and Trinity-Nersc benchmarks with SimGrid/SMPI, https://github.com/simgrid/SMPI-proxy-apps/