Simulation-based Optimization and Sensibility Analysis of MPI Applications: Variability Matters
Tom Cornebize, Arnaud Legrand
Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France
fi[email protected]
Abstract
Finely tuning MPI applications and understanding the influence of key parameters (number of processes, granularity, collective operation algorithms, virtual topology, and process placement) is critical to obtain good performance on supercomputers. Given the resources consumed by running applications at scale, doing so solely to optimize their performance is particularly costly. Having inexpensive but faithful predictions of expected performance could be a great help for researchers and system administrators. The methodology we propose decouples the complexity of the platform, which is captured through statistical models of the performance of its main components (MPI communications, BLAS operations), from the complexity of adaptive applications by emulating the application and skipping regular non-MPI parts of the code. We demonstrate the capability of our method with High-Performance Linpack (HPL), the benchmark used to rank supercomputers in the TOP500, which requires careful tuning. We briefly present (1) how the open-source version of HPL can be slightly modified to allow a fast emulation on a single commodity server at the scale of a supercomputer. Then we present (2) an extensive (in)validation study that compares simulation with real experiments and demonstrates our ability to consistently predict the performance of HPL within a few percent. This study allows us to identify the main modeling pitfalls (e.g., spatial and temporal node variability or network heterogeneity and irregular behavior) that need to be considered. Last, we show (3) how our "surrogate" allows studying several subtle HPL parameter optimization problems while accounting for uncertainty on the platform.
1. Introduction

Today, supercomputers with 100,000 cores and more are common, and several machines beyond the 1,000,000-core mark are already in production. These compute resources are interconnected through complex non-uniform memory hierarchies and network infrastructures. This complexity requires careful optimization of application parameters, such as granularity, process organization, or algorithm choice, as these have an enormous impact on load distribution and network usage. Scientific application developers and users often spend a substantial amount of time and effort running their applications at different scales solely to tune parameters for optimizing their performance. Whenever actual performance does not match expectations, it can be challenging to understand whether the mismatch originates from a misunderstanding of the application or from a misconfiguration of the machine. Similar difficulties are encountered when (co-)designing supercomputers for specific applications. A large part of this tuning work could be simplified if a generic and faithful performance prediction tool were available. This article presents a decisive step in this direction.

Several techniques have been proposed to predict the performance of a given application on a supercomputer. A first approach consists in building a mathematical performance model (i.e., an analytic formula) accounting for both platform and application key characteristics. However, it is rarely accurate, except for elementary applications on highly regular and well-provisioned platforms, and can thus merely be used to predict broad trends. A more precise approach consists in capturing a trace of the application at scale and replaying it using a simulator. This is an effective approach for capacity planning, but since the application trace is specific to a given set of parameters (and even specific to a given run for dynamic applications that exhibit non-deterministic behaviors due to, e.g., the use of asynchronous collective operations), it cannot be used to study how application parameters should be set for optimizing performance. The main difficulty resides in capturing and modeling the interplay between the application and the platform while faithfully accounting for their respective complexity. A promising approach recently pioneered in several tools [1, 2, 3] consists in emulating the application in a controlled way so that a platform simulator governs its execution. Although this approach's scalability is a primary concern that has already received lots of attention, the accuracy of the simulation is even more challenging. It is still an open research question, since Engelmann and Naughton [4] report, for example, an error ranging from 20% to 40% for NPB LU when using 128 ranks.

In a previous publication [5], we presented how an application like HPL can be emulated at a reasonable cost on a single commodity server to study scenarios similar to qualification runs of supercomputers for the TOP500 ranking [6]. We also showed how to predict the performance of HPL for a specific set of parameters on a recent cluster (running a thousand MPI ranks) within a few percent of reality. HPL is particularly challenging to study because it implements several custom non-trivial MPI collective communication algorithms to efficiently overlap communications with computations. It is also particularly sensitive to platform variability (both spatial and temporal).
In this article, we conduct an extensive validation study. We show that our approach allows us to consistently predict the real-life performance of HPL within a few percent regardless of its input parameters, thereby showing that application parameters can be tuned fully in simulation. Throughout this validation, which spanned over two years, we also highlight key issues that may arise when modeling the platform and should be carefully addressed to obtain reliable predictions. Last, given the sensibility of applications to the variability of computing and communication resources, we showcase how to conduct what-if performance analyses of HPC applications in a capacity planning context.

This article is organized as follows: Section 2 presents the main characteristics of the HPL application and provides information on how the runs are conducted on modern supercomputers. In Section 3, we briefly present the simulator we used for this work, SimGrid/SMPI, the modifications of HPL that were required to obtain a scalable simulation, and some initial validation results presented in an earlier work [5]. These results highlight the importance of modeling both spatial and temporal variability. In Section 4, we compare simulation results with real experiments through two typical HPL performance studies that cover a wider range of application parameters. Section 5 presents how our HPL surrogate can be used to study and possibly optimize the performance of HPL in the presence of uncertainty on the platform. Section 6 discusses related work and explains how our approach compares with other approaches. Section 7 concludes by discussing future work.
2. Background on High-Performance Linpack
HPL implements a matrix factorization based on a right-looking variant of the LU factorization with row partial pivoting and allows for multiple look-ahead depths. In this work, we use the freely-available reference implementation of HPL [7], which relies on MPI, and from which most vendor-specific implementations (e.g., from Intel or ATOS) have been derived. Figure 1 illustrates the principle of the factorization, which consists of a series of panel factorizations, each followed by an update of the trailing sub-matrix. HPL uses a two-dimensional block-cyclic data distribution of A, which allows for a smooth load-balancing of the work across iterations.

[Figure 1: Overview of High-Performance Linpack: allocate and initialize A; then, for each panel of NB columns: allocate the panel, factor the panel, broadcast the panel, and update the trailing sub-matrix.]

The sequential computational complexity of this factorization is $\mathrm{flop}(N) = \frac{2}{3}N^3 + 2N^2 + O(N)$, where N is the order of the matrix to factorize. The time complexity on a P × Q processor grid can thus be approximated by

$$T(N) \approx \frac{\frac{2}{3}N^3 + 2N^2}{P \cdot Q \cdot w} + \Theta((P+Q) \cdot N^2),$$

where w is the flop rate of a single node and the second term corresponds to the communication overhead, which is influenced by the network capacity and many configuration parameters of HPL. Indeed, HPL implements several custom MPI collective communication algorithms to efficiently overlap communications with computations. The main parameters of HPL are thus:

• N is the order of the square matrix A.

• NB is the "blocking factor", i.e., the granularity at which HPL operates when panels are distributed or worked on. This parameter influences the efficiency of the dgemm BLAS kernel, which is the kernel used in the sub-matrix updates, but also the efficiency of MPI communications.

• P and Q denote the number of process rows and process columns. For this algorithm, the total amount of data transfers is proportional to (P + Q) · N², which generally favors virtual topologies where P and Q are approximately equal.

• RFACT determines the panel factorization algorithm. Possible values are Crout, left-, or right-looking.

• SWAP specifies the swapping algorithm used while pivoting. Two algorithms are available: one is based on a binary exchange (along a virtual tree topology) and the other one is based on a spread-and-roll (with a higher number of parallel communications). HPL also provides a panel-size threshold triggering a switch from one variant to the other.
• BCAST sets the algorithm used to broadcast a panel of columns over the process columns. Legacy versions of the MPI standard only supported non-blocking point-to-point communications, which is why HPL ships with in total six self-implemented variants to overlap the time spent waiting for an incoming panel with updates to the trailing matrix: ring, ring-modified, 2-ring, 2-ring-modified, long, and long-modified. The modified versions guarantee that the process right after the root (i.e., the process that will become the root in the next iteration) receives data first and does not further participate in the broadcast. This process can thereby start working on the panel as soon as possible. The ring and 2-ring versions each broadcast along the corresponding virtual topologies, while the long version is a spread-and-roll algorithm where messages are chopped into Q pieces. This generally leads to better bandwidth exploitation. The ring and 2-ring variants rely on MPI_Iprobe, meaning they return control if no message has been fully received yet, hence facilitating partial overlap of communication with computations. In HPL 2.1 and 2.2, this capability has been deactivated for the long and long-modified algorithms. A comment in the source code states that some machines apparently get stuck when there are too many ongoing messages.

• DEPTH controls how many iterations of the outer loop can overlap with each other. As indicated in the HPL documentation, a depth equal to 1 often gives better results than a depth equal to 0 for large problem sizes, but a look-ahead of depth equal to 3 and larger is not expected to bring any improvement.

All the previously listed parameters interact in unique ways with the interconnection network capability and the MPI library to influence the overall performance of HPL, which makes it very difficult to predict precisely. To illustrate the diversity of real-life configurations, we report in Table 1 a few ones used for the TOP500 ranking that some colleagues agreed to share with us.

Table 1: Typical HPL configurations.

               Stampede@TACC     Theta@ANL
  Rpeak        –                 –
  N            –                 –
  NB           –                 –
  P × Q        77 × 78           32 × –
  RFACT        Crout             Left
  SWAP         Binary-exch.      Binary-exch.
  BCAST        Long modified     2-ring modified
  DEPTH        –                 –
  Rmax         –                 –
  Duration     2 hours           28 hours
  Memory       120 TB            559 TB
  MPI ranks    1/node            1/node

The performance typically achieved by supercomputers (Rmax) needs to be compared to the much larger peak performance (Rpeak). The difference can be attributed to the node usage, to the MPI library, to the network topology that may be unable to deal with the intense communication workload, to load imbalance among nodes (e.g., due to a defect, system noise, ...), to the algorithmic structure of HPL, etc. All these factors make it difficult to know precisely what performance to expect without running the application at scale. Due to the complexity of both HPL and the underlying hardware, simple performance models (analytic expressions based on N, P, Q and estimations of platform characteristics, as presented above) may at best be used to determine broad trends but can by no means accurately predict the performance for each configuration (e.g., consider the exact effect of HPL's six different broadcast algorithms on network contention). Additionally, these expressions do not allow engineers to improve the performance through actively identifying performance bottlenecks. For complex optimizations such as partially non-blocking collective communication algorithms intertwined with computations, a very faithful model of both the application and the platform is required.
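To make this concrete, the following sketch (our illustration, not part of HPL) evaluates the simple analytic model given above; the flop rate w and the lumped communication constant c are hypothetical placeholders that would have to be calibrated on the target machine.

    #include <stdio.h>

    /* Rough analytic estimate of HPL's execution time (in seconds) on a
     * P x Q process grid: the compute term of the formula above plus a
     * communication term proportional to (P + Q) * N^2. */
    static double hpl_time_estimate(double N, int P, int Q,
                                    double w /* flop rate per node   */,
                                    double c /* lumped comm. constant */) {
        double flops = (2.0 / 3.0) * N * N * N + 2.0 * N * N;
        return flops / ((double)P * Q * w) + c * (P + Q) * N * N;
    }

    int main(void) {
        /* Hypothetical calibration values, for illustration only. */
        double w = 5e11;  /* 500 GFlop/s per node */
        double c = 1e-12; /* seconds per (P+Q)*N^2 unit */
        for (double N = 1e5; N <= 5e5; N += 1e5)
            printf("N = %.0f -> T = %.0f s\n",
                   N, hpl_time_estimate(N, 32, 32, w, c));
        return 0;
    }

Such a model reveals broad trends (cubic growth with N, the (P + Q) communication term) but, as stressed above, it cannot capture the effect of the broadcast algorithms or of network contention.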
3. Emulating HPL with SimGrid/SMPI
In this section, we present an overview of SimGrid/SMPI, the modifications of HPL required to obtain a scalable simulation, and a first validation of the simulations. The results of Sections 3.1–3.4 previously appeared in a conference publication [5] and are included here for completeness.
3.1. SimGrid/SMPI

SimGrid [8] is a flexible and open-source simulation framework that was initially designed in 2000 to study scheduling heuristics tailored to heterogeneous grid computing environments but was later extended to study cloud and HPC infrastructures. The main development goal for SimGrid has been to provide validated performance models, particularly for scenarios making heavy use of the network. Such a validation usually consists of comparing simulation predictions with real experiments to confirm or debunk and improve network and application models.

SMPI, a simulator based on SimGrid, has been developed and used to simulate unmodified MPI applications written in C/C++ or FORTRAN [1]. To this end, SMPI maps every MPI rank of the application onto a lightweight simulation thread. These threads are then run in mutual exclusion and controlled by SMPI, which measures the time spent computing between two MPI calls. This duration is injected into the simulator as a simulated delay, scaled up or down depending on the speed difference between the simulated machine and the simulation machine.

The complex optimizations done in real MPI implementations need to be considered when predicting the performance of applications. For instance, the "eager" and "rendezvous" protocols are selected based on the message size, with each protocol having its own synchronization semantics, which strongly impacts performance. Another problematic issue is to model network topologies and contention. SMPI relies on SimGrid's communication models, where each ongoing communication is represented as a single flow (as opposed to a collection of individual packets). Assuming steady state, contention between active communications can then be modeled as a bandwidth sharing problem while accounting for non-trivial phenomena (e.g., cross-traffic interference [9]). The time spent in MPI is thus derived from the SMPI network model that accounts for MPI peculiarities (depending on the message size), the machine topology, and the contention with all other ongoing flows. For more details, we refer the interested reader to [1].
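To give a flavor of such message-size-dependent models, here is a minimal sketch of a piecewise-linear communication time model of the kind SMPI calibrates; the breakpoint and the latency/bandwidth pairs are hypothetical values, not actual SMPI calibration results.

    /* Time (in seconds) to transfer 'size' bytes, with a protocol switch
     * at 64 KiB: below it, an "eager"-like latency/bandwidth pair; above
     * it, a "rendezvous"-like pair. Real calibrations typically fit
     * several such segments from ping-pong measurements. */
    static double msg_time(double size) {
        if (size < 65536.0)
            return 1.0e-6 + size / 5.0e9;  /* 1 us latency, 5 GB/s  */
        else
            return 5.0e-6 + size / 11.0e9; /* 5 us latency, 11 GB/s */
    }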
3.2. Emulating HPL

HPL relies heavily on BLAS kernels such as dgemm (for matrix-matrix multiplication). Since these kernels' output does not influence the control flow, simulation time can be reduced considerably by substituting these function calls with a performance model of the respective kernel. Figure 2 shows an example of this macro-based mechanism that allows us to keep HPL code modifications to an absolute minimum.

    #define HPL_dgemm(layout, TransA, TransB, M, N, K,              \
                      alpha, A, lda, B, ldb, beta, C, ldc) ({       \
        double size = ((double)M)*((double)N)*((double)K);          \
        double expected_time = 1.029e-11*size + 1.981e-12;          \
        smpi_execute_benched(expected_time);                        \
    })

Figure 2: Non-intrusive macro replacement with a very simple performance model.

The 1.029e-11 value represents the inverse of the flop rate for this compute kernel and is obtained by benchmarking the target nodes. The kernel's estimated duration is calculated based on the given parameters and passed on to smpi_execute_benched, which advances the simulated clock of the executing rank by this estimate. Skipping compute kernels makes the content of output variables invalid, but in simulation, only the application's behavior and not the correctness of computation results is of concern. These minor modifications to the original source code (HPL comprises 16K lines of ANSI C over 149 files; our modifications only changed 14 files, with 286 line insertions and 18 deletions) enabled us to simulate the configuration used for the Stampede cluster in 2013 for the TOP500 ranking (see Table 1) in less than 62 hours and using 19 GB on a single node of a commodity cluster (instead of 120 TB of RAM over a 6006-node supercomputer). Further speed-up could probably be obtained by modifying HPL further, but our primary interest in this article is the prediction quality.

Most BLAS kernels have several parameters from which a straightforward model can generally easily be identified (e.g., proportional to the product of the parameters), but refinements including the individual contribution of each parameter as well as the spatial and temporal variability of the operation are also possible. In the following, all the simulations have been done with the following model for the dgemm kernel. For each processor p:

$$\mathrm{dgemm}_p(M, N, K) \sim \mathcal{H}(\mu_p, \sigma_p)$$
$$\mu_p = \alpha_p \cdot MNK + \beta_p \cdot MN + \gamma_p \cdot MK + \delta_p \cdot NK + \epsilon_p \qquad (1)$$
$$\sigma_p = \omega_p \cdot MNK + \psi_p \cdot MN + \phi_p \cdot MK + \tau_p \cdot NK + \rho_p$$

where H(µ, σ) denotes a half-normal random variable with parameters µ and σ accounting for the expectation and the standard deviation. The dependency on p indicates that the coefficients (α_p, β_p, ..., ρ_p) can be specific to each node, i.e., the aforementioned spatial variability. The σ_p parameter allows to account for (short-term) temporal variability, i.e., to model the fact that the durations of two successive calls to dgemm with the same parameters M, N, K are never identical. Modeling this variability is important as it may propagate through the communication pattern of the application (late sends and late receives). Last, the rationale for using a half-normal distribution rather than a normal distribution stems from the natural positive skewness of compute kernel durations. This model is much more complex than the simple deterministic one used in Figure 2 but, as we will explain, this complexity is key to obtaining good performance predictions [5]. However, the durations of all other kernels (which represent a negligible fraction of the overall execution time) have been modeled with a simple deterministic and homogeneous model, such as daxpy(N) = αN + β.
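The following sketch shows one possible implementation of this stochastic model; the construction of H(µ, σ) as a shifted and rescaled half-normal with expectation µ and standard deviation σ is our reading of the notation, and the coefficient values are placeholders for node-specific calibration results.

    #include <math.h>
    #include <stdlib.h>

    /* Shifted/rescaled half-normal draw with mean mu and std dev sigma:
     * |Z| (Z standard normal, via Box-Muller) has mean sqrt(2/pi) and
     * std dev sqrt(1 - 2/pi); we standardize and rescale it. */
    static double half_normal(double mu, double sigma) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double z  = fabs(sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2));
        double m0 = sqrt(2.0 / M_PI), s0 = sqrt(1.0 - 2.0 / M_PI);
        return mu + sigma * (z - m0) / s0;
    }

    /* Node-specific regression coefficients of Eq. (1), calibrated offline. */
    typedef struct {
        double a_mnk, a_mn, a_mk, a_nk, a_0; /* terms of mu_p    */
        double s_mnk, s_mn, s_mk, s_nk, s_0; /* terms of sigma_p */
    } dgemm_model;

    /* Draw one dgemm duration for the node described by 'm'. */
    static double dgemm_duration(const dgemm_model *m,
                                 double M, double N, double K) {
        double mu = m->a_mnk*M*N*K + m->a_mn*M*N + m->a_mk*M*K
                  + m->a_nk*N*K + m->a_0;
        double sg = m->s_mnk*M*N*K + m->s_mn*M*N + m->s_mk*M*K
                  + m->s_nk*N*K + m->s_0;
        return half_normal(mu, sg);
    }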
3.3. Experimental Settings

To evaluate the soundness of our approach, we compare several real executions of HPL with simulations using the previous models. We used the Dahu cluster from the Grid'5000 testbed. It has 32 nodes connected through a single switch by 100 Gbit/s Omni-Path links. Each node has two Intel Xeon Gold 6130 CPUs with 16 cores per CPU, and we disabled hyperthreading. We used HPL version 2.2 compiled with GCC version 6.3.0. We also used the libraries OpenMPI version 2.0.2 and OpenBLAS version 0.3.1. Unless specified otherwise, HPL executions were done using a block size of 128, a matrix of varying size, one single-threaded MPI rank per core, a look-ahead depth of 1, and the increasing-2-ring broadcast with the Crout panel factorization algorithms, as this is the combination that led to the best performance overall. Although this machine is much smaller than top supercomputers, faithfully simulating an HPL execution with such settings is quite challenging:

• We used one rank per core to obtain a higher number (1024) of MPI processes. This configuration is more difficult than simulating one rank per node, as (1) it increases the amount of data transferred through MPI, and (2) the performance is subject to memory interference and network heterogeneity (intra-node communications vs. inter-node communications).

• We used a much smaller block size than what is commonly used, which leads to a higher number of iterations and hence more complex communication patterns.

• We used relatively small input matrices, which reduces the makespan and makes good predictions harder to obtain.
3.4. Initial Validation Results

We now present a first quantitative comparison of the prediction with the reality on a typical scenario. We report in Figure 3 the GFlop/s rate reported by HPL when varying the matrix size N. Real executions are depicted in solid black, and the natural variability of the overall performance is illustrated by reporting eight runs of HPL for each matrix size.

[Figure 3: HPL performance: predictions (dashed lines) vs. reality (solid lines). Panels: (a) homogeneous & deterministic (+33%); (b) heterogeneous & deterministic (+9%); (c) heterogeneous & stochastic (−1%). Axes: matrix rank vs. GFlop/s.]

The dashed line (a), on top, is our first attempt to simulate HPL with the naive model (homogeneous and deterministic for both the kernels and the network) illustrated in Figure 2. This model overestimates HPL performance by more than 30%. Modeling the heterogeneity of dgemm (i.e., introducing the dependency on p for dgemm as done in Eq. (1) but without the temporal variability induced by σ_p) significantly increases the realism of the simulation, as the performance is then overestimated by only 9% (dashed line (b)). Finally, we found that adding the temporal variability is the key ingredient to obtain the last bit of realism. The prediction using the full-fledged model (dashed line (c)) is extremely close to reality, as it slightly underestimates the performance by only a few percent, and even as little as 1% for the larger matrices.

As illustrated in Figure 3 and explained in our previous work [5], accurate predictions require careful modeling of both spatial and temporal variability, as they appear to have a very strong effect on HPL performance. This is somewhat expected since HPL is an iterative program that synchronizes through the broadcast of factorization panels. A single slower or late process will eventually delay all the other ones. In the scenario presented in Figure 3, a large fraction of the overall execution time is spent in MPI communications, but foremost in synchronizations (induced by late sends and late receives) rather than in actual data transfers. As a consequence, careful modeling of computations is essential, while even careless modeling of the network was enough to obtain good predictions here. This article presents an extensive (in)validation study that demonstrates the importance of careful modeling of the whole platform.

3.5. Accounting for Platform Evolution

This validation study has been carried out over several years (from 2018 to 2020). Despite our efforts to keep the experimental setup stable for the sake of reproducibility, the platform has evolved. The Linux kernel had a minor update, from version 4.9.0-6 to version 4.9.0-13, and the BIOS and firmware of the nodes have been upgraded. During this time frame, the cluster has also suffered from hardware issues, like a cooling malfunction on four of its nodes and several faulty memory modules that had to be changed. This malfunction had an enormous impact on the performance of HPL, which significantly complicated our validation study but also makes it more meaningful, as it has been conducted on a particularly challenging setup.

Our simulation approach makes it possible to predict the performance of HPL for a new platform state by merely making a new calibration whenever a significant change is detected.
This ability to reflect a platform change in simulation is illustrated in Figure 4 which, similarly to Figure 3 (acquired in March 2019), showcases the influence of the matrix size on the performance but at different periods. The left plot represents the normal state of the cluster (in September 2020), whereas the right plot has been obtained (in March–April 2019) when 4 of the 32 nodes had a cooling issue which lowered their performance by about 10%. In all cases, we consistently predict performance within a few percent, and performing a new dgemm calibration on these four nodes was all that was needed to reflect this platform change in the simulation.

[Figure 4: HPL performance: predictions vs. reality (effect of the cooling issue on the nodes dahu-{13,14,15,16}). Panels: normal cooling vs. problematic cooling; axes: matrix rank vs. GFlop/s.]

This result illustrates both the faithfulness of our simulations and a potential use case for predictive simulations: a discrepancy between the reality and the predictions can sometimes indicate a real issue on the platform (similar situations have already been reported in [1]).
4. Comprehensive Validation Through HPL Performance Tuning
This section reports a few typical performance studies involving HPL through both real experiments and simulations. The comparison of both approaches allows us (1) to cover a broader range of parameters than solely the matrix size as done in our earlier work, (2) to evaluate how faithful to reality our simulations are even in suboptimal configurations, and (3) to report the main difficulties encountered when conducting such a study.

4.1. Evaluating the Influence of the Geometry

Figure 5(b) illustrates the influence on the performance of the geometry of the virtual topology (P and Q) used in HPL. As expected, geometries that are too distorted lead to degraded performance. All the HPL parameters were fixed (the matrix rank was held constant and the other parameters are the same as in Section 3.4) except for the geometry, as we evaluate all the pairs (P, Q) such that P × Q = 960. We used only 30 nodes instead of 32 to cover a larger number of geometries, as 960 has more divisors than 1024.

[Figure 5: The first (optimistic) network calibration gave poor predictions for very elongated geometries while the improved calibration provides perfect predictions. (a) Effect of the two MPI calibration methods: the optimistic method merely samples small message sizes and extrapolates for larger sizes; unfortunately, for messages larger than 160 MB, the effective bandwidth significantly drops. The more realistic calibration measures MPI communication durations for much larger messages while injecting some computation load in the background. (b) HPL performance: predictions vs. reality (testing all the possible geometries for 960 MPI ranks); axes: P vs. GFlop/s.]

As in all our previous studies, we report both the predicted performance and the one measured in reality. Like the comparisons presented in the previous section, the simulation was done with the dgemm model from Eq. (1) (stochastic, heterogeneous, and polynomial) and the simplest linear models for the other kernels. In our first simulation attempt, which relied on a relatively simple network model (deterministic yet piecewise-linear to account for protocol switches) depicted on the leftmost plot of Figure 5(a), we obtained the unsatisfying orange line on top of Figure 5(b) for the prediction. The simulations with the smallest values of P had relatively large prediction errors, with a systematic over-estimation that reaches up to +50% for the most elongated geometries. A qualitative comparison of the execution traces obtained in reality and in simulation showed that the duration of the broadcast phases was greatly underestimated in simulation. We found out that with such elongated geometries, the message size is significantly larger than what we had used in our calibration, and the performance surprisingly and significantly drops for such sizes (compare with the rightmost plots of Figure 5(a)). This performance drop is explained by a poor optimization of the DMA locking mechanism in the Infiniband network layer [10]. A similar performance drop also happens for intra-node communications that poorly manage the caches above a given size. Furthermore, the communication patterns generated by HPL during the ring broadcast are significantly impacted by the busy waiting of HPL, which intensively calls MPI_Probe and dgemm on small sub-matrices. Our initial procedure for calibrating the network did not capture this phenomenon since we did not inject any additional CPU load.
[Figure 6: Influence of the HPL configuration on the performance (factorial experiment): GFlop/s in reality vs. simulation for each combination of block size (128 or 256), depth (0 or 1), broadcast algorithm (1rg, 1rM, 2rg, 2rM, Lng, LnM), and swap algorithm (bin-exch, long, mix); prediction errors are flagged as 0%–5%, 5%–10%, or >10%. Parameters have been reorganized based on their influence on performance to improve readability. The boxed configuration corresponds to the one boxed in Figure 3.]
We addressed this problem by improving our network calibration procedure: (1) we use a distinct model for local and remote communications, (2) we sample the message sizes in a larger interval, and (3) we add calls to dgemm and MPI_Iprobe between each call to MPI_Send and MPI_Recv. The goal was to make the calibration environment more similar to what happens in HPL; the sketch below illustrates the idea.
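A minimal sketch of such a calibration loop is given below (our illustration of the idea, not the actual calibration code): rank 0 times round-trips to rank 1 while both ranks interleave MPI_Iprobe calls and a small compute kernel, mimicking HPL's busy waiting.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void fake_dgemm(double *buf, int n) { /* background CPU load */
        for (int i = 0; i < n; i++)
            buf[i] = buf[i] * 1.000001 + 1e-9;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int size = 1 << 20;                  /* message size under test */
        char *msg = malloc(size);
        double *load = calloc(4096, sizeof(double));
        for (int iter = 0; iter < 10; iter++) {
            int flag;
            MPI_Status st;
            /* Busy waiting similar to HPL's broadcast implementation. */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
            fake_dgemm(load, 4096);
            if (rank == 0) {
                double t = MPI_Wtime();
                MPI_Send(msg, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
                printf("%d bytes: %g s round-trip\n", size, MPI_Wtime() - t);
            } else if (rank == 1) {
                MPI_Recv(msg, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(msg, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        free(msg); free(load);
        MPI_Finalize();
        return 0;
    }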
The resulting network model is illustrated in the rightmost plots of Figure 5(a). This more realistic network model solved every previous misprediction and allows us to produce very faithful simulations (purple line on Figure 5(b)), which are now within a few percent of the reality regardless of the geometry. This figure also illustrates the influence of the geometry on the overall performance, since there is almost a factor of ten between the worst configuration and the best one. Although it is not surprising that the geometries which are as square as possible lead to better performance, as they minimize the overall amount of data movements, it is interesting to observe the asymmetric role of P and Q in the overall performance (smaller values of P lead to better performance), which can be explained by the structure of the collective operations but requires a close look at the code.

4.2. Optimizing the HPL Configuration Through a Factorial Experiment

Although the geometry is among the most important parameters to tune, six other parameters control the behavior of HPL. In Figure 6, we compare the performance reported by HPL when fixing the matrix rank and varying the following parameters: block size (128 or 256), depth (0 or 1), broadcast (the six available algorithms), and swap (the three available algorithms). The geometry was fixed to
P × Q = 32 × 32 = 1024, as it is optimal (the simpler calibration procedure and the network model depicted on the leftmost plot of Figure 5(a) were thus used). The parameters pfact and rfact (panel factorization) were respectively fixed to Crout and Right, as they had nearly no influence on HPL performance in our early experiments.

Figure 6 depicts the 72 parameter combinations we tested. These parameters account for up to 30% of variability in the performance, which is less important than the geometry but is still quite significant. For 61 of them, the prediction error is lower than 5%. Only two combinations have shown a large error of approximately 15%, obtained with a block size of 256, a depth of 1, and either the long or the mix swap algorithm. This demonstrates the soundness of our approach, as our predictions are reasonably accurate most of the time. This experiment confirms that, although the prediction of HPL performance for a given parameter combination has a systematic bias, the error remains within a few percent most of the time. Therefore, this surrogate is good enough for parameter tuning and should be considered when preparing a large-scale run.

While testing all the parameter combinations is the safest method to discover the combination that provides the highest performance, its cost can be prohibitive due to the high number of combinations. An alternative often used in practice is to explore only a small subset of the parameter space and to use an analysis of variance (ANOVA) to identify the parameters with the most substantial effect on performance and then select the appropriate combination. We applied this procedure on samples of both datasets (the one obtained from real runs and the one obtained in simulation). In both cases, the two parameters with the highest effect were the block size NB and the depth, as shown in Figure 6, followed by bcast and swap. The best combinations selected in both cases were also identical, demonstrating once again the faithfulness of our simulation approach and how it can be used to reduce the experimental cost of parameter tuning.

Accurately predicting the performance of an application is not a trivial task. Discrepancies between reality and simulation can have multiple causes: the platform may have changed (e.g., the cooling issue that affected four nodes in Section 3.5), the model could be inaccurate (e.g., the homogeneous and deterministic dgemm model is too simple, as in [5]) or not correctly calibrated (e.g., the calibration procedure does not cover the appropriate parameter space, or the experimental conditions are too different, as in Section 4.1). As expected in any serious investigation of model validity, our validation study is not a mere collection of positive cases. Instead, it is the result of a thorough (we extensively covered the HPL parameter space) attempt to invalidate our model, as well as explanations of how we did so. By meticulously overcoming each of these issues, we have demonstrated the ability of our approach to produce very faithful predictions of HPL performance on a given platform.
5. Sensibility Analysis in What-if Scenarios
We have shown that many typical HPL case studies can be conducted in simulation. However, their conclusions (optimal geometry and parameters) are specific to the cluster we used, and they require a precise model of several aspects of the target cluster, which may not be available at early experimental stages. In particular, only a few cluster nodes may be available at first, and the whole cluster model should then be constructed from a limited set of observations and carefully extrapolated. This section shows how typical what-if simulation studies can be conducted given such uncertainty. Section 5.1 presents a generative model of node performance that can easily be fit from daily measurements and used to produce a similar platform. This model is used to quantify the impact on the overall performance of the temporal variability of the dgemm kernel in Section 5.2 and of the spatial variability of nodes in Section 5.3. In particular, we show how to study the efficiency of a simple slow node eviction strategy. Finally, we study in Section 5.4 the influence of the physical network topology on the overall performance. Most of these studies would be particularly difficult to conduct through real experiments because of the difficulty of finely controlling the platform.
5.1. A Generative Model of Node Performance

As we have seen in Section 3.2, the performance of nodes exhibits several kinds of variability: (i) a spatial variability (between nodes), (ii) a "short-term" temporal variability (the one experienced within an HPL run), but also (iii) a "long-term" temporal variability (from one day to another). As illustrated in Section 3.4, accounting for the first two kinds of variability is essential, but during our investigation of the simulation validity, which spanned over several months, we also had to deal with the fact that node performance could significantly vary from one day to another, thereby sometimes making our comparisons between a real experiment and a simulation driven by a model obtained from past measurements irrelevant.

This section explains how all sources of variability can be accounted for in a single unified model. From our observations, we assume that on a given node p and a given day d, the duration of the dgemm kernel can be modeled as follows:

$$\forall M, N, K, \quad \mathrm{dgemm}_{p,d}(M, N, K) \sim \mathcal{H}\left(\alpha_{p,d} \cdot MNK + \beta_{p,d},\; \gamma_{p,d} \cdot MNK\right) \qquad (2)$$

Compared to model (1), this model includes the daily variability but drops the complexity of a full-fledged polynomial. Such complexity may be important whenever trying to model a particular platform. However, when performing a sensibility analysis, a simpler model is preferred, especially as not all terms of the polynomial may be statistically significant. In this model, the short-term temporal variability stems from the γ_{p,d} term, while the average performance of the node stems from the α_{p,d} and β_{p,d} terms, which we gather in a single 3-dimensional vector:

$$\mu_{p,d} = (\alpha_{p,d}, \beta_{p,d}, \gamma_{p,d}) \qquad (3)$$

Now, since every machine is unique, it is natural to assume that for each machine:

$$\forall d, \quad \mu_{p,d} \sim \mathcal{N}(\mu_p, \Sigma_T) \qquad (4)$$

In this model, µ_p accounts for the average performance of the machine p, while Σ_T accounts for its day-to-day variability. From our observations, we had no particular reason to assume that this variability was different from one machine to another; hence, Σ_T is not indexed by p but global to all machines. However, the parameters α_{p,d}, β_{p,d}, γ_{p,d} are generally correlated to each other, hence Σ_T is a full covariance matrix to account for interactions. The choice of a normal distribution is natural since it is the simplest distribution that accounts for a specific mean and variance, but we will discuss its relevance later in this section. Finally, we need to account for the spatial variability, which we propose to model as follows:

$$\forall p, \quad \mu_p \sim \mathcal{N}(\mu, \Sigma_S) \qquad (5)$$

Again, in such a model, µ accounts for the machines' average performance while Σ_S accounts for the (weak) heterogeneity. This hierarchical model is depicted in Figure 7.

The relevance of model (2) has already been illustrated in Sections 3.4, 3.5, and 4, but the relevance of models (4) and (5) requires some attention. Figure 8(a) represents the empirical distribution of µ_{p,d} = (α_{p,d}, β_{p,d}, γ_{p,d}) (the result of the linear regression) for the 32 nodes of the Dahu cluster on 40 different days from November 2019 to February 2020. The distribution for each node appears approximately normal and passed a Shapiro-Wilk normality test. Although the distribution of the β_{p,d} appears slightly skewed toward larger values and one of the nodes (the one with the largest α_{p,d}) stands out, there is no good reason for using a more complex distribution than a Gaussian one.
Although the correlation between α, β, and γ is very weak, it appeared to be statistically significant (most ellipses are slightly tilted toward the North-East); hence, a full covariance matrix is needed (at least for Σ_T).

[Figure 7: Generative model of kernel duration accounting for the spatial (Σ_S), long-term (Σ_T), and short-term variability (γ_{p,d}). The shaded node represents observed variables and the diamond node deterministic variables, while non-shaded nodes represent latent variables. The solid node is the variable which is estimated when conducting (in)validation studies, while the dashed ones are useful when conducting sensibility analyses and extrapolating to a hypothetical cluster.]

It is thus easy to estimate µ_p and Σ_T by averaging over the µ_{p,d} of each node, and then to estimate µ and Σ_S by averaging over all the nodes. This moment-matching method is simple and provides very good estimates for µ, Σ_T, and Σ_S because we have enough measurements at our disposal and because it is particularly suited to the Gaussian modeling assumption. Should more complex models (e.g., a mixture to account for "outlier" nodes or a SkewNormal distribution to account for the distribution's skewness) be used, a general Bayesian sampling framework like STAN [11] would be more adapted. Such frameworks make it easy to specify hierarchical generative models like the one presented in Figure 7 and to draw samples from the posterior distribution of µ, Σ_T, and Σ_S, which can be used to generate realistic µ_{p,d} values for a new hypothetical cluster.

Such a process is depicted in Figure 8(b), where hypothetical regression parameters for 16 nodes have been generated.

[Figure 8: Distribution of the regression parameters for around 20 dgemm calibrations made on each of the 32 nodes; each color/ellipse corresponds to a different CPU. (a) Distribution of α, β, and γ (observations from November 2019 to February 2020). (b) Distribution of α, β, and γ (synthetic data for 16 CPUs).]

Comparing such synthetic data with the original measurements calls for two remarks. First, the spread of the α_p seems a bit overestimated (the spread along the x-axis is larger). This can be explained by the fact that one of the nodes seemed to be significantly slower (with a much larger α_p), which artificially increased the spatial variability. Second, as expected from a Gaussian model, the distributions of the β_{p,d} are symmetrical, whereas there was a slight negative skew in the original samples, but this should be of little significance for our study. The distributions of the γ_{p,d}, however, are particularly realistic.

We also illustrate the generality of this model with the data from Figure 9(a). These measurements were obtained from October to November 2019, when the cluster was less stable and some nodes particularly misbehaved.

[Figure 9: Distribution of the regression parameters for around 20 dgemm calibrations made on each of the 32 nodes; 4 of these nodes had a cooling problem, leading to longer and more variable durations. Each color/ellipse corresponds to a different group of CPUs. (a) Distribution of α, β, and γ (observations from October to November 2019). (b) Distribution of α, β, and γ (synthetic data for 16 CPUs).]
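For instance, synthetic clusters can be drawn as in the sketch below. For readability it assumes diagonal covariance matrices (the paper's model estimates full Σ_T and Σ_S to capture the correlation between α, β, and γ), and the numeric inputs are placeholders for the moment-matching estimates.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { double alpha, beta, gamma; } coefs;

    /* One Gaussian draw (Box-Muller). */
    static double gauss(double mean, double sd) {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
        return mean + sd * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }

    /* Generate daily dgemm coefficients for a synthetic cluster following
     * the hierarchy of Figure 7: node means mu_p drawn around mu (Eq. (5)),
     * then daily coefficients mu_{p,d} drawn around mu_p (Eq. (4)).
     * sd_S and sd_T hold per-component standard deviations only. */
    static void generate_cluster(int nodes, int days,
                                 coefs mu, coefs sd_S, coefs sd_T,
                                 coefs out[nodes][days]) {
        for (int p = 0; p < nodes; p++) {
            coefs mu_p = { gauss(mu.alpha, sd_S.alpha),    /* Eq. (5) */
                           gauss(mu.beta,  sd_S.beta),
                           gauss(mu.gamma, sd_S.gamma) };
            for (int d = 0; d < days; d++) {               /* Eq. (4) */
                out[p][d].alpha = gauss(mu_p.alpha, sd_T.alpha);
                out[p][d].beta  = gauss(mu_p.beta,  sd_T.beta);
                out[p][d].gamma = gauss(mu_p.gamma, sd_T.gamma);
            }
        }
    }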
Three nodes (in orange, hence a total of 6 CPUs) are distinguished from the 28 others (in green) and have lower performance (higher values of α, β, and γ), and one node (in blue) is particularly unstable. Although this last node may be considered too abnormal to represent anything, it would be reasonable to assume that a larger cluster would present at least the two kinds of behaviors (green for stable nodes, and orange for slower nodes). The higher layer of the model in Figure 7 should then be replaced by a mixture of normal distributions (whose weights would then be sampled from a Dirichlet distribution). Again, hypothetical regression parameters for 16 CPUs have been generated with such a process in Figure 9(b); they are very similar, although not identical, to the original measurements.

Overall, this model is therefore of excellent quality and can be used to generate large configurations very easily and to evaluate the influence of different kinds of variability on the performance of HPL.

5.2. Influence of dgemm's Temporal Variability

In Section 3.4, we highlighted the importance of accounting for the temporal variability of the dgemm kernel to obtain faithful HPL predictions. To the best of our knowledge, HPL developers and experts are often aware of this influence (or at least suspect it). However, they have never fully quantified it, since designing and performing real experiments to evaluate this would be quite difficult. Although increasing this variability would not be too hard, reducing it would be particularly complicated. This can, however, easily be done through simulation using the hierarchical model of the previous section. In our experiments, the order of magnitude of the temporal variability with respect to actual performance (i.e., the ratio between γ_{p,d} and α_{p,d} in Equation (2)) was around 3%. This may be a "normal" value or could be considered too high and possibly improved by better controlling thread mapping or Operating System noise. Such a task can be quite tedious, and knowing beforehand how much performance gain can be expected is thus quite useful. In this section, we study the influence of this variability by generating 10 cluster scenarios using the previous model (as in Figure 8), comprising 1,024 nodes each, but constraining γ_{p,d} to be equal to γ·α_{p,d} with γ ∈ [0, 0.1], which represents the coefficient of variation of the dgemm kernel. We evaluate the performance of HPL with one multi-threaded MPI rank per node, a block size of 512, and a look-ahead depth of 1. We used the increasing-2-ring broadcast with the Crout panel factorization algorithms and P × Q = 8 × 128, and we tested matrix sizes ranging from 100,000 to 500,000. Let us denote by T(N, C_i, γ) the performance of HPL when factorizing a matrix of rank N on cluster C_i with a temporal variability of γ. The overhead for this configuration is the ratio

$$O(N, C_i, \gamma) = \frac{E[T(N, C_i, \gamma)]}{T(N, C_i, 0)} - 1.$$

Each bubble in Figure 10 represents one such overhead. For any γ, this overhead appears to be negligible for small matrices and to increase and flatten when N grows large. In most TOP500 qualification runs, the matrix is made as large as possible, and the overhead would thus appear to grow roughly linearly with γ.
On a new cluster, a simple statistical evaluation of the nodes' performance using the model of Section 5.1 would thus be a good first diagnosis of whether trying to decrease temporal variability is a promising tuning target or not.
[Figure 10: Overhead on HPL duration (%) as a function of the dgemm coefficient of variation (2% to 10%), for several matrix ranks. The overhead appears to be linear in the dgemm temporal variability; although it is negligible for small matrices, it severely inflates for larger matrices.]
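For reference, a tiny helper computing this overhead from simulated durations might look as follows (our illustration):

    /* Overhead O(N, C_i, gamma) as defined above: the mean duration over
     * several stochastic runs with variability gamma, divided by the
     * duration of the deterministic run (gamma = 0), minus one. */
    static double overhead(const double *t_gamma, int runs, double t_zero) {
        double mean = 0.0;
        for (int r = 0; r < runs; r++)
            mean += t_gamma[r];
        mean /= runs;
        return mean / t_zero - 1.0;
    }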
5.3. Influence of the Nodes' Spatial Variability

Although we showed in Section 3.4 that temporal variability could account for about 9% of performance, spatial variability was even more important, as it was responsible for 22% of overhead compared to a fully homogeneous cluster. In practice, the replacement of a few nodes may be possible, but such spatial variability is expected and common [12], and a workaround would have to be found. A common approach consists in dropping out a few of the slowest nodes. Indeed, since the matrix is evenly divided between the nodes, the computation inevitably progresses at the speed of the slowest node. However, removing the slowest nodes also decreases the overall processing capability and impacts the virtual topology's geometry (the P and Q parameters of HPL). Such an adjustment is often done by trial and error and is all the more tricky as temporal variability and uncertainty from real experiments come into play. In this section, we show how such a subtle trade-off can be studied in simulation.

Using the model from Section 5.1, we generate 10 mildly heterogeneous 256-node clusters (i.e., where nodes are similar to the ones of our cluster when operating in the normal state, as in Figure 8(a)) and we study the performance obtained when removing 1 to 16 of the slowest nodes. When removing nodes, the geometry should be adjusted depending on how the number of remaining nodes decomposes into prime factors (a sketch of this enumeration is given at the end of this subsection). As observed in Figure 5(b), having P ≈ Q is generally a good idea to reduce the total amount of communication. However, it may be counter-productive for a given broadcast or swap algorithm that serializes communications. Figure 11 shows the average (over the 10 clusters) overhead for a matrix of rank 250,000 compared to the best performance obtained using the whole cluster. We group the different P × Q decompositions and order them by increasing P. Again, we use broadcast and swap algorithms that are among the best configurations according to the study of Section 4.2.

[Figure 11: Influence of the number of removed nodes on the performance of HPL, for all candidate P × Q geometries. The geometry of the virtual topology is particularly influential, and P × Q configurations with a small P perform significantly better than those with a larger P. Each configuration is summarized through the average overhead over the 10 clusters, and error bars represent a 95% confidence interval.]

It appears that an elongated geometry with a small P now achieves the best trade-off between the total amount of communications and how well they overlap with each other. The optimal configuration for each number of removed nodes is boxed in Figure 11. It reveals that there is not much to gain, probably because of the mild spatial heterogeneity of our cluster, but that optimizing the virtual topology is particularly important. Figure 12 investigates how this overhead for the best geometry and node selection also depends on the matrix rank. It appears that in this scenario, except for very small matrices, removing nodes cannot help improving performance.
Figure 12: Influence of node removal on performance while taking into account the matrixrank. Due to the mild heterogeneity of these scenarios, evicting nodes brings no benefit. rank 200,000 appears to behave differently from what happens for other matrixsizes. This surprising effect probably arises from a subtle combination of matrixsize and virtual topology. We could indeed observe on our cluster that suchconfigurations had a weakly but significantly worse performance than the otherconfigurations. Such interaction also explains why designing a faithful analyticalmodel of HPL is so difficult and why a full simulation of the whole applicationis generally required. Although absolute performance should be taken with agrain of salt when studying such subtle effects, they are easily overlooked whenconducting real experiments. In this particular small scale mild heterogeneityscenario, there is thus no gain in removing nodes but, as illustrated in Figure 13where we increased the spatial variability by a factor of 4, this may be a relevantapproach. A multimodal spatial heterogeneity (as in Figure 9) would certainlylead to much more significant gains. This sensibility analysis shows how, for agiven supercomputer, a simple statistical evaluation of the spatial heterogeneityallows evaluating whether spatial variability is a promising tuning target or not.
[Figure 13: Influence of node removal on performance in a stronger heterogeneity scenario (extrapolation of our test cluster when it had a cooling problem on 4 of its nodes). Removing 6 to 12 nodes out of 256 may bring substantial improvement, and such an optimization would therefore be worth investigating.]

5.4. Influence of the Physical Network Topology

Finally, since the virtual topology and communications appear to significantly influence the overall performance, one may wonder how much the physical topology influences the performance. Indeed, several recent articles [13, 14] report that interconnect networks are often oversized compared to the actual need of applications and that turning off some switches could sometimes go completely unnoticed by end-users. In this section, we consider ten 256-node clusters with variable node performance (as in Figure 8) interconnected by a 2-level fat-tree and quantify by how much performance degrades when the top-tier switches are gradually deactivated. More formally, we use a (2;32,8;1,N;1,8) fat-tree with N ∈ {5, 6, 7, 8}. Figure 14 depicts this degradation as a function of the matrix size. As one could expect, the impact is more significant for smaller matrix sizes (where the execution is more network-bound). Although removing one switch leads to absolutely no visible performance loss, removing two or three switches can have a dramatic effect. Again, such degradation depends on the broadcast and swap algorithms and may be slightly mitigated. To the best of our knowledge, it is the first time such a sensibility analysis is conducted faithfully. Generating random node configurations allows avoiding potential bias, in particular against perfectly homogeneous scenarios. We believe such a tool can be quite useful in the earlier steps of a supercomputer design when performing capacity planning to adjust the network capacity to a given cost and power envelope.
[Figure 14: Influence of the physical topology on the overall performance: overhead on HPL duration (%) as a function of the number of top-level switches removed (1 to 3), for several matrix ranks. It is possible to remove up to 2 of the top-level switches without significantly hurting performance for large matrices; beyond this point, communications become the main performance bottleneck.]
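In SimGrid, such a topology can be described directly in the platform file; the sketch below (with hypothetical host speed, bandwidth, and latency values) describes the full (2;32,8;1,8;1,8) fat-tree using SimGrid's FAT_TREE cluster syntax, whose topo_parameters string follows the same notation as above.

    <?xml version='1.0'?>
    <!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
    <platform version="4.1">
      <!-- 256 nodes in a 2-level fat-tree: 32 nodes per level-1 switch
           and 8 level-1 switches. Deactivating top-level switches
           amounts to lowering the "8" of the third field ("1,8"),
           which the text denotes N. -->
      <cluster id="synthetic" prefix="node-" suffix="" radical="0-255"
               speed="1Gf" bw="12.5GBps" lat="2us"
               topology="FAT_TREE" topo_parameters="2;32,8;1,8;1,8"/>
    </platform>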
6. Related Work
A first approach for estimating the performance of applications like HPL consists in statistically modeling the application as a whole [15]. By running the application several times with small and medium problem sizes (or a few iterations of large problem sizes) and using simple linear regressions, it is possible to predict its makespan for larger sizes with an error of only a few percent and at a relatively low cost. Unfortunately, the predictions are limited to the same application configuration, and studying the influence of the number of rows and columns of the virtual grid, or of the broadcast algorithms, requires a new model and new (costly) runs using the whole target machine. Our attempts to build a black-box analytical model (involving polynomials, inverses, and logarithms of P and Q) of HPL from a limited set of observations always failed to provide a faithful model with decent prediction and extrapolation capabilities. Furthermore, this approach does not allow studying what-if scenarios (e.g., to evaluate what would happen if the network bandwidth was increased or if node heterogeneity was decreased) that go beyond parameter tuning.

Simulation provides the details and flexibility missing from such a black-box modeling approach. Performance prediction of MPI applications through simulation has been widely studied over the last decades, but two approaches can be distinguished in the literature: offline and online simulation.

With the most common approach, offline simulation, a trace of the application is first obtained on a real platform. This trace comprises sequences of MPI operations and CPU bursts and is given as an input to a simulator that implements performance models for the CPUs and the network to derive predictions. Researchers interested in finding out how their application reacts to changes to the underlying platform can replay the trace on commodity hardware at will with different platform models. Most HPC simulators available today, notably BigSim [16], Dimemas [17], and CODES [18], rely on this approach. The main limitation of this approach comes from the trace acquisition requirement. Not only is a large machine required, but the compressed trace of a few iterations (out of several thousands) of HPL typically reaches a few hundred MB, making this approach quickly impractical [19]. Worse, tracing an application provides only information about its behavior for a specific run: slight modifications (e.g., to communication patterns) may make the trace inaccurate. The behavior of simple applications (e.g., stencils) can be extrapolated from small-scale traces [20, 21], but this fails if the execution is non-deterministic, e.g., whenever the application relies on non-blocking communication patterns, which is, unfortunately, the case for HPL.

The second approach discussed in the literature is online simulation. Here, the application is executed (emulated) on top of a platform simulator that determines when each process is run. This approach allows researchers to directly study the behavior of MPI applications, but only a few recent simulators such as SST Macro [2], SimGrid/SMPI [8], and the closed-source xSim [3] support it. To the best of our knowledge, only SST Macro and SimGrid/SMPI are mature enough to faithfully emulate HPL. In this work, we decided to rely on SimGrid as its performance models and its emulation capabilities are quite solid, but the work we present would a priori also be possible with SST.
Note that the HPL emulation we describe in Section 3.2 should not be confused with the application skeletonization [22] commonly used with SST and more recently introduced in CODES. Skeletons are code extractions of the most important parts of a complex application, whereas we only modify a few dozen lines of HPL before emulating it with SMPI. Some researchers from Intel, unaware of our recent work, applied the same methodology as the one we proposed in [5] to both Intel HPL and OpenHPL in the closed-source CoFluent simulator [23]. To the best of our knowledge, their work reports two faithful predictions for two large-scale supercomputers, but without investigating at all the impact of variability, heterogeneity, or communications as we do in this article. Finally, it is important to understand that the approach we propose is intended to help studies at the whole machine and application level, not studies of the influence of microarchitectural details as intended by gem5 [24] or MUSA [25].
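As an illustration of the black-box regression approach of [15] mentioned at the beginning of this section, the following minimal sketch (in Python, with purely hypothetical timings) fits a model with the terms suggested by the O(N^3) cost of LU factorization and extrapolates the makespan to a larger problem size. Such a fit is tied to a single application configuration: changing P, Q, or the broadcast algorithm would require new measurements.

import numpy as np

# Matrix orders and measured makespans (in seconds) for small/medium runs.
# These timings are purely illustrative (hypothetical data).
sizes = np.array([5_000, 10_000, 20_000, 40_000])
makespans = np.array([4.1, 28.5, 210.0, 1605.0])

# LU factorization performs ~(2/3)N^3 flops, so regress the makespan on
# N^3, N^2 and N, in the spirit of a partial-execution performance model.
X = np.column_stack([sizes**3, sizes**2, sizes]).astype(float)
coef, *_ = np.linalg.lstsq(X, makespans, rcond=None)

def predict(n):
    # Extrapolated makespan (seconds) for a full run of order n.
    return coef @ np.array([float(n)**3, float(n)**2, float(n)])

print(f"predicted makespan for N=100000: {predict(100_000):.0f} s")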
7. Conclusion
HPC application developers implement many elaborate algorithmic strategies (non-blocking collective operations, iteration look-ahead, etc.) whose impact on performance often depends on both the input workload and the target platform. This structure makes it very difficult to model and accurately forecast the overall application performance, and many HPC application developers and users are often left with no other option but to study and tune their applications at scale, which can be very time- and resource-consuming. We believe that being capable of precisely predicting an application's performance on a given platform is useful for application developers and users (e.g., to evaluate the scalability of the application on a given machine) and will become invaluable in the future as it can, for example, help computing centers decide which of the envisioned technologies for a new machine would work best for a given application, or whether an upgrade of the current machine should be considered.

Simulation is an effective approach in this context and SimGrid/SMPI has previously been successfully validated in several small-scale studies with simple HPC benchmarks [1, 26]. In an earlier work [5], we explained how SMPI could be used to efficiently emulate HPL. The proposed approach only requires minimal code modifications and applies to any application whose behavior does not strongly depend on data-dependent intermediate computation results. Although HPL is not a real application, it is quite optimized from an algorithmic point of view and its behavior can be controlled through 6 different parameters (granularity, geometry of the virtual topology, broadcast/swapping/factorization algorithms, and the number of concurrent iterations). HPL features classical optimization techniques such as heavily relying on MPI_Iprobe to overlap communication with computations, making it particularly challenging both in terms of tuning and simulation.
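To give an idea of the pattern involved, the following minimal sketch (using mpi4py, with hypothetical sizes and tags; this is not HPL's actual code) polls MPI_Iprobe between small chunks of computation so that an incoming panel can be received as soon as it is available:

from mpi4py import MPI
import numpy as np

# Run with, e.g.: mpirun -np 2 python overlap_sketch.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
PANEL_TAG = 7  # hypothetical tag for "panel" messages

if rank == 0:
    panel = np.random.rand(1024)           # the panel to send
    comm.Send(panel, dest=1, tag=PANEL_TAG)
elif rank == 1:
    trailing = np.random.rand(512, 512)    # local trailing-matrix block
    chunks = 0
    # Update the trailing matrix one row ("chunk") at a time; between two
    # chunks, poll for the incoming panel so that communication progresses
    # while useful computation is being performed.
    while not comm.Iprobe(source=0, tag=PANEL_TAG):
        trailing[chunks % 512] *= 0.99
        chunks += 1
    panel = np.empty(1024)
    comm.Recv(panel, source=0, tag=PANEL_TAG)
    print(f"panel received after {chunks} compute chunks")

It is this kind of polling that makes the execution non-deterministic and thus unsuitable for trace-based replay, as discussed in Section 6.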
In this article (Sections 3 and 4), we present an extensive validation study which covers the whole parameter space of HPL. Our study emphasizes the importance of carefully modeling (1) the platform heterogeneity (not all nodes have exactly the same performance), (2) the short-term temporal variability (e.g., system noise) for compute kernels, as it may propagate into communication patterns, and (3) the complexity of MPI (performance often wildly differs between small and large messages and between intra-node and extra-node communications). We show that disregarding any of these aspects may lead to grossly inaccurate predictions, even for an application as regular as HPL. By building on a few well-identified micro-benchmarks of the BLAS and MPI, we show that these aspects can be well modeled, which allows us to systematically predict the overall performance of HPL within a few percent. Our experimental results span over two years and we report situations (in Sections 3.5 and 4.1) where the simulation helped us identify performance regressions or anomalies incurred by the platform when the prediction did not match the real experiments.

We show (in Section 4) how this faithful surrogate can be used to evaluate the significance of application parameters and to tune them accordingly solely through simulations. We also propose a generative model for the compute nodes' performance that can easily be fit from daily measurements and used to produce synthetic platforms similar to the ones at hand (a minimal sketch is given at the end of this section). We show (in Section 5) how this model, which allows us to easily control temporal and spatial variability, can feed our simulations to assess the impact of variability on the performance of the application or of mitigation strategies (e.g., the eviction of the slower nodes). Likewise, the simulation makes it easy to assess the influence of the physical network on the overall performance. Most of these what-if studies would be particularly difficult to conduct through real experiments because of the difficulty of finely controlling the platform. This is, to the best of our knowledge, one of the first sensitivity analyses of a real HPC code accounting for platform uncertainty.

As future work, building on the effort of the SimGrid developers to support the emulation of a wide variety of applications with SMPI [27], we intend to conduct similar studies with other HPC benchmarks (e.g., HPCG [28] or HPGMG [29]), real applications (e.g., BigDFT [30]) and larger infrastructures. As explained in this article, a good model of the compute kernels and of the MPI library is essential. Therefore, the main challenge for systematic use of our simulation technique now lies in the automation of measurements through well-designed experiments and in the automatic detection of when the envisioned models miss essential characteristics of the platform (multi-modal behaviors, heteroscedasticity, discontinuities, etc.). We intend to provide a fully automatic calibration procedure for MPI as well as for every BLAS function, which would allow us to effortlessly predict the performance of many applications by simply linking against a BLAS-replacement library.
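To make the shape of such a generative model concrete, here is a minimal sketch assuming (purely for illustration, not as the exact model fitted in this article) Gaussian node-to-node heterogeneity and multiplicative temporal noise; all numbers are hypothetical:

import numpy as np

rng = np.random.default_rng(42)

N_NODES = 1024
MEAN_GFLOPS = 35.0   # hypothetical fleet-wide mean dgemm rate (GFlop/s)
SPATIAL_SD = 1.2     # node-to-node spread (spatial heterogeneity)
TEMPORAL_CV = 0.01   # per-call coefficient of variation (temporal noise)

# Spatial variability: each node gets its own long-term mean performance,
# as would be fitted from daily dgemm measurements.
node_means = rng.normal(MEAN_GFLOPS, SPATIAL_SD, size=N_NODES)

def sample_rates(node, n_calls):
    # Temporal variability: noisy per-call rates around the node's mean.
    return node_means[node] * rng.normal(1.0, TEMPORAL_CV, size=n_calls)

# Example what-if study: evict the 5% slowest nodes of the synthetic
# platform and compare the resulting mean performance.
threshold = np.quantile(node_means, 0.05)
kept = node_means[node_means > threshold]
print(f"mean rate: {node_means.mean():.2f} GFlop/s before eviction, "
      f"{kept.mean():.2f} GFlop/s after")

Feeding such synthetic platforms to the simulator is what allows the eviction and variability-mitigation studies of Section 5.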
Acknowledgments

References

[1] A. Degomme, A. Legrand, G. Markomanolis, M. Quinson, M. S. Stillwell, F. Suter, Simulating MPI applications: the SMPI approach, IEEE Transactions on Parallel and Distributed Systems 28 (8) (2017) 2387–2400. doi:10.1109/TPDS.2017.2669305.

[2] C. L. Janssen, H. Adalsteinsson, S. Cranford, J. P. Kenny, A. Pinar, D. A. Evensky, J. Mayo, A simulator for large-scale parallel architectures, International Journal of Parallel and Distributed Systems 1 (2) (2010) 57–73. doi:10.4018/jdst.2010040104.

[3] C. Engelmann, Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale, FGCS 30 (2014) 59–65.

[4] C. Engelmann, T. Naughton, A network contention model for the extreme-scale simulator, in: Proceedings of the IASTED International Conference on Modelling, Identification and Control (MIC) 2015, ACTA Press, Innsbruck, Austria, 2015. doi:10.2316/P.2015.826-043.

[5] T. Cornebize, A. Legrand, F. C. Heinrich, Fast and Faithful Performance Prediction of MPI Applications: the HPL Case Study, in: 2019 IEEE International Conference on Cluster Computing (CLUSTER), Albuquerque, United States, 2019. doi:10.1109/CLUSTER.2019.8891011. URL https://hal.inria.fr/hal-02096571

[6] H. W. Meuer, E. Strohmaier, J. Dongarra, H. D. Simon, The TOP500: History, Trends, and Future Directions in High Performance Computing, 1st Edition, Chapman & Hall/CRC, 2014.

[7] A. Petitet, C. Whaley, J. Dongarra, A. Cleary, P. Luszczek, HPL - a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers, version 2.2 (February 2016).

[8] H. Casanova, A. Giersch, A. Legrand, M. Quinson, F. Suter, Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms, Journal of Parallel and Distributed Computing 74 (10) (2014) 2899–2917.

[9] P. Velho, L. Schnorr, H. Casanova, A. Legrand, On the Validity of Flow-level TCP Network Models for Grid and Cloud Simulations, ACM Transactions on Modeling and Computer Simulation 23 (4) (2013) 23.

[10] A. Denis, A High-Performance Superpipeline Protocol for InfiniBand, in: E. Jeannot, R. Namyst, J. Roman (Eds.), Euro-Par 2011, Vol. 6853 of Lecture Notes in Computer Science, Springer, Bordeaux, France, 2011, pp. 276–287. URL https://hal.inria.fr/inria-00586015

[11] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, A. Riddell, Stan: A probabilistic programming language, Journal of Statistical Software 76 (1) (2017). doi:10.18637/jss.v076.i01.

[12] Y. Inadomi, T. Patki, K. Inoue, M. Aoyagi, B. Rountree, M. Schulz, D. Lowenthal, Y. Wada, K. Fukazawa, M. Ueda, M. Kondo, I. Miyoshi, Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing, in: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12. doi:10.1145/2807591.2807638.

[13] E. A. León, I. Karlin, A. Bhatele, S. H. Langer, C. Chambreau, L. H. Howell, T. D'Hooge, M. L. Leininger, Characterizing parallel scientific applications on commodity clusters: An empirical study of a tapered fat-tree, in: SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp. 909–920. doi:10.1109/SC.2016.77.

[14] P. Taffet, S. Rao, E. León, I. Karlin, Testing the limits of tapered fat tree networks, in: 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 2019, pp. 47–52. doi:10.1109/PMBS49563.2019.00011.

[15] P. Luszczek, J. Dongarra, Reducing the time to tune parallel dense linear algebra routines with partial execution and performance modeling, in: Parallel Processing and Applied Mathematics, Springer Berlin Heidelberg, 2012, pp. 730–739. doi:10.1007/978-3-642-31464-3_74.

[16] G. Zheng, G. Kakulapati, L. Kale, BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines, in: Proc. of the 18th IPDPS, 2004.

[17] R. M. Badia, J. Labarta, J. Giménez, F. Escalé, Dimemas: Predicting MPI Applications Behaviour in Grid Environments, in: Proc. of the Workshop on Grid Applications and Programming Tools, 2003.

[18] M. Mubarak, C. D. Carothers, R. B. Ross, P. H. Carns, Enabling parallel simulation of large-scale HPC network systems, IEEE Transactions on Parallel and Distributed Systems (2016).

[19] H. Casanova, F. Desprez, G. S. Markomanolis, F. Suter, Simulation of MPI applications with time-independent traces, Concurrency and Computation: Practice and Experience 27 (5) (2015) 24. doi:10.1002/cpe.3278. URL https://hal.inria.fr/hal-01232776

[20] X. Wu, F. Mueller, ScalaExtrap: Trace-based communication extrapolation for SPMD programs, in: Proc. of the 16th ACM Symp. on Principles and Practice of Parallel Programming, 2011, pp. 113–122.

[21] L. Carrington, M. Laurenzano, A. Tiwari, Inferring large-scale computation behavior via trace extrapolation, in: Proc. of the Workshop on Large-Scale Parallel Processing, 2013.

[22] R. F. Bird, S. A. Wright, D. A. Beckingsale, S. A. Jarvis, Performance modelling of magnetohydrodynamics codes, in: M. Tribastone, S. Gilmore (Eds.), Computer Performance Engineering, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 197–209.

[23] G. Xu, H. Ibeid, X. Jiang, V. Svilan, Z. Bian, Simulation-based performance prediction of HPC applications: A case study of HPL, in: 2020 IEEE/ACM International Workshop on HPC User Support Tools (HUST) and Workshop on Programming and Performance Visualization Tools (ProTools), 2020, pp. 81–88. doi:10.1109/HUSTProtools51951.2020.00016.

[24] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, B. Beckmann, S. Bharadwaj, G. Black, G. Bloom, B. R. Bruce, D. R. Carvalho, J. Castrillon, L. Chen, N. Derumigny, S. Diestelhorst, W. Elsasser, C. Escuin, M. Fariborz, A. Farmahini-Farahani, P. Fotouhi, R. Gambord, J. Gandhi, D. Gope, T. Grass, A. Gutierrez, B. Hanindhito, A. Hansson, S. Haria, A. Harris, T. Hayes, A. Herrera, M. Horsnell, S. A. R. Jafri, R. Jagtap, H. Jang, R. Jeyapaul, T. M. Jones, M. Jung, S. Kannoth, H. Khaleghzadeh, Y. Kodama, T. Krishna, T. Marinelli, C. Menard, A. Mondelli, M. Moreto, T. Mück, O. Naji, K. Nathella, H. Nguyen, N. Nikoleris, L. E. Olson, M. Orr, B. Pham, P. Prieto, T. Reddy, A. Roelke, M. Samani, A. Sandberg, J. Setoain, B. Shingarov, M. D. Sinclair, T. Ta, R. Thakur, G. Travaglini, M. Upton, N. Vaish, I. Vougioukas, W. Wang, Z. Wang, N. Wehn, C. Weis, D. A. Wood, H. Yoon, Éder F. Zulian, The gem5 simulator: Version 20.0+ (2020). arXiv:2007.03152.

[25] T. Grass, C. Allande, A. Armejach, A. Rico, E. Ayguadé, J. Labarta, M. Valero, M. Casas, M. Moreto, MUSA: A multi-level simulation approach for next-generation HPC machines, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '16, IEEE Press, Piscataway, NJ, USA, 2016, pp. 45:1–45:12. URL http://dl.acm.org/citation.cfm?id=3014904.3014965

[26] F. C. Heinrich, T. Cornebize, A. Degomme, A. Legrand, A. Carpen-Amarie, S. Hunold, A.-C. Orgerie, M. Quinson, Predicting the Energy Consumption of MPI Applications at Scale Using a Single Node, in: Proc. of the 19th IEEE Cluster Conference, 2017. URL https://hal.inria.fr/hal-01523608

[27] SimGrid Developers, Continuous integration of ECP Proxy Apps, CORAL and Trinity-Nersc benchmarks with SimGrid/SMPI, https://github.com/simgrid/SMPI-proxy-apps/