Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement
Ioannis Vardas, Manolis Ploumidis, and Manolis Marazakis
Institute of Computer Science (ICS), {vardas, ploumid, maraz}@ics.forth.gr

Abstract.
HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism for exploiting system resources, which leads to increased pressure on system interconnects. At large system scales, increased communication locality can be beneficial both in terms of application performance and energy consumption. Towards this direction, several studies focus on deriving a mapping of an application's processes to system nodes in a way that communication cost is reduced. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem. Apart from communication cost, the completion time of a job can also be affected by node failures. Node failures may result in job abortions, requiring job restarts. In this paper, we address the problem of assigning processes to system resources with the goal of reducing communication cost while also taking into account node failures. The proposed approach is integrated into the Slurm resource manager. Evaluation results show that, in scenarios where a few nodes have a low outage probability, the proposed process placement approach achieves a notable decrease in the completion time of batches of MPI jobs. Compared to the default process placement approach in Slurm, the reduction is 18.9% and 31%, respectively, for two different MPI applications.
Keywords: Topology and fault-aware process placement · Performance and resilience of MPI parallel jobs · Slurm resource manager extensions
1 Introduction

There is a large number of applications, from a wide range of fields, such as molecular dynamics, astronomy and astrophysics, weather forecasting and climate modeling, that address complex problems requiring a huge amount of computational resources. Such applications rely on parallel machines and on programming models that allow harvesting their resources. The MPI standard [25] is the dominant realization of such a parallel programming model.
There is a wide range of different approaches for improving the performance of MPI applications, including topology- or machine-aware collective primitives [42,21,33], hardware assistance for certain MPI primitives [28,2,19], and point-to-point primitives tuned for RDMA-capable networks [30,20,26]. One approach that has received significant attention tackles the following problem: given an application with multiple processes that exhibit a specific communication pattern, and a parallel machine with several nodes, how should processes be laid out on the available resources so that some criteria are optimized. The MPI standard offers capabilities for exploiting topology-related information; however, leveraging such information for optimizing communication is delegated either to the MPI implementation or to the resource manager.

A common approach for addressing the aforementioned process placement problem is to assign system resources to processes with the joint goal of minimizing communication cost while achieving a fair load balance among system nodes. Assigning processes with a heavy communication profile to nearby nodes is expected to improve overall application running time. Improving communication locality has also been explored in the context of other performance criteria. Power consumption of HPC systems has been pinpointed as an obstacle towards Exascale [18]. Suitable mapping of processes to system resources can increase communication locality, which can in turn reduce energy consumption due to the interconnect [15,18]. The work in [38,15] also explores the potential to reduce network congestion.

The main contribution of this work is a process placement approach that aims at improving two aspects of MPI job completion time. The first goal is to reduce the communication cost incurred due to inter-node traffic. The second goal of the proposed process placement approach is to reduce the overhead of a job abort due to node failures. Note that a node failure may result in a call to an MPI primitive returning an error. The default handling provisioned by the MPI standard for such cases is abortion of the job [25]. The cost of a job being aborted during execution becomes even more profound for codes that solve complex problems and may require running for days.

Several studies have outlined the effect of system failures on resource utilization. The authors in [10] report that in a large-scale HPC system, 20% or more of the computing resources are wasted due to failures and recoveries. For one of Google's multipurpose clusters, it was found that a large fraction of time is spent on jobs that do not complete successfully [9]. In [23], the authors show that system-related errors cause an application to fail once every 15 minutes. What is more, failed applications, although few in number, account for approximately 9% of total production hours. The authors in [34] examine the node failure rate in a dataset collected during 1995-2005 at LANL. The number of failures per year per system can be as high as 1100, meaning that an application requiring the full cluster is expected to fail more than two times per day.

For reducing the communication cost of an MPI job, the proposed approach aims at placing ranks with a heavy communication profile on nearby nodes (with respect to topological distance).
This assumes that a training run of the MPI job has been performed, allowing extraction of that job's communication profile. For extracting the topological distance between any two nodes of the parallel machine, topology information is also assumed to be available. The corresponding topology mapping problem is solved using the Scotch graph mapping library [27,35].

The second contribution is the integration of the proposed process placement approach into Slurm. Slurm is the resource manager and job scheduler for about 60% of the machines in the Top 500 list [1]. For the evaluation of the proposed approach, batches of different MPI jobs were simulated in the SimGrid [6] environment, which is a distributed computer system simulator. Simulated scenarios reveal that our placement approach achieves a notable decrease in batch completion time when compared to the default placement approach of Slurm. In scenarios where approximately 3% of the nodes exhibited an outage probability of 2%, the corresponding improvement over Slurm's policy was 18.9% and 31%, respectively, for the two MPI applications tried. Additionally, the proposed approach manages to reduce the number of job instances that are aborted due to node failures.

The rest of the work is organized as follows: Section 2 provides a more detailed discussion of related studies, while Section 3 presents the proposed process placement approach. In Section 4, details about the integration of the proposed process placement approach in Slurm are provided. Evaluation results are presented in Section 5, while conclusions and future work are discussed in Section 6.
2 Related Work

There is a wide range of different approaches, concerning different layers of the stack, for improving the performance achieved by MPI applications. There are studies related to hardware, systems software such as resource managers, the MPI library, and the application itself. A taxonomy of related studies can be derived by considering the type of feedback they use, that is, feedback related to the system architecture or topology, and information regarding the communication pattern or profile of a specific job.

Examples of studies that use neither topology nor application communication profile related feedback are those that focus on optimizing collective or point-to-point primitives. The work in [32], for example, suggests a new implementation of the Allreduce primitive that achieves reduced execution time over the well-known recursive doubling algorithm. Several studies focus on optimizing or tuning the messaging protocol employed by point-to-point primitives for the case where RDMA capabilities are offered by the platform [30,20,26]. The work in [28,2,19] either introduces or utilizes hardware support for speeding up certain MPI primitives.

There are approaches that utilize knowledge regarding the topology or system architecture. Such approaches focus on deriving topology-aware, or otherwise improved, implementations of collective primitives for specific platforms [42,21,33].
The authors in [33], for example, suggest non-minimal algorithms for the allgather and reduce-scatter primitives for the case of Clos and single- and multi-ported torus networks.

The studies most closely related to our work employ as input a graph that captures a problem's communication patterns, along with a graph that models the architecture of the available platform. Most of them address the mapping problem described in [4], which seeks to assign processes of an application onto computational resources of the available platform. The authors in [15], for example, address the problem of mapping arbitrary communication topologies to arbitrary heterogeneous machine topologies with the goal of minimizing the maximum congestion and the average dilation. The work in [37] suggests a new algorithm for reordering the ranks of an MPI job so that communication cost is minimized. In [31], the dual recursive bi-partitioning method is explored for solving the graph mapping problem for the case of clusters with multi-core processors. The authors in [40] show that, especially for the case of hierarchical systems, the topology mapping problem can be solved through weighted graph partitioning. They discuss the properties of a new Kernighan-Lin heuristic for providing direct graph k-partitioning. The authors in [16] explore a version of the TreeMatch algorithm [17], enhanced in terms of complexity, for deriving an optimized process placement on the available platform. The work in [11] discusses an efficient algorithm based on the Kernighan-Lin heuristic for assigning tasks of a parallel program to processors for the case of hypercube parallel computers.

A similar approach formulates the problem of assigning processes onto system resources as a Quadratic Assignment Problem (QAP) [4].
The authors in [38], for example, formulate the assignment of processes to nodes as a QAP. They suggest a heuristic based on graph partitioning and greedy randomized adaptive search for solving the corresponding problem.

The approaches discussed so far are profile-guided, that is, they assume that the underlying communication pattern of a specific application is available. There are approaches, though, that do not carry this dependence. The authors in [24], for example, explore four heuristics to perform rank reordering for realizing run-time topology awareness for the case of the MPI Allgather primitive. The corresponding approach does not rely on an application's profile. Instead, it is based on the communication pattern exhibited by each algorithm used in the allgather primitive. The work in [36] targets mapping the logical communications implied by a broadcast tree to the physical network topology in order to reduce delays at critical phases of the broadcast schedule.

The work in [11] discusses an approach that is complementary to that of assigning processes of a parallel job onto platform resources. More precisely, a framework is suggested that rearranges the logical communication of broadcast and allgather operations, taking into account process distances instead of process ranks, along with topology information.

For mitigating the impact of failures, failure awareness has been integrated into several different approaches, including checkpointing [14,41,39], scheduling methods [8,13], and methods for resource allocation and resource management [43,12,22].
3 The Proposed Process Placement Approach

In this work, we consider an approach for assigning the processes of an MPI job onto the nodes of a given platform, named the TOpology and Fault Aware (TOFA) process placement approach. We focus on MPI applications with a static profile, that is, applications where processes coexist for the whole duration of the execution. In accordance with relevant studies [15,40], we formulate the problem of assigning processes to platform nodes as a topology mapping problem. We model the communication among the different processes as an undirected graph G = (V_G, E_G). For the rest of the study, this graph will be referred to as the communication graph. Each vertex v_g ∈ V_G corresponds to one process, while an edge e ∈ E_G connecting vertices u_g and v_g denotes communication between the corresponding processes. Edge weights may depict either the number of messages or the total traffic exchanged between the two processes. As stated in [5], the choice between volume and number of messages is not standard but rather application dependent. Thus, each application should be tested before choosing the best way of depicting the edge weights of the communication graph. After testing both cases, the evaluation results presented in Section 5 are derived by considering total traffic volume as edge weights.

For extracting the number of messages and the total bytes exchanged between each pair of processes, we have implemented a custom profiling tool for MPI applications. This tool is a dynamically linked library that intercepts all calls to MPI primitives that initiate traffic, such as point-to-point, collective, and one-sided ones. The output produced consists of two graphs, namely G_v and G_m. Each of these graphs is represented as a matrix of size N x N, where N is the number of processes involved in the MPI program. Graph G_v captures the number of bytes exchanged for each pair of processes, while graph G_m captures the corresponding number of messages. For the case of G_v, for example, element G_v(i, j) captures the sum of the bytes sent from MPI rank i to rank j and the bytes sent from j to i. For the case of collective primitives, the profiling tool is tuned to emulate the appropriate algorithm for each collective. In this way, it is able to accurately capture the traffic exchanged between each pair of processes during each phase of that collective's schedule. Graph G_v, or G_m, is used as the guest graph G mentioned in the previous paragraph. An additional feature of this profiling tool is that it records traffic through communicators other than the default one. For correctly updating G_v and G_m, the rank of a process in a communicator other than MPI_COMM_WORLD is transformed to its rank in MPI_COMM_WORLD (R_comm_world), and the counters in G_v and G_m are updated based on R_comm_world.
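A tool of this kind is typically built on the standard PMPI profiling interface: each MPI primitive is wrapped, the traffic is accounted for, and the call is forwarded to the real implementation. The following is a minimal sketch of that technique for MPI_Send only; the accounting helper and matrix names (record_traffic, G_v, G_m) are illustrative, not the actual internals of our tool, which additionally emulates collective algorithms and covers one-sided traffic as described above.

    /* Minimal PMPI interposition sketch (illustrative; not the actual tool).
     * Build as a shared library and preload it into the MPI application. */
    #include <mpi.h>
    #include <stdint.h>

    #define MAX_RANKS 1024
    static uint64_t G_v[MAX_RANKS][MAX_RANKS]; /* bytes exchanged per pair */
    static uint64_t G_m[MAX_RANKS][MAX_RANKS]; /* messages sent per pair   */

    /* Accumulate traffic between two MPI_COMM_WORLD ranks (G_v symmetric). */
    static void record_traffic(int src, int dst, uint64_t bytes)
    {
        G_v[src][dst] += bytes;  G_v[dst][src] += bytes;
        G_m[src][dst] += 1;
    }

    /* Wrapper: intercept MPI_Send, account for the message, then forward
     * to the real implementation through the PMPI entry point. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest,
                 int tag, MPI_Comm comm)
    {
        int me, world_me, world_dest, type_size;
        MPI_Group world_grp, comm_grp;

        PMPI_Comm_rank(comm, &me);
        PMPI_Type_size(type, &type_size);

        /* Translate ranks from 'comm' to MPI_COMM_WORLD ranks, as done
         * for communicators other than the default one. */
        PMPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        PMPI_Comm_group(comm, &comm_grp);
        PMPI_Group_translate_ranks(comm_grp, 1, &me,   world_grp, &world_me);
        PMPI_Group_translate_ranks(comm_grp, 1, &dest, world_grp, &world_dest);
        PMPI_Group_free(&world_grp);
        PMPI_Group_free(&comm_grp);

        record_traffic(world_me, world_dest, (uint64_t)count * type_size);
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }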
Another feature of our profiling tool is that it produces a traffic heatmap, which depicts the amount of bytes exchanged between each process pair. This traffic heatmap allows for visual inspection of the corresponding application's communication pattern. This inspection offers insight into the regularity or irregularity of that pattern. Figures 1a and 1b depict the traffic heatmaps produced by our profiling tool for the case of a LAMMPS [29] run and the DT NAS parallel benchmark (NPB) [3], respectively. The darker the data point, the higher the amount of traffic exchanged for the corresponding process pair. As Figure 1a shows, LAMMPS exhibits a more regular communication pattern, with traffic points lying close to the main diagonal in a similar manner for all processes.

Fig. 1: (a) LAMMPS run with 128 processes; (b) NPB-DT class C run with 85 processes

It should be noted that the proposed process placement approach is profile-guided. This means that, for deriving an assignment of MPI processes to nodes, a training run should be carried out first, to derive the corresponding communication graph. However, this cost can be amortized by performing multiple runs of the same application using the same input or configuration. It should be noted, though, that the overhead and the dependence on the application profile can be avoided by utilizing the virtual topologies provisioned by the MPI standard. Virtual topologies can be adopted by MPI applications to represent the communication pattern of the application. Nodes in such a virtual topology correspond to processes, and edges connect processes that communicate.

The underlying platform is modeled through the topology graph H = (V_H, E_H). Each vertex v_h ∈ V_H corresponds to one node. Vertices carry no weight. Let R denote the routing logic of the underlying platform. Assuming a 3D torus topology with fixed routing, the routing function R(u, v) provides the list of links to be traversed by a message sent from node u to node v. The weight w(e_{u,v}) of edge e(u, v) ∈ E_H is set to the number of hops traversed to reach v when starting from u.

A key characteristic of the proposed process placement approach is that the assignment of processes to nodes also takes node failures into account. The fault model assumed is the following: nodes may fail independently of each other. With the term failure, we refer to any hardware- or software-related event, or reboot, that may render the node temporarily unavailable. We further assume that a node restart is enough to fix transient failures and that the restart takes place instantaneously, i.e., we disregard recovery time. When a node enters the failed state, it is incapable of performing both computation and communication, i.e., it cannot send, receive, or forward traffic on behalf of other nodes. Consequently, communication attempts initiated by the MPI library will result in an error and, in turn, job abortion. Moreover, when a node is in the failed state, it is not able to respond to probes (heartbeats) aimed at inferring its availability. In this way, the corresponding Slurm module that collects heartbeats is able to infer node outage. Finally, we assume that there is no checkpoint/restart mechanism; thus, after a node fails, any affected application restarts from scratch. Details regarding the heartbeat mechanism are deferred until Section 4.

Avoiding nodes that fail frequently is expected to decrease the probability of a job being aborted. The importance of node failures becomes even more prominent for applications that have long running times. However, a strict policy of avoiding failed nodes may force the selection of a more sparse subset of the available nodes. This in turn may have an adverse effect on communication cost. There is thus a trade-off between the abort ratio tolerated and the communication cost. Therefore, a placement approach with some tolerance for node failures may strike a more favourable balance between abort ratio and completion time for a batch of jobs.

With TOFA, we approximate the effect of node failures on the cost of traversing a path by assigning larger weights to paths that include nodes with a non-zero outage probability.
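Before detailing the weight function, the sketch below illustrates the routing function R(u, v) introduced above. It assumes dimension-ordered routing with shortest wrap-around moves, which is one plausible fixed-routing choice for a 3D torus; the exact routing algorithm is not spelled out here, so this is an assumption.

    /* Sketch of a fixed routing function for a DX x DY x DZ 3D torus,
     * assuming dimension-ordered routing with shortest wrap-around moves.
     * Prints the sequence of nodes a message traverses from u to v. */
    #include <stdio.h>

    #define DX 8
    #define DY 8
    #define DZ 8

    typedef struct { int x, y, z; } coord_t;

    /* Signed step (+1, -1 or 0) along one ring of size dim, taking the
     * shorter of the two directions around the torus. */
    static int ring_step(int from, int to, int dim)
    {
        if (from == to) return 0;
        int fwd = (to - from + dim) % dim;    /* hops going "up"  */
        return (fwd <= dim - fwd) ? +1 : -1;  /* else go "down"   */
    }

    /* Walk x first, then y, then z, printing every node on the path. */
    static void route(coord_t u, coord_t v)
    {
        coord_t cur = u;
        printf("(%d,%d,%d)", cur.x, cur.y, cur.z);
        while (cur.x != v.x) { cur.x = (cur.x + ring_step(cur.x, v.x, DX) + DX) % DX;
                               printf(" -> (%d,%d,%d)", cur.x, cur.y, cur.z); }
        while (cur.y != v.y) { cur.y = (cur.y + ring_step(cur.y, v.y, DY) + DY) % DY;
                               printf(" -> (%d,%d,%d)", cur.x, cur.y, cur.z); }
        while (cur.z != v.z) { cur.z = (cur.z + ring_step(cur.z, v.z, DZ) + DZ) % DZ;
                               printf(" -> (%d,%d,%d)", cur.x, cur.y, cur.z); }
        printf("\n");
    }

    int main(void)
    {
        route((coord_t){0, 0, 0}, (coord_t){6, 1, 0}); /* wraps around in x */
        return 0;
    }

Every consecutive pair of printed nodes corresponds to one link l ∈ R(u, v); its endpoints l_s and l_d feed into the weight function of Equation 1 below.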
Our initial choice of small increases in the path cost revealed that this approach achieved only a marginal decrease in the probability of aborting a job. For that reason, we changed our approach for updating the cost of traversing paths with failing nodes as follows. Assume that an outage probability is available for each node. From the routing function R(u, v) we infer the list of links to be traversed by a message sent from u ∈ V_H to v ∈ V_H, where H corresponds to the topology graph. For each link l ∈ R(u, v), l_s and l_d denote the origin and target of that link, respectively. Using this information, we maintain a registry where the input is a node id and the output is the list of paths with this node serving as an intermediate hop. Combining the information provided by the routing function with the node outage probabilities, we update the edge weights in the topology graph using Equation 1:

    w(e_{u,v}) = \sum_{l \in R(u,v)} c + c \times 99 \times \mathbb{1}[(p_f^{l_s} > 0) \lor (p_f^{l_d} > 0)]    (1)

In Equation 1, p_f^{l_s} and p_f^{l_d} are the failure probabilities of nodes l_s and l_d, respectively, and the constant c denotes the cost in terms of number of hops. Moreover, \mathbb{1}[p_f^{l_s} > 0] is an indicator function with a value of 1 if p_f^{l_s} > 0. Our rationale is that if either of the two nodes involved in a link has an outage probability other than zero, then the cost of that link is set to 100 instead of one. Thus, the cost of a failed path becomes significantly higher than the cost of traversing the longest path (in terms of number of hops) on the platform.
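A direct transcription of Equation 1 into code, assuming c = 1 and a path represented as an array of links, could look as follows (types and names are illustrative).

    /* Sketch of the edge-weight update of Equation 1, assuming c = 1.
     * A link with either endpoint having a non-zero outage probability
     * costs 1 + 1*99 = 100; a fault-free link costs 1. */
    typedef struct { int src; int dst; } link_t;

    /* p_fail[i] is the outage probability inferred for node i. */
    static double edge_weight(const link_t *path, int nlinks,
                              const double *p_fail)
    {
        const double c = 1.0;   /* cost of one hop */
        double w = 0.0;
        for (int l = 0; l < nlinks; l++) {
            int faulty = (p_fail[path[l].src] > 0.0) ||
                         (p_fail[path[l].dst] > 0.0); /* indicator */
            w += c + c * 99.0 * faulty;
        }
        return w;
    }

With this weighting, any path through a potentially failing node costs at least 100, which exceeds the hop count of the longest path in the torus platforms considered here, matching the rationale stated above.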
Having described how the communication and topology graphs are populated, the final step is to describe the proposed process placement approach. TOFA uses the Scotch graph mapping library [27,35] to solve the corresponding graph mapping problem. The output of TOFA is a mapping of the vertices (processes) of the communication graph G onto the vertices of the topology graph H (corresponding to platform nodes). Listing 1.1 describes TOFA's process placement steps in pseudocode.

Listing 1.1: TOFA process placement approach
    Input:  G  Graph modeling an application's communication pattern
    Input:  H  Graph resembling the topology; edge weights estimated
               through Equation 1 or 2
    Output: T  Mapping of each vertex of G onto a vertex of H
    T = ScotchGraphMap(G, H)
    return T
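The listing delegates the actual mapping to Scotch. A minimal sketch of such a call through Scotch's C API is shown below, assuming the communication graph and the fault-aware topology have already been written out in Scotch's graph and architecture file formats; the file names are hypothetical and error handling is omitted.

    /* Sketch: map a communication graph onto a target architecture with
     * Scotch. comm.grf / torus.tgt are hypothetical files holding G and
     * the weighted topology H; error handling omitted for brevity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <scotch.h>

    int main(void)
    {
        SCOTCH_Graph graph;   /* communication graph G        */
        SCOTCH_Arch  arch;    /* target architecture (from H) */
        SCOTCH_Strat strat;   /* default mapping strategy     */
        SCOTCH_Num   vertnbr, edgenbr;

        FILE *gf = fopen("comm.grf", "r");
        FILE *af = fopen("torus.tgt", "r");

        SCOTCH_graphInit(&graph);
        SCOTCH_graphLoad(&graph, gf, -1, 0);
        SCOTCH_archInit(&arch);
        SCOTCH_archLoad(&arch, af);
        SCOTCH_stratInit(&strat);

        SCOTCH_graphSize(&graph, &vertnbr, &edgenbr);
        SCOTCH_Num *parttab = malloc(vertnbr * sizeof(SCOTCH_Num));

        /* parttab[i] receives the target vertex (node) for process i:
         * this is the mapping T returned by TOFA. */
        SCOTCH_graphMap(&graph, &arch, &strat, parttab);

        for (SCOTCH_Num i = 0; i < vertnbr; i++)
            printf("process %ld -> node %ld\n", (long)i, (long)parttab[i]);

        free(parttab);
        SCOTCH_stratExit(&strat);
        SCOTCH_archExit(&arch);
        SCOTCH_graphExit(&graph);
        fclose(gf); fclose(af);
        return 0;
    }

In the actual integration, the resource selection plugin described in Section 4 receives G and H from the other plugins rather than from files; the file-based variant above is simply the most self-contained way to show the mapping call.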
4 Integration into Slurm

Slurm is the acronym for Simple Linux Utility for Resource Management. It is the resource manager and job scheduler for 60% of the systems in the Top500 list [1]. Slurm manages resources such as cores, CPUs, nodes, and memory. There is also support for so-called generic resources (GRESs), such as Graphics Processing Units (GPUs) and the CUDA Multi-Process Service (MPS).

Slurm's components can be broadly categorized into two sets. The first set contains the main daemons, that is, the Slurm controller and the Slurm daemon, denoted as slurmctld and slurmd, respectively. Slurmctld runs on a single node and is responsible for allocating resources and scheduling jobs. Slurmd is the daemon that runs on each compute node and waits to execute work issued by slurmctld. The second set contains user and administration tools like srun, sbatch, squeue, etc. Srun and sbatch are used to initiate jobs, while squeue is used to view information about jobs located in the scheduling queue.

One powerful feature of Slurm is that its functionality can be extended through plugins. The topology plugins, for example, are used to provide feedback regarding the system topology, enabling topology-optimized resource selection. The resource selection plugins determine how resources are allocated to a job. The generic resources plugin (GRES) can be used to manage non-standard or custom resources. One plugin framework that was valuable for integrating TOFA into Slurm is SPANK, which stands for Slurm Plugin Architecture for Node and job (K)control. It is a generic interface for plugins which can be used to dynamically modify the job launch process.
Fig. 2: Plugins and components for integrating TOFA in Slurm

For integrating TOFA into Slurm, five different plugins were created, along with modifications to its source code. Our goal is to enable Slurm to utilize TOFA only when it is requested by the user, while not interfering with the standard resource allocation path of Slurm. The NodeState and LoadMatrix SPANK plugins run on every compute node, whereas the FAT Topology, Fault Aware Slurmctld, and Fault Aware Node Selection (FANS) plugins run only on the controller node. Figure 2 overviews the integration of TOFA into Slurm.
The Fault Aware Slurmctld plugin is responsible for the periodic polling of each node through a heartbeat. In Figure 2, the heartbeat of node i at interval t is denoted as Hb(t, i). The absence of a reply to a heartbeat is translated as a node outage. Slurmctld maintains a record of heartbeats for each node i, denoted as HB(i). The node outage probability can be inferred by post-processing the history of each node's heartbeats. Then, the node outage probabilities can be combined with the output of the routing function R(u, v), according to Equation 1, to update the edge weights in the topology graph. This plugin is also the context where different policies for inferring node outage probability can be implemented; one such policy could be a moving or weighted moving average, as sketched below.

The NodeState plugin is a SPANK plugin located on each node and is run once during the initialization of the slurmd daemon. This plugin is responsible for replying to the heartbeats sent by the Fault Aware Slurmctld plugin that is running on the controller node.
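A minimal sketch of the moving-average policy mentioned above follows; the names, the window size, and the encoding of the heartbeat history are illustrative assumptions, not the plugin's actual data layout.

    /* Sketch: infer a node's outage probability as the fraction of
     * missed heartbeats over a sliding window of the most recent W
     * probes. hb_history[t] is 1 if the node answered the heartbeat
     * at interval t, 0 otherwise. */
    #define W 64   /* window length, in heartbeat intervals */

    static double outage_probability(const unsigned char *hb_history,
                                     int nsamples)
    {
        int from = (nsamples > W) ? nsamples - W : 0;
        int missed = 0, seen = 0;
        for (int t = from; t < nsamples; t++, seen++)
            if (!hb_history[t])
                missed++;
        return (seen > 0) ? (double)missed / seen : 0.0;
    }

A weighted variant would simply give recent intervals a larger coefficient, making the estimate react faster to nodes that have just started misbehaving.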
The LoadMatrix plugin in Figure 2 is a SPANK plugin used to send the communication graph G from any compute node to the controller node. Recall that the communication graph is produced by the MPI profiling tool described in Section 3. This plugin enables srun to accept an extra argument, which can be used to provide the file containing a representation of G. Information regarding the communication graph G is sent to slurmctld, where the actual assignment of processes to nodes takes place.

The Fault Aware Torus Topology (FATT) plugin is responsible for implementing the routing function R(u, v). Additionally, it provides a representation of the platform topology. This representation is in the form of a graph and does not take node outage probabilities into account. The creation of this graph takes place during slurmctld initialization. This plugin reads a topology file which contains one entry for each node. Each entry includes the id of the node along with its x, y, and z coordinates on the 3D torus assumed. Using this information, the FATT plugin realizes the routing function R(u, v). This function provides all intermediate nodes traversed for routing traffic from a node u to a node v. This information is required by the Fault Aware Node Selection plugin, which is the plugin that performs the actual allocation of resources. Although Slurm offers a topology plugin suitable for a 3D torus, it cannot be utilized by our process placement approach, since it does not export routing-related information similar to the routing function R(u, v).

The core functionality of resource selection is implemented by the Fault Aware Node Selection plugin, which uses the following information as input:

– the communication graph G from the LoadMatrix plugin;
– the distance and the intermediate nodes required for reaching node v from u and vice versa, provided by the FATT plugin;
– the outage probability of each node, which is the result of processing the HB data structures in the Fault Aware Slurmctld plugin.

The main task performed by the aforementioned plugin is to integrate the Scotch library's functionality. More precisely, it invokes Scotch, providing as input the application's communication graph G and the fault-aware topology graph H. Scotch in turn solves the actual graph mapping problem and returns as output an array with one entry per process. Each entry holds the id of the corresponding process along with the id of the node where that process is assigned. To ensure that each task will be executed on the node allocated through the aforementioned process, Slurm's default task layout process has to be overridden. To achieve this, limited modifications to srun, sbatch, and the slurmctld step manager were required. First, we added support for a new value of srun's "distribution" parameter. An srun command issued with distribution=TOFA and a file representing the application's communication graph enables Slurm to spawn each task on the node selected by our resource allocation approach.

5 Evaluation

In this section we present the evaluation of the proposed process placement approach. As already stated, TOFA relies on Scotch for solving the corresponding graph mapping problem to derive a layout of MPI processes on platform nodes. The first part of the evaluation, in Section 5.1, focuses on assessing the quality (in terms of application performance) of the mappings produced by Scotch. In that part, node failures are disregarded; in other words, all nodes are assumed to have zero outage probability.
The evaluation of the proposed process placement approach as a whole is then presented in Section 5.2.

For the evaluation of TOFA we rely on the SimGrid [6] framework for the simulation of applications that execute on distributed systems. Within the SimGrid framework, the SMPI [7] interface is capable of simulating unmodified MPI applications. In SMPI, the communication calls of the application are intercepted and simulated, whereas the computations are carried out on the host machine.

Having a simulated environment for evaluation purposes offers several benefits. The main one is that it offers a convenient way to emulate node failures; emulating or injecting faults on a real platform is more complicated. Secondly, it allows multiple scenarios to run in parallel. Finally, it allows easy experimentation with different topologies and exploration of their effect on the process placement approach. With a real platform, this might not be so trivial.

For running MPI applications in the SimGrid environment, a description of the simulated platform is needed. The main components of a simulated platform are: nodes, links between nodes, and routes. A node is characterized by a fixed computing capability in terms of floating point operations per second (FLOPS); in our case, it is fixed to 6 Gflops. Links are characterized by their bandwidth and latency, which are fixed to 10 Gbps and one microsecond, respectively. The bandwidth was intentionally set to a moderate value, since each simulated scenario has a limited duration; high bandwidth values would mask out the effect of communication cost on job completion time. The platform description that is fed to SimGrid also lists the route for each pair of nodes. In this way, we ensure that the simulated topology matches exactly the topology assumed for deriving the mapping of processes to platform nodes.
5.1 Quality of the Mappings Produced by Scotch

In this section we assess the efficiency of the Scotch library in producing process-to-node mappings that improve performance. We compare Scotch with the following process placement approaches: (a) default-slurm, (b) random, and (c) greedy. The default-slurm approach refers to the default policy employed by Slurm. The random scheme randomly picks the node to which each process will be assigned. Finally, the greedy placement approach sorts all process pairs by the total traffic they exchange. It then iterates over all pairs, starting from the one with the highest volume. The goal is to place the processes involved as close to each other as possible, starting from a distance of one hop.

For the evaluation purposes of this section we rely on two benchmarks: LAMMPS [29] and NPB-DT from the NAS parallel benchmark suite [3]. For NPB-DT we focus on the class C variant, which involves 85 processes. LAMMPS is a state-of-the-art molecular dynamics code, while DT falls into the subcategory of unstructured computation, parallel I/O, and data movement.

The rationale for selecting LAMMPS and NPB-DT for the evaluation is to capture three key parameters that affect the performance of similar resource allocation approaches: the communication-to-computation ratio, the mix of point-to-point and collective communication, and the communication pattern. Our approach minimizes the communication cost; thus, for applications where computation outweighs communication, the expected speedup is insignificant. Both benchmarks spend a significant fraction of their execution time on communication. Additionally, in collective communication there are no specific pairs with remarkably more traffic, so there is little room for minimizing the communication. NPB-DT is dominated by point-to-point traffic, while LAMMPS exhibits a significant amount of collective traffic.

These two benchmarks exhibit different communication patterns. As also shown in Figure 1a, LAMMPS exhibits a more regular communication pattern. Data points in the corresponding traffic heatmap are arranged in a uniform manner around the main diagonal. Each process i in LAMMPS mostly communicates with processes that lie in the range [i - k, i + k], for some small value of k. On the other hand, as shown in Figure 1b, NPB-DT exhibits a more irregular communication pattern, with no traffic concentrated around the main diagonal. The main reason for using these two benchmarks is to challenge the assumptions of Scotch and default-slurm. Slurm's allocation policy iterates over the available nodes in a sequential manner. As a result, it is highly probable for processes with ranks in some range [i - k, i + k] to be placed on topologically nearby nodes. As far as the quality of the mappings produced by Scotch is concerned, the work in [5] has outlined that in some cases they might introduce performance degradation. On the other hand, the default placement policy used by Slurm is not expected to perform well on an irregular communication pattern like the one exhibited by NPB-DT.

Simulated runs of the aforementioned benchmarks are performed as follows. First, the assignment of processes to platform nodes is derived. Then, this mapping is fed to SimGrid in the form of a machine file. For all the simulated results in this section, the platform assumed is an 8x8x8 torus, i.e., a 3D torus with 8 nodes in each dimension. Using SimGrid's smpirun tool, we run a simulation of the corresponding application. For NPB-DT, the performance metric is completion time. For LAMMPS, the number of timesteps per second (reported by the application) is used as the performance metric. For the case of Scotch, the application's communication graph is extracted through a trial run with the profiling tool described in Section 3. Then, this communication graph is given as input to Scotch, along with a representation of the platform. Scotch in turn produces the mapping of that job's processes onto platform nodes.
Finally, this mapping is given as input to SimGrid.

Fig. 3: (a) Execution time for NPB-DT; (b) simulated timesteps/s for LAMMPS

As Figure 3a shows, for the NPB-DT benchmark, Scotch achieves the lowest execution time, with the greedy heuristic coming next. The execution time achieved by Scotch is 22%, 3%, and 11% lower than that of Default-slurm, Greedy, and Random, respectively. Figure 3b depicts the timesteps per second achieved by the various process placement approaches for the case of LAMMPS with different numbers of processes. Higher values correspond to higher throughput in terms of simulated timesteps per second, and thus to better performance. For the cases of LAMMPS with 32, 64, and 128 processes, Scotch outperforms all other placement approaches. For the case of LAMMPS with 256 processes, though, the default placement policy of Slurm achieves higher performance. As discussed at the beginning of this section, the default placement policy of Slurm is expected to be beneficial for applications exhibiting a more regular communication pattern, like LAMMPS. Another reason for this performance difference is the arrangement of the underlying topology. To further explore its effect, different instances of the 3D torus topology are generated (all with a total of 512 nodes): 8x8x8, 4x8x16, 8x4x16, 4x4x32, 4x32x4. For each topology arrangement, we derive an assignment of processes to platform nodes, both through Scotch and through the default placement policy of Slurm. This assignment is then provided to SimGrid in the form of a machine file before simulating the execution of LAMMPS.

Table 1: LAMMPS timesteps/s for different 3D torus topology arrangements
    Topology arrangement   Default-Slurm   TOFA
    8x8x8                  247.5           217.2
    4x8x16                 188.9           210.1
    8x4x16                 232.3           240.9
    4x4x32                 212.8           242.3
    4x32x4                 159.0           207.4
The corresponding results are summarized in Table 1. There is significant variability in the performance achieved by both approaches, with TOFA being less sensitive to the topology arrangement than Default-Slurm. A per-MPI-primitive breakdown of time reveals that the topology arrangement may have an adverse effect on the time spent in collective operations. For the case of the 4x8x16 arrangement, for example, the average process time spent in broadcast is significantly inflated for the case of Default-Slurm.
5.2 Evaluation of TOFA in the Presence of Node Failures

In this section we evaluate the performance of the proposed approach in the presence of node failures. Instead of simulating a single MPI job instance, we use job batches, each consisting of 100 instances of the same MPI application. We use two criteria to evaluate each process placement approach: batch completion time and abort ratio. The batch completion time is the total time required to complete the queue of 100 instances. The abort ratio is the fraction of instances that were aborted due to one or more node outages. As explained in Section 3, when a job fails we assume that it is restarted from scratch, rather than being restored from a checkpoint. This assumption simplifies the way we update batch completion times in the presence of node failures. Each time a job is aborted, the batch completion time is augmented by a time interval equal to a successful run, and then the job is restarted.

The evaluation experiments in this section compare Slurm's default placement approach (Default-Slurm) with the proposed placement approach (TOFA). Different simulated scenarios are examined based on the following parameters:

– the MPI application;
– N: the number of MPI processes involved;
– p_f: the node outage probability;
– n_f: the number of nodes emulated in the failed state;
– N_f: the set of nodes emulated in the failed state.

The MPI applications simulated are LAMMPS (64 processes) and NPB-DT (85 processes) of class C. For each application, 10 different batches are created, consisting of 100 instances each. For each batch, the nodes that populate set N_f are randomly selected and remain the same for all 100 instances of the same batch. Similarly to the evaluation results presented in Section 5.1, the platform assumed consists of 512 nodes arranged in an 8x8x8 torus. In each SimGrid scenario, the following approach is followed for emulating a node as being in the failed state. Each node from set N_f is assigned a fixed outage probability p_f, which is the same for all nodes. Based on this probability, we determine whether the node will be emulated as being in the failed state or not in each scenario. This means that for each simulated scenario, a different subset of the nodes in N_f will be emulated as being in the failed state. SimGrid allows specifying different values for a specific link's capacity at different points in simulated time. Specifying a value of zero results in all transmissions on that link failing, and consequently, in abortion of the simulated MPI program. These variations in link capacity are defined in the platform description that is provided as input to SimGrid. Thus, for every node that will be emulated as being in the failed state, the platform description is updated by assigning a zero bandwidth value to all links in which this node participates.

Figure 4 depicts the completion time for 10 different batches of NPB-DT jobs with 85 processes each. For each batch, 16 nodes out of all 512 are randomly selected and are assigned an outage probability of 2%. As this figure shows, for all batches, TOFA achieves significantly lower completion time than the default placement policy of Slurm. More precisely, the average improvement in batch completion time, over all 10 batches, is 31%. This drop in batch completion time is due to two factors: (1) the reduction in the communication cost as the result of topology- and application profile-aware placement; and (2) the drop in the number of jobs being aborted due to node outages. The average job abort ratio (over 1000 simulated NPB-DT instances) is 2% for the case of TOFA, and 7.4% for Slurm's default process placement policy. The average duration of each simulated scenario reported in Figure 4 is limited to 30 seconds. The effect of aborted jobs on average batch completion time is expected to be even more significant for larger problem sizes.

Fig. 4: NPB-DT batch completion time, 16 faulty nodes with a 2% outage probability each

The following simulation results are derived from LAMMPS runs of the rhodopsin problem with 64 processes. Two different scenarios are simulated. In the first one, eight nodes have a 2% probability of entering the failed state. In the second one, the corresponding number is 16; therefore, approximately 3% of all nodes have an outage probability of 2%. Figures 5a and 5b depict the completion time for 10 different batches of the LAMMPS application, for the two aforementioned scenarios, and for the two process placement approaches: TOFA vs. Slurm's default placement policy (Default-Slurm). For all 10 different batches, the completion time achieved by TOFA is lower. On average over all 10 batches, TOFA achieves 17.5% and 18.9% lower batch completion time than Default-Slurm, for the cases of 8 and 16 failing nodes, respectively.

Fig. 5: (a) LAMMPS, 8 faulty nodes with a 2% outage probability each; (b) LAMMPS, 16 faulty nodes with a 2% outage probability each

As expected, when the fraction of nodes that are likely to enter the failed state increases, the gain of allocating resources while taking these probabilities into account also increases. As in the case of NPB-DT, the drop in the batch completion time is attributed to both the reduction of the communication cost and the reduced overhead of failed jobs that need to be restarted. Another interesting observation is that, for the scenario depicted in Figure 5a, where 8 nodes are emulated as faulty, TOFA always manages to find 64 consecutive non-faulty nodes and thus achieves a zero job abort ratio. This is not the case, though, for the scenario with 16 faulty nodes, where the corresponding job abort ratio is 1.1% for TOFA and 4.0% for Default-Slurm.

It is also interesting to compare the average benefit over all 10 batches achieved by TOFA for NPB-DT and LAMMPS. As also discussed in Section 5.1, LAMMPS exhibits a more regular communication pattern compared to NPB-DT. As a result, the benefit of a topology-aware process placement approach over the default placement policy of Slurm is expected to be lower. Indeed, the corresponding benefit in average batch completion time achieved by TOFA is 18.9% for the case of LAMMPS and 31% for the case of NPB-DT (with 16 failing nodes).
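As a side note, the batch-time accounting described above (every abort adds the duration of one full run before the instance is retried) is easy to reproduce in a few lines of simulation. The sketch below uses hypothetical inputs loosely inspired by the NPB-DT figures (a 30-second run time and a 7.4% per-attempt abort probability); it is an illustration of the accounting rule, not part of the paper's tooling.

    /* Sketch: batch completion time under the restart-from-scratch
     * accounting of Section 5.2. Each aborted attempt adds one full
     * run time T before the job is retried. Inputs are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int    batch   = 100;   /* job instances per batch       */
        const double T       = 30.0;  /* seconds per successful run    */
        const double p_abort = 0.074; /* per-attempt abort probability */

        srand(42);
        double total = 0.0;
        for (int job = 0; job < batch; job++) {
            /* retry until the instance completes without an abort */
            while ((double)rand() / RAND_MAX < p_abort)
                total += T;           /* cost of the aborted attempt  */
            total += T;               /* the successful run           */
        }
        printf("batch completion time: %.1f s\n", total);
        return 0;
    }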
6 Conclusions and Future Work

In this work, we present a process placement approach for MPI jobs that improves completion time. For deriving the assignment of a job's processes to nodes, we take into account both the topology and the job's communication patterns. Differing from similar approaches, we also take node failures into account when post-processing the graph that models the topology. The goal of this approach is to reduce the communication cost due to inter-node traffic and also to reduce the overhead of restarting jobs that were aborted due to node failures. The assignment of processes to nodes is formulated as a topology mapping problem. TOFA has been integrated into Slurm, using plugins that extend its functionality. The proposed approach has been evaluated in the SimGrid environment using two different benchmarks. For the case where around 3% of all nodes have an outage probability of 2%, TOFA reduces completion time by 18.9% and 31%, respectively, for the two applications evaluated.

Acknowledgments
We thankfully acknowledge the support of the European Commission under the Horizon 2020 Framework Programme for Research and Innovation through the projects EuroEXA (Grant Agreement ID: 754337) and ExaNeSt (Grant Agreement ID: 671553).
References
1. Top500 Supercomputer list.
2. Almási, G., Heidelberger, P., Archer, C.J., Martorell, X., Erway, C.C., Moreira, J.E., Steinmacher-Burow, B., Zheng, Y.: Optimization of MPI collective communication on BlueGene/L systems. In: Proceedings of the 19th Annual International Conference on Supercomputing. pp. 253-262. ICS '05, ACM, New York, NY, USA (2005)
3. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Simon, H.D., Venkatakrishnan, V., Weeratunga, S.K.: The NAS parallel benchmarks. Tech. rep., The International Journal of Supercomputer Applications (1991)
4. Bokhari, S.H.: On the mapping problem. IEEE Transactions on Computers C-30(3), 207-214 (March 1981)
5. Bordage, C., Jeannot, E.: Process affinity, metrics and impact on performance: An empirical study. In: 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). pp. 523-532 (2018). https://doi.org/10.1109/CCGRID.2018.00079
6. Casanova, H., Giersch, A., Legrand, A., Quinson, M., Suter, F.: Versatile, scalable, and accurate simulation of distributed applications and platforms. Journal of Parallel and Distributed Computing (10), 2899-2917 (Jun 2014)
7. Degomme, A., Legrand, A., Markomanolis, G., Quinson, M., Stillwell, M.L., Suter, F.: Simulating MPI applications: the SMPI approach. IEEE Transactions on Parallel and Distributed Systems (8), 14 (Aug 2017)
8. Dogan, A., Ozguner, F.: Matching and scheduling algorithms for minimizing execution time and failure probability of applications in heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems (3), 308-323 (2002)
9. El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: A trace-driven approach to understanding, predicting, and mitigating job terminations. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). pp. 1333-1344 (2017)
10. Elnozahy, E.N., Bianchini, R., El-Ghazawi, T., Fox, A., Godfrey, F., Hoisie, A., McKinley, K., Melhem, R., Plank, J., Ranganathan, P., Simons, J.: System resilience at extreme scale. Tech. rep., Defense Advanced Research Projects Agency (2008)
11. Ercal, F., Ramanujam, J., Sadayappan, P.: Task allocation onto a hypercube by recursive mincut bipartitioning. Journal of Parallel and Distributed Computing (1), 35-44 (1990)
12. Fu, S.: Failure-aware resource management for high-availability computing clusters with distributed virtual machines. J. Parallel Distrib. Comput. (4), 384-393 (Apr 2010). https://doi.org/10.1016/j.jpdc.2010.01.002
13. Hakem, M., Butelle, F.: Reliability and scheduling on systems subject to failures. In: 2007 International Conference on Parallel Processing (ICPP 2007). pp. 38-38 (2007)
14. Heien, E., LaPine, D., Kondo, D., Kramer, B., Gainaru, A., Cappello, F.: Modeling and tolerating heterogeneous failures in large parallel systems. In: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1-11 (2011)
15. Hoefler, T., Snir, M.: Generic topology mapping strategies for large-scale parallel architectures. In: Proceedings of the International Conference on Supercomputing. pp. 75-84. ICS '11, ACM, New York, NY, USA (2011)
16. Jeannot, E., Mercier, G., Tessier, F.: Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Transactions on Parallel and Distributed Systems (4), 993-1002 (April 2014)
17. Jeannot, E., Mercier, G.: Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: D'Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010 - Parallel Processing. pp. 199-210. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)
18. Kogge, P., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hiller, J., Keckler, S., Klein, D., Lucas, R.: Exascale computing study: Technology challenges in achieving exascale systems. Tech. rep., Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO) (01 2008)
19. Liu, J., Mamidala, A.R., Panda, D.K.: Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support. In: 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings. pp. 10- (April 2004)
20. Liu, J., Wu, J., Panda, D.K.: High performance RDMA-based MPI implementation over InfiniBand. Int. J. Parallel Program. (3), 167-198 (Jun 2004)
21. Ma, T., Bosilca, G., Bouteiller, A., Dongarra, J.J.: Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms. Journal of Parallel and Distributed Computing (7), 1000-1010 (2013), Best Papers: International Parallel and Distributed Processing Symposium (IPDPS) 2010, 2011 and 2012
22. Machida, F., Kawato, M., Maeno, Y.: Redundant virtual machine placement for fault-tolerant consolidated server clusters. In: 2010 IEEE Network Operations and Management Symposium - NOMS 2010. pp. 32-39 (2010)
23. Martino, C.D., Kramer, W., Kalbarczyk, Z., Iyer, R.: Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs. In: 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. pp. 25-36 (June 2015). https://doi.org/10.1109/DSN.2015.50
24. Mirsadeghi, S.H., Afsahi, A.: Topology-aware rank reordering for MPI collectives. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). pp. 1759-1768 (May 2016)
25. MPI: A Message-Passing Interface Standard.
26. Pakin, S.: Receiver-initiated message passing over RDMA networks. In: 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1-12 (April 2008)
27. Pellegrini, F., Roman, J.: Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In: Liddell, H., Colbrook, A., Hertzberger, B., Sloot, P. (eds.) High-Performance Computing and Networking. pp. 493-498. Springer Berlin Heidelberg, Berlin, Heidelberg (1996)
28. Peng, Y., Saldana, M., Chow, P.: Hardware support for broadcast and reduce in MPSoC. In: 2011 21st International Conference on Field Programmable Logic and Applications. pp. 144-150 (Sep 2011)
29. Plimpton, S.: Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. (1), 1-19 (Mar 1995)
30. Rashti, M.J., Afsahi, A.: Improving communication progress and overlap in MPI rendezvous protocol over RDMA-enabled interconnects. In: 2008 22nd International Symposium on High Performance Computing Systems and Applications. pp. 95-101 (June 2008)
31. Rodrigues, E., Madruga, F., Navaux, P., Panetta, J.: Multi-core aware process mapping and its impact on communication overhead of parallel applications. pp. 811-817 (07 2009)
32. Ruefenacht, M., Bull, M., Booth, S.: Generalisation of recursive doubling for allreduce. Parallel Comput. (C), 24-44 (Nov 2017)
33. Sack, P., Gropp, W.: Faster topology-aware collective algorithms through non-minimal communication. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 45-54. PPoPP '12, ACM, New York, NY, USA (2012)
34. Schroeder, B., Gibson, G.: Understanding failures in petascale computers. Journal of Physics: Conference Series (09 2007). https://doi.org/10.1088/1742-6596/78/1/012022
35. Scotch: Software package and libraries for sequential and parallel graph partitioning, static mapping and parallel sparse matrix block ordering.