A Distributed Chunk Calculation Approach for Self-scheduling of Parallel Applications on Distributed-memory Systems
Ahmed Eleliemy and Florina M. Ciorba
Department of Mathematics and Computer Science, University of Basel, Switzerland

Abstract

Loop scheduling techniques aim to achieve load-balanced executions of scientific applications. Dynamic loop self-scheduling (DLS) libraries for distributed-memory systems are typically MPI-based and employ a centralized chunk calculation approach (CCA) to assign variably-sized chunks of loop iterations. We present a distributed chunk calculation approach (DCA) that supports various types of DLS techniques. Using both CCA and DCA, twelve DLS techniques are implemented and evaluated in different CPU slowdown scenarios. The results show that the DLS techniques implemented using DCA outperform their corresponding ones implemented with CCA, especially in extreme system slowdown scenarios.
Keywords
Dynamic loop self-scheduling (DLS), Load balancing, Centralized chunk calculation, Distributed chunk calculation
Introduction
Loops are the prime source of parallelism in scientific applications [1]. Such loops are often irregular, and a balanced execution of the loop iterations is critical for achieving high performance. However, several factors may lead to an imbalanced load execution, such as problem characteristics, algorithmic variations, and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors and, consequently, improve application performance. The dynamic aspect of DLS refers to assigning independent loop iterations during the application's execution. The self-scheduling aspect means that processing elements (PEs) drive the scheduling process by requesting work once they become free. Both aspects, dynamic and self-scheduling, make DLS techniques an excellent candidate to minimize loops' execution time and achieve a balanced execution of scientific applications on parallel systems. DLS techniques typically address two questions: (1) how many loop iterations to assign to an individual PE, and (2) which loop iterations to assign. DLS techniques assign chunks of loop iterations to each free and available
PE. The calculation of the chunk size, referred to as chunk calculation, is determined for each technique by a mathematical formula. DLS techniques typically assume no dependencies between loop iterations, and therefore, loop iterations can be assigned and executed in any order. However, DLS techniques also assume a central work queue. PEs synchronize their accesses to the central work queue to avoid any overlap in the chunk assignment. If a specific DLS technique calculates two chunks of fifty and ten loop iterations for two PEs, P1 and P2, respectively, both PEs need to synchronize their accesses to the central queue to ensure that the fifty loop iterations of P1 do not overlap with the ten loop iterations of P2. The chunk assignment requires exclusive access to the central work queue. This exclusiveness means that if P2 obtains the access before P1, P2 will obtain the first ten loop iterations and leave the next fifty loop iterations to P1.

There are two approaches to synchronize the chunk assignment: (1) making only one PE responsible for accessing the central work queue on behalf of all other PEs, or (2) serializing the PEs' accesses to the central work queue. Earlier DLS techniques, such as guided self-scheduling (GSS) [2] and factoring (FAC) [3], were devised for shared-memory systems. Thus, both synchronization approaches above could be implemented. In the first approach, one thread acts as a master that is exclusively permitted to access the work queue, while the other threads act as workers and only request chunks of work. In contrast, the second approach involves critical regions and atomic operations to safely access the central work queue.

In the middle of the 1990s, distributed-memory systems, such as clustered computational workstations, started to become the dominant architecture for high performance computing (HPC) systems [4, 5]. For these systems, having a single PE responsible for the chunk assignment was the only available implementation approach. Hence, the master-worker execution model has been a prominent approach to implement DLS techniques on distributed-memory systems. In the master-worker execution model, the master is a central entity that performs both the chunk calculation and the chunk assignment. This centralization may render the master a potential performance bottleneck in different scenarios. For instance, the master may degrade the performance of the entire application when it experiences a certain slowdown in its processing capabilities.

Although centralizing the chunk assignment does not imply centralizing the chunk calculation, many of the recent DLS techniques employ a master-worker execution model that centralizes both the chunk calculation and the chunk assignment at the master side [6, 7, 8, 9, 10]. The current work extends our earlier distributed chunk calculation approach (DCA) [11] and makes the following unique contributions.

(1)
Separation between concepts and implementations: the DCA [11] and its hierarchical version [12] were motivated by the new advancements in the MPI 3.1 standard, namely MPI one-sided communication and MPI shared-memory. The following question arises:
Is DCA limited to specific MPI features? It is essential to answer this question because only specific MPI runtime libraries fully implement the features of the MPI 3.1 standard. In this manuscript, we separate the idea of DCA from its implementation. We highlight specific requirements that a DLS technique needs to fulfill to separate the chunk calculation, which can be distributed across all PEs, from the chunk assignment, which should be synchronized across all PEs. In contrast to earlier efforts [11, 12], we introduce and evaluate a two-sided MPI-based implementation of DCA. This implementation applies to all existing MPI runtime libraries because they fully support two-sided MPI communication.

(2)
Support for new DLS categories:
Previously, DCA [11, 12] only supported DLS techniques with either fixed or decreasing chunk size patterns. In this extended manuscript, we discuss how DCA supports DLS techniques that calculate fixed, decreasing, increasing, and irregular chunk size patterns.

(3) DCA in LB4MPI.
We implemented the DCA in an existing MPI-based scheduling library, called LB4MPI [13, 14]. Initially, all the DLS techniques supported in LB4MPI were implemented with a centralized chunk calculation approach (CCA). We redesigned and reimplemented the DLS techniques with DCA in LB4MPI. In addition, we added six new DLS techniques and implemented them with both CCA and DCA.

The remainder of this work is organized as follows. Section 2 contains a review of the selected DLS techniques. Existing DLS execution models are reviewed in Section 3. The distributed chunk calculation approach and its execution model are introduced in Section 4. We discuss in Section 4 whether the existing mathematical chunk calculation formulas of the selected DLS techniques support DCA, and we show the mathematical transformations of these chunk calculation formulas that are required to enable DCA. In Section 5, we present our extensions to LB4MPI that enable the support of DCA. The design of experiments and the experimental results are discussed in Section 6. The conclusions and future work directions are outlined in Section 7.
Dynamic Loop Self-scheduling (DLS)
In scientific applications, loops are the primary source of parallelism [1]. Loop scheduling techniques have been introduced to achieve a balanced execution of loop iterations. When loops have no cross-iteration dependencies, loop scheduling techniques map individual loop iterations to different processing elements, aiming at nearly equal finish times on all processing elements. Loop scheduling techniques can be categorized into static and dynamic loop self-scheduling. The time when scheduling decisions are taken is the crucial difference between the two categories. Static loop scheduling (SLS) techniques take scheduling decisions before application execution, while dynamic loop self-scheduling (DLS) techniques take scheduling decisions during application execution. Therefore, SLS techniques have less scheduling overhead than DLS techniques, whereas DLS techniques can achieve better load-balanced executions than SLS techniques in highly dynamic execution environments.

DLS techniques can further be divided into non-adaptive and adaptive techniques. The non-adaptive techniques utilize certain information that is obtained before the application execution. The adaptive techniques regularly obtain information during the application execution, and the scheduling decisions are taken based on that new information. The adaptive techniques incur a significant scheduling overhead compared to non-adaptive techniques, yet outperform the non-adaptive ones in highly irregular execution environments.

We consider twelve loop scheduling techniques: static (STATIC), fixed size chunk (FSC) [15], guided self-scheduling (GSS) [2], factoring (FAC) [3], trapezoid self-scheduling (TSS) [16], trapezoid factoring self-scheduling (TFSS) [6], fixed increase self-scheduling (FISS) [7], variable increase self-scheduling (VISS) [7], tapering (TAP) [17], random (RND) [22], performance-based loop scheduling (PLS) [18], and adaptive factoring (AF) [19]. These techniques employ different strategies to achieve load-balanced executions. As shown in Figure 1, the calculated chunk sizes may follow fixed, increasing, decreasing, or unpredictable patterns. Table 1 summarizes the notation used in this work to describe how each DLS technique calculates the chunk sizes.

STATIC is a straightforward technique that divides the loop into P chunks of equal size. Eq. 1 shows how STATIC calculates the chunk size. Since the scheduling overhead is proportional to the number of calculated chunks, STATIC incurs the lowest scheduling overhead because it has the minimum number of chunks (only one chunk for each PE).

K_i^{STATIC} = \frac{N}{P}    (1)

SS [21] is a dynamic self-scheduling technique where the chunk size is always one iteration, as shown in Eq. 2. SS has the highest scheduling overhead because it has the maximum number of chunks, i.e., the total number of chunks is N. However, SS can achieve a highly load-balanced execution in highly irregular execution environments.

K_i^{SS} = 1    (2)
[Figure 1 comprises panels of chunk size versus chunk ID for GSS, TAP, TSS, FAC, TFSS, FISS, VISS, AF, RND, and PLS.]

Figure 1: Example of the chunk sizes of the DLS techniques. The data was obtained from the main loop of Mandelbrot [20] with 1,000 loop iterations and executing on an Intel Xeon processor with 4 MPI ranks. The minimum chunk size is set to be 1 loop iteration.

Table 1: Notation used in the present work
Symbol | Description
N | Total number of loop iterations
P | Total number of processing elements
S | Total number of scheduling steps
B | Total number of scheduling batches
i | Index of the current scheduling step, 0 ≤ i ≤ S - 1
b | Index of the currently scheduled batch, 0 ≤ b ≤ B - 1
h | Scheduling overhead associated with assigning a single loop iteration
R_i | Remaining loop iterations after the i-th scheduling step
S_i | Scheduled loop iterations after the i-th scheduling step, S_i + R_i = N
lp_start | Index of the currently executed loop iteration, 0 ≤ lp_start ≤ N - 1
L | A DLS technique, L ∈ {STATIC, FSC, GSS, TAP, TSS, FAC, TFSS, FISS, VISS, AF, RND, PLS}
K_0^L | Size of the largest chunk of a scheduling technique L
K_{S-1}^L | Size of the smallest chunk of a scheduling technique L
K_i^L | Chunk size calculated at scheduling step i of a scheduling technique L
p_j | Processing element j, 0 ≤ j ≤ P - 1
σ_{p_j} | Standard deviation of the execution times of the loop iterations executed on p_j
μ_{p_j} | Mean of the execution times of the loop iterations executed on p_j
T_loop^par | Parallel execution time of the application's parallelized loops
As a middle point between STATIC and SS, FSC assumes an optimal chunk size that achieves a balanced execution of loop iterations with the smallest overhead. To calculate such an optimal chunk size, FSC requires the variability of the iterations' execution times and the scheduling overhead of assigning loop iterations to be known before the application's execution. Eq. 3 shows how FSC calculates the optimal chunk size.

K_i^{FSC} = \left( \frac{\sqrt{2} \cdot N \cdot h}{\sigma \cdot P \cdot \sqrt{\log P}} \right)^{2/3}    (3)

GSS [2] is also a compromise between the highest load balancing that can be achieved using SS and the lowest scheduling overhead incurred by STATIC. Unlike FSC, GSS assigns decreasing chunk sizes to balance loop executions among all PEs. At every scheduling step, GSS assigns a chunk that is equal to the number of remaining loop iterations divided by the total number of PEs, as shown in Eq. 4.

K_i^{GSS} = \left\lceil \frac{R_i}{P} \right\rceil, \quad R_i = N - \sum_{j=0}^{i-1} K_j^{GSS}    (4)

TAP [17] is based on a probabilistic analysis that represents a general case of GSS. It considers the average of the loop iterations' execution times μ and their standard deviation σ to achieve a higher load balance than GSS. Eq. 5 shows how TAP tunes the GSS chunk size based on μ and σ.

K_i^{TAP} = K_i^{GSS} + \frac{v_\alpha^2}{2} - v_\alpha \cdot \sqrt{2 \cdot K_i^{GSS} + \frac{v_\alpha^2}{4}}, \quad v_\alpha = \frac{\alpha \cdot \sigma}{\mu}    (5)

TSS [16] assigns decreasing chunk sizes similar to GSS. However, TSS uses a linear function to decrement the chunk sizes. This linearity results in a lower scheduling overhead per scheduling step compared to GSS. Eq. 6 shows the linear function of TSS.

K_i^{TSS} = K_{i-1}^{TSS} - \left\lfloor \frac{K_0^{TSS} - K_{S-1}^{TSS}}{S-1} \right\rfloor, \quad S = \left\lceil \frac{2 \cdot N}{K_0^{TSS} + K_{S-1}^{TSS}} \right\rceil, \quad K_0^{TSS} = \left\lceil \frac{N}{2 \cdot P} \right\rceil, \quad K_{S-1}^{TSS} = 1    (6)

FAC [3] schedules the loop iterations in batches of equally-sized chunks. FAC evolved from comprehensive probabilistic analyses, and it assumes prior knowledge of the mean μ and standard deviation σ of the iterations' execution times. A practical implementation of FAC, denoted FAC2, assigns half of the remaining loop iterations for every batch, as shown in Eq. 7. The initial chunk size of FAC2 is half of the initial chunk size of GSS. If the more time-consuming loop iterations are at the beginning of the loop, FAC2 may balance their execution better than GSS.

K_i^{FAC} = \begin{cases} \left\lceil \frac{R_i}{2 \cdot P} \right\rceil, & \text{if } i \bmod P = 0 \\ K_{i-1}^{FAC}, & \text{otherwise} \end{cases}, \quad R_i = N - \sum_{j=0}^{i-1} K_j^{FAC}    (7)

TFSS [6] combines certain characteristics of TSS [16] and FAC [3]. Similar to FAC, TFSS schedules the loop iterations in batches of equally-sized chunks. However, it does not follow the analysis of FAC, i.e., every batch is not half of the remaining number of iterations. Batches in TFSS decrease linearly, similar to chunk sizes in TSS. As shown in Eq. 8, TFSS calculates the chunk size as the sum of the next P chunks that would have been computed by TSS, divided by P.

K_i^{TFSS} = \begin{cases} \frac{\sum_{j=i}^{i+P-1} K_j^{TSS}}{P}, & \text{if } i \bmod P = 0 \\ K_{i-1}^{TFSS}, & \text{otherwise} \end{cases}    (8)

GSS [2], TAP [17], TSS [16], FAC [3], and TFSS [6] employ a decreasing chunk size pattern. This pattern introduces additional scheduling overhead due to the small chunk sizes towards the end of the loop execution. On distributed-memory systems, this additional scheduling overhead is more substantial than on shared-memory systems. Fixed increase self-scheduling (FISS) [7] is the first scheduling technique devised explicitly for distributed-memory systems. FISS follows an increasing chunk size pattern calculated as in Eq. 9. FISS depends on an initial value B defined by the user (suggested to be equal to the total number of batches of FAC).
K_i^{FISS} = K_{i-1}^{FISS} + \left\lceil \frac{2 \cdot N \cdot \left(1 - \frac{B}{2+B}\right)}{P \cdot B \cdot (B-1)} \right\rceil, \quad K_0^{FISS} = \frac{N}{(2+B) \cdot P}    (9)

VISS [7] also follows an increasing pattern of chunk sizes. Unlike FISS, VISS relaxes the requirement of defining an initial value B. VISS works similarly to FAC2, but instead of halving the chunk size per batch, VISS increases the chunk size by an increment that is halved per batch. Eq. 10 shows the chunk calculation of VISS.

K_i^{VISS} = \begin{cases} K_{i-1}^{VISS} + \frac{K_0^{VISS}}{2^{\lfloor i/P \rfloor}}, & \text{if } i \bmod P = 0 \\ K_{i-1}^{VISS}, & \text{otherwise} \end{cases}, \quad K_0^{VISS} = K_0^{FISS}    (10)

AF [19] is an adaptive DLS technique based on FAC. However, in contrast to FAC, AF learns both μ and σ for each computing resource during application execution to ensure full adaptivity to all factors that cause load imbalance. AF does not follow a specific pattern of chunk sizes. AF adapts the chunk size based on the continuous updates of μ and σ during the application's execution. Therefore, the pattern of AF's chunk sizes is unpredictable. Eq. 11 shows the chunk calculation of AF.

K_i^{AF} = \frac{D + 2 \cdot E \cdot R_i - \sqrt{D^2 + 4 \cdot D \cdot E \cdot R_i}}{2 \cdot \mu_{p_j}}, \quad D = \sum_{j=1}^{P} \frac{\sigma_{p_j}^2}{\mu_{p_j}}, \quad E = \left( \sum_{j=1}^{P} \frac{1}{\mu_{p_j}} \right)^{-1}    (11)

RND [22] is a DLS technique that utilizes a uniform random distribution to arbitrarily choose a chunk size between specific lower and upper bounds. In the original proposal, the lower and upper bounds were defined as fractions of N/P [22]. In the current work, we set the lower and upper bounds to 1 and N/P, respectively. These bounds give RND an equal probability of selecting any chunk size between the chunk size of STATIC and the chunk size of SS, which are the two extremes of DLS techniques in terms of scheduling overhead and load balancing. Eq. 12 represents the integer range of the RND chunk sizes.

K_i^{RND} ∈ [1, N/P]    (12)

PLS [18] combines the advantages of SLS and DLS. It divides the loop into two parts. The first loop part is scheduled statically, while the second part is scheduled dynamically using GSS. The static workload ratio (SWR) is used to determine the amount of iterations to be statically scheduled. SWR is calculated as the ratio between the minimum and maximum iteration execution times of five randomly chosen iterations. PLS also uses a performance function (PF) to statically assign parts of the workload to each processing element p_j based on the PE's speed and its current CPU load. In the present work, all PEs are assumed to have the same load during the execution. This assumption is valid given the exclusive access to the HPC infrastructure used in this work. Eq. 13 shows the chunk calculation of PLS.

K_i^{PLS} = \begin{cases} \frac{N \cdot SWR}{P}, & \text{if } R_i > N - (N \cdot SWR) \\ K_i^{GSS}, & \text{otherwise} \end{cases}, \quad SWR = \frac{\text{minimum iteration execution time}}{\text{maximum iteration execution time}}    (13)

Table 2 shows the chunk sizes generated by each technique. We obtain these chunks by assuming that the total number of iterations N is 1,000 and the total number of PEs P is 4. In addition to these two parameters, we consider the other parameters required by each DLS technique. For instance, FSC requires the scheduling overhead h, which is considered to be 0.013716 seconds. TAP requires μ, σ, and α, which are assumed to be 0.1, 0.0005, and 0.0605 seconds, respectively. For FISS and VISS, we consider B and X to be 3 and 4, respectively. For PLS, we assume the SWR ratio to be 0.7.

Table 2: Chunk sizes of the selected DLS techniques considered in the current work for the main loop of Mandelbrot [20] with 1,000 loop iterations and executing on an Intel Xeon processor with 4 MPI ranks.
Technique | Chunk sizes | Total number of chunks
STATIC | 250, 250, 250, 250 | 4
SS | 1, 1, 1, ..., 1 | 1000
FSC | 17, 17, 17, ..., 14 | 59
GSS | 250, 188, 141, 106, 80, 60, 45, 34, 26, 19, 15, 11, 8, 6, 5, 4, 2 | 17
TAP | 250, 188, 141, 106, 80, 60, 45, 34, 26, 19, 15, 11, 8, 6, 5, 3, 3 | 17
TSS | 125, 117, 109, 101, 93, 85, 77, 69, 61, 53, 45, 37, 28 | 13
FAC | 125, 125, 125, 125, 63, 63, 63, 63, 32, 32, 32, 32, 16, 16, 16, 16, 8, 8, 8, 8, 4, 4, 4, 4, 2, 2, 2, 2 | 28
TFSS | 113, 113, 113, 113, 81, 81, 81, 81, 49, 49, 49, 49, 17, 11 | 14
FISS | 50, 50, 50, 50, 83, 83, 83, 83, 116, 116, 116, 116, 4 | 13
VISS | 62, 62, 62, 62, 93, 93, 93, 93, 108, 108, 108, 56 | 12
AF | 1, 1, ..., 3544, 3544, 2410, 1785, 235, 202, 179, 321, 247, 267, 197, 222, 202, 182, 157, 157, 144, 128, 126, 116, 105, 102, 86, 90, 89, 78, 72, 69, 65, 61, 57, 53, 50, 49, 45, 42, 40, 38, 36, 37, 33, 33, 29, 28, 28, 24, 23, 22, 21, 21, 19, 18, 17, 16, 16, 15, 14, 13, 12, 13, 12, 11, 11, 10, 10, 9, 9, 8, 8, 7, 7, 7, 6, 6, 5, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, ..., 1 | 316
RND | 17, 84, 16, 36, 220, 64, 45, 81, 56, 210, 34, 29, 8, 100 | 14
PLS | 175, 175, 175, 175, 75, 57, 43, 32, 24, 18, 14, 11, 8, 6, 5, 4, 3 | 17
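To make the chunk calculation concrete, the following minimal C sketch (ours, not part of LB4MPI) generates the TSS row of Table 2 directly from Eq. 6, assuming N = 1,000 and P = 4 as above; assigned chunks are clamped to the remaining iterations, which yields the final chunk of 28.

#include <stdio.h>
#include <math.h>

int main(void) {
    const long N = 1000, P = 4;                     /* as in Table 2 */
    long first = (long)ceil(N / (2.0 * P));         /* K_0^TSS = 125 */
    long last  = 1;                                 /* K_{S-1}^TSS */
    long S = (long)ceil(2.0 * N / (first + last));  /* S = 16 steps */
    long dec = (first - last) / (S - 1);            /* floored decrement = 8 */
    long remaining = N, chunk = first;
    while (remaining > 0) {
        long assign = chunk < remaining ? chunk : remaining;
        printf("%ld ", assign);                     /* 125 117 109 ... 37 28 */
        remaining -= assign;
        chunk -= dec;                               /* K_i = K_{i-1} - dec */
        if (chunk < last) chunk = last;
    }
    printf("\n");
    return 0;
}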
Existing DLS Execution Models

In the current work, the self-scheduling aspect of the DLS techniques means that once a PE becomes free, it calculates a new chunk of loop iterations to be executed. The calculated chunk size is not associated with a specific set of loop iterations. Since the DLS techniques assume a central work queue, the PE must synchronize with all other PEs to self-assign unscheduled loop iterations. We can conclude that there are two operations at every scheduling step: (1) chunk calculation and (2) chunk assignment. In principle, only the chunk assignment requires a form of global synchronization between all PEs, while the chunk calculation does not require synchronization and can be distributed across all PEs. In practice, existing DLS implementation approaches, especially for distributed-memory systems, do not consider the separation between chunk calculation and chunk assignment. Hence, the master-worker execution model dominates all existing DLS implementation approaches.

The distributed self-scheduling scheme (DSS) [6] is an example of employing the master-worker model to implement DLS techniques for distributed-memory systems. DSS relies on the master-worker execution model, similar to the one illustrated in Figure 2a. DSS enables the master to consider the speed of the processing elements and their loads when assigning new chunks. DSS was later enhanced by a hierarchical distributed self-scheduling scheme (HDSS) [8] that employs a hierarchical master-worker model, as illustrated in Figure 2b. DSS and HDSS assume a dedicated master configuration in which the master PE is reserved for handling the worker requests. Such a configuration may enhance the scalability of the proposed self-scheduling schemes. However, it results in low CPU utilization of the master. HDSS [8] suggested deploying the global master and the local masters on one physical computing node with multiple processing elements to overcome the low CPU utilization of the master (see Figure 2b). DSS and HDSS were implemented using MPI two-sided communication. In both DSS and HDSS, the master is a central entity that performs both the chunk calculation and the chunk assignment.

Another MPI-based library that implements several DLS techniques is the load balancing tool (LB tool) [13]. At the conceptual level, the LB tool is based on a single-level master-worker execution model (see Figure 2a). However, it does not assume a dedicated master. It introduces the user-defined breakAfter parameter, which indicates how many iterations the master should execute before serving pending worker requests. This parameter is required for dividing the time of the master between computation and servicing of worker requests. The optimal value of this parameter is application- and system-dependent. The LB tool also employs two-sided MPI communication. LB4MPI [14, 23] is an extension of the LB tool [13] that includes certain bug fixes and additional DLS techniques. Both the LB tool and LB4MPI employ a master-worker execution model in which the master is a central entity that performs both the chunk calculation and the chunk assignment operations; a sketch of this centralized pattern is shown below.
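The following simplified C sketch (ours, for illustration only; it does not reproduce the internals of DSS, the LB tool, or LB4MPI) shows the typical centralized pattern with two-sided MPI: the master calculates chunk sizes and assigns chunk boundaries, while the workers merely request and execute.

#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_ASSIGN  2

/* Placeholder for any chunk calculation formula, e.g., GSS (Eq. 4). */
static int calculate_chunk(int remaining, int num_workers) {
    int chunk = (remaining + num_workers - 1) / num_workers;
    return chunk > 0 ? chunk : 1;
}

static void execute_iterations(int start, int size) { /* application work */ }

static void centralized_self_scheduling(int N, int rank, int size) {
    if (rank == 0) {                        /* master: CCA */
        int start = 0, finished = 0;
        while (finished < size - 1) {
            MPI_Status st;
            MPI_Recv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            int assign[2] = {start, 0};     /* {chunk start, chunk size} */
            if (start < N) {
                assign[1] = calculate_chunk(N - start, size - 1);
                if (start + assign[1] > N) assign[1] = N - start;
                start += assign[1];
            } else {
                finished++;                 /* chunk size 0 signals termination */
            }
            MPI_Send(assign, 2, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                     MPI_COMM_WORLD);
        }
    } else {                                /* worker: request, receive, execute */
        for (;;) {
            int assign[2];
            MPI_Send(NULL, 0, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(assign, 2, MPI_INT, 0, TAG_ASSIGN,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (assign[1] == 0) break;      /* no work left */
            execute_iterations(assign[0], assign[1]);
        }
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    centralized_self_scheduling(1000, rank, size);
    MPI_Finalize();
    return 0;
}

Note how both operations, chunk calculation and chunk assignment, happen on rank 0; any slowdown of that rank therefore delays every worker.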
The dynamic load balancing library (DLBL) [24] is another MPI-based library used for cluster computing. It is based on a parallel runtime environment for multicomputer applications (PREMA) [25]. DLBL is the first tool that employed MPI one-sided communication for implementing DLS techniques. Similar to the LB tool, DLBL employs a master-worker execution model. The master expects work requests. It then calculates the size of the chunk to be assigned and, subsequently, calls a handler function on the worker side. The worker is responsible for obtaining the new chunk data without any further involvement from the master. This means that the master is still a central entity that performs both the chunk calculation and the chunk assignment.

The Distributed Chunk Calculation Approach (DCA)

The latest advancements in the MPI 3.1 standard, namely the revised and clear semantics of MPI RMA (one-sided communication) [26, 27], enabled its usage in different scientific applications [28, 29, 30]. This motivated our earlier work [11] that introduced DCA. The DCA does not require the master-worker execution scheme [11]. Using MPI RMA, DCA makes one processing element, called the coordinator, store global scheduling information, such as the index of the latest scheduling step i and the index of the previously scheduled loop iteration lp_start. The coordinator shares the memory address space where the global scheduling information is stored with all workers. Figure 3 shows that, with certain exclusive load and store operations on the shared memory address space, all entities can simultaneously calculate and assign themselves chunks of non-overlapping loop iterations. The following question arises: Is DCA limited to specific MPI features? It is essential to answer this question because only specific MPI runtime libraries fully implement the features of the MPI 3.1 standard.
[Figure 2 panels: (a) conventional master-worker execution model; (b) global and local masters located on a single physical compute node; (c) local masters distributed across multiple physical compute nodes. Legend: global master, local master, busy worker, available and requesting worker, two-sided messages, physical compute node.]

Figure 2: Variants of the master-worker execution model as reported in the literature. The replication of certain processing elements only indicates their double role, where the master participates in the computation as a worker.

[Figure 3: each free PE (1) gets a copy of the last scheduling step index i and atomically increments the original i by 1, (2) calculates K_i, and (3) gets a copy of the last start loop index lp_start and atomically increments the original lp_start by K_i.]

Figure 3: The distributed chunk calculation approach (DCA) using MPI RMA and passive-target synchronization.
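Figure 3 translates into MPI one-sided operations roughly as follows. This is a minimal sketch under our own assumptions (an MPI window on the coordinator exposing two long counters with disp_unit = sizeof(long), and any straightforward chunk formula), not the exact LB4MPI implementation; MPI_Fetch_and_op provides the atomic get-and-increment of i and lp_start.

#include <mpi.h>
#include <math.h>

/* Any straightforward (non-recursive) chunk formula works here; this
 * sketch uses K'_i of GSS (Eq. 14): ceil(((P-1)/P)^i * N/P). */
static long chunk_of_step(long i, long N, long P) {
    long k = (long)ceil(pow((P - 1.0) / P, (double)i) * N / P);
    return k > 0 ? k : 1;
}

/* Self-schedule iterations [0, N) across all ranks with DCA.
 * win exposes two longs on the coordinator (rank 0):
 * shared[0] = last scheduling step i, shared[1] = lp_start. */
void dca_self_schedule(long N, long P, MPI_Win win,
                       void (*run)(long, long)) {
    const long one = 1;
    for (;;) {
        long i, start, k;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        /* (1) atomically read and increment the scheduling step index */
        MPI_Fetch_and_op(&one, &i, MPI_LONG, 0, 0, MPI_SUM, win);
        MPI_Win_flush(0, win);
        /* (2) chunk calculation: purely local, no synchronization */
        k = chunk_of_step(i, N, P);
        /* (3) atomically read and advance lp_start by K_i */
        MPI_Fetch_and_op(&k, &start, MPI_LONG, 0, 1, MPI_SUM, win);
        MPI_Win_unlock(0, win);
        if (start >= N) break;              /* all iterations assigned */
        if (start + k > N) k = N - start;   /* clamp the final chunk */
        run(start, k);                      /* execute the chunk */
    }
}

Even if two PEs interleave their increments of i and lp_start, the assigned ranges never overlap: each PE advances lp_start by exactly the chunk size of the step index it drew, which matches the paper's premise that a calculated chunk size is not bound to a specific set of iterations.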
The idea of DCA is to ensure that the chunk size calculated at a specific PE does not rely on any information about the chunk sizes calculated at any other PE. The chunk calculation formulas (Eq. 1 to 13) can be classified into straightforward and recursive. A straightforward chunk calculation formula only requires some constants and input parameters.
A recursive chunk calculation formula requires information about previously calculated chunk sizes. For instance, STATIC, SS, FSC, and RND have straightforward chunk calculation formulas that do not require any information about previously calculated chunks, while GSS [2], TAP [17], TSS [16], FAC [3], TFSS [6], FISS [7], VISS [7], AF [19], and PLS [18] employ recursive chunk calculation formulas. Certain transformations are required to convert these recursive formulas into straightforward formulas and, thereby, enable DCA. For GSS and FAC, the transformations were already introduced in the literature [3] (Eq. 14 and 15).

K'_i^{GSS} = \left\lceil \left( \frac{P-1}{P} \right)^i \cdot \frac{N}{P} \right\rceil    (14)

K'_i^{FAC} = \left\lceil \left( \frac{1}{2} \right)^{i_{new}} \cdot \frac{N}{P} \right\rceil, \quad i_{new} = \left\lfloor \frac{i}{P} \right\rfloor + 1    (15)

As shown in Eq. 5, TAP calculates K_i^{GSS} and tunes that value based on μ, σ, and α. Based on Eq. 14, the chunk calculation formula of TAP can be expressed as a straightforward formula as follows.

K'_i^{TAP} = K'_i^{GSS} + \frac{v_\alpha^2}{2} - v_\alpha \cdot \sqrt{2 \cdot K'_i^{GSS} + \frac{v_\alpha^2}{4}}, \quad v_\alpha = \frac{\alpha \cdot \sigma}{\mu}    (16)

For TSS, a straightforward formula for the chunk calculation is shown in Eq. 17.

K'_i^{TSS} = K_0^{TSS} - i \cdot \left\lfloor \frac{K_0^{TSS} - K_{S-1}^{TSS}}{S-1} \right\rfloor    (17)

The mathematical derivation that converts Eq. 6 into Eq. 17 is as follows. The TSS chunk calculation formula can be represented as follows, where C is a constant.

K_i^{TSS} = K_{i-1}^{TSS} - C, \quad C = \left\lfloor \frac{K_0^{TSS} - K_{S-1}^{TSS}}{S-1} \right\rfloor
K_1^{TSS} = K_0^{TSS} - C
K_2^{TSS} = K_1^{TSS} - C = (K_0^{TSS} - C) - C = K_0^{TSS} - 2 \cdot C
K_i^{TSS} = K_0^{TSS} - i \cdot C = K'_i^{TSS}

TFSS [6] is devised based on TSS [16] and FAC [3]. Therefore, the straightforward formula of TSS (Eq. 17) can be used to derive the straightforward formula of TFSS, as shown in Eq. 18.

K'_i^{TFSS} = \frac{\sum_{j=i}^{i+P-1} K'_j^{TSS}}{P}    (18)

For FISS [7], a straightforward formula for the chunk calculation is shown in Eq. 19.

K'_i^{FISS} = K_0^{FISS} + i \cdot \left\lceil \frac{2 \cdot N \cdot \left(1 - \frac{B}{2+B}\right)}{P \cdot B \cdot (B-1)} \right\rceil    (19)

The mathematical derivation that converts Eq. 9 into Eq. 19 is analogous to that of TSS. The FISS chunk calculation formula can be represented as follows, where C is a constant.

K_i^{FISS} = K_{i-1}^{FISS} + C, \quad C = \left\lceil \frac{2 \cdot N \cdot \left(1 - \frac{B}{2+B}\right)}{P \cdot B \cdot (B-1)} \right\rceil
K_1^{FISS} = K_0^{FISS} + C
K_2^{FISS} = K_1^{FISS} + C = (K_0^{FISS} + C) + C = K_0^{FISS} + 2 \cdot C
K_i^{FISS} = K_0^{FISS} + i \cdot C = K'_i^{FISS}

For VISS [7], a straightforward formula for the chunk calculation is shown in Eq. 20.

K'_i^{VISS} = K_0^{FISS} \cdot \frac{1 - (0.5)^{i_{new}+1}}{0.5}, \quad i > 0, \quad i_{new} = \left\lfloor \frac{i}{P} \right\rfloor, \quad K'_0^{VISS} = K_0^{FISS}    (20)

To derive Eq. 20, we expand the first batches of Eq. 10, assuming K_0^{FISS} = a.

K_0^{VISS} = a
K_1^{VISS} = a + \frac{a}{2}
K_2^{VISS} = a + \frac{a}{2} + \frac{a}{4}

According to the geometric summation theorem,

K_{i_{new}}^{VISS} = a \cdot \sum_{k=0}^{i_{new}} (0.5)^k = a \cdot \frac{1 - (0.5)^{i_{new}+1}}{0.5} = K'_i^{VISS}

Here, i_new denotes the batch index; the subscripts in Eq. 10 refer to scheduling steps, and the chunk size changes only at the first step of each batch.
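To illustrate that these closed forms are step-indexed and synchronization-free, the following small C sketch (ours) evaluates K'_i for GSS, FAC, and TSS directly from the step index i; with N = 1,000 and P = 4, the printed sequences match the corresponding rows of Table 2 (up to the clamping of the final chunk to the remaining iterations).

#include <stdio.h>
#include <math.h>

static const long N = 1000, P = 4;   /* as in Table 2 */

static long k_gss(long i) {          /* Eq. 14 */
    return (long)ceil(pow((P - 1.0) / P, (double)i) * N / P);
}

static long k_fac(long i) {          /* Eq. 15 */
    long i_new = i / P + 1;
    return (long)ceil(pow(0.5, (double)i_new) * N / P);
}

static long k_tss(long i) {          /* Eq. 17 */
    long k0 = (long)ceil(N / (2.0 * P));        /* 125 */
    long S  = (long)ceil(2.0 * N / (k0 + 1));   /* 16 */
    return k0 - i * ((k0 - 1) / (S - 1));       /* 125 - 8i */
}

int main(void) {
    /* Any PE can evaluate its chunk size for step i independently;
     * the last chunk is clamped to the remaining iterations at
     * assignment time (e.g., TSS assigns 28 instead of 29 at i = 12). */
    for (long i = 0; i < 13; i++)
        printf("step %2ld: GSS %3ld  FAC %3ld  TSS %3ld\n",
               i, k_gss(i), k_fac(i), k_tss(i));
    return 0;
}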
For PLS, the loop iteration space is divided into two parts. In the first part, the PLS chunk calculation formula is equivalent to STATIC, i.e., it is already a straightforward formula that supports DCA. In the second part, PLS uses the GSS chunk calculation formula. Therefore, we replace K_i^{GSS} in Eq. 13 with K'_i^{GSS} from Eq. 14 to derive the PLS chunk calculation (Eq. 21).

K'_i^{PLS} = \begin{cases} \frac{N \cdot SWR}{P}, & \text{if } R_i > N - (N \cdot SWR) \\ K'_i^{GSS}, & \text{otherwise} \end{cases}    (21)

AF adapts the calculated chunk size according to μ_{p_j} and σ_{p_j}, which can be determined only during loop execution. Moreover, at every scheduling step, AF uses R_i together with μ_{p_j} and σ_{p_j} to calculate the chunk size. This leads to an unpredictable pattern of chunk sizes and makes it impossible to find a straightforward formula for AF. Accordingly, we could not determine a way to implement AF with a fully distributed chunk calculation. In our implementation, AF with DCA requires additional synchronization of R_i across all PEs. All PEs can simultaneously calculate D and E from Eq. 11. However, each PE needs to synchronize with all other PEs to calculate each K_i^{AF}.

Extensions to LB4MPI

LB4MPI [14, 23] is a recent MPI-based library for loop scheduling and dynamic load balancing. LB4MPI extends the LB tool [13] by including certain bug fixes and additional DLS techniques. LB4MPI has been used to enhance the performance of various scientific applications [31]. In this work, we extend LB4MPI in two directions: (1) we enable the support of DCA. All the DLS techniques originally supported in LB4MPI were implemented with a centralized chunk calculation approach (CCA); we redesign and reimplement them with DCA. (2) We add six additional DLS techniques and implement them with both CCA and DCA.

While LB4MPI schedules independent loop iterations across multiple MPI processes, it assumes that each MPI process has access to the data associated with the loop iterations it executes. The simplest way to ensure the validity of that assumption is to replicate the data of all loop iterations across all MPI processes. Users can also centralize or distribute the data of the loop iterations across all MPI processes. In this case, however, users need to provide a way for their application to exchange the required data associated with the loop iterations.

LB4MPI has six API functions: DLS_Parameters_Setup, DLS_StartLoop, DLS_Terminated, DLS_StartChunk, DLS_EndChunk, and
DLS_EndLoop. One can use these API functions as in Listing 1. For backward compatibility reasons, our extension of LB4MPI maintains these six APIs. However, we added a new API,
Configure_Chunk_Calculation_Mode, that selects between CCA and DCA. We changed the functionality of each of the six APIs to include a condition that checks the selected approach (CCA or DCA). When the selected approach is CCA, the six APIs work as in the original LB4MPI. For instance,
DLS_StartChunk calls either DLS_StartChunk_Centralized or DLS_StartChunk_Decentralized, based on the selected approach. DLS_StartChunk_Centralized is a function that wraps the original CCA of LB4MPI, while DLS_StartChunk_Decentralized provides the newly added functionality that supports DCA. The extended LB4MPI is available at https://github.com/unibas-dmi-hpc/DLS4LB.git.

Listing 1: Usage of LB4MPI for loop scheduling and dynamic load balancing in scientific applications

#include <mpi.h>
#include <LB4MPI.h>
int main() {
    ... /* application code */
    int mode = DECENTRALIZED; /* or CENTRALIZED */
    Configure_Chunk_Calculation_Mode(mode);
    DLS_Parameters_Setup(params); /* includes number of tasks, scheduling
                                     method, scheduling parameters μ, σ, etc. */
    DLS_StartLoop(info, start_index, end_index, scheduling_method);
    while (!DLS_Terminated(info)) {
        DLS_StartChunk(info, lp_start, chunk_size);
        /* application code to process the loop from lp_start
           to lp_start + chunk_size */
        DLS_EndChunk(info);
    }
    DLS_EndLoop(info, scheduled_tasks, total_time);
}
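To illustrate how DCA can work without MPI RMA, the following conceptual C sketch (ours; LB4MPI's actual internals may differ) shows a two-sided variant of the idea behind DLS_StartChunk_Decentralized: a coordinator only serializes two counter updates, while every rank computes its own chunk size locally from a straightforward formula.

#include <mpi.h>
#include <math.h>

#define COORD       0
#define TAG_GET_I   1   /* request: obtain and advance the step index */
#define TAG_GET_LP  2   /* request: advance lp_start by K_i */
#define TAG_REPLY   3

/* Any straightforward formula, e.g., K'_i of GSS (Eq. 14). */
static long chunk_of_step(long i, long N, long P) {
    long k = (long)ceil(pow((P - 1.0) / P, (double)i) * N / P);
    return k > 0 ? k : 1;
}

/* Coordinator: serializes only the two counters; no chunk calculation. */
void coordinator_serve(long N, int workers) {
    long step = 0, lp_start = 0;
    int done = 0;
    while (done < workers) {
        long msg;                 /* carries K_i for TAG_GET_LP */
        MPI_Status st;
        MPI_Recv(&msg, 1, MPI_LONG, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        long reply;
        if (st.MPI_TAG == TAG_GET_I) {
            reply = step++;
        } else {                  /* TAG_GET_LP */
            reply = lp_start;
            lp_start += msg;
            if (reply >= N) done++;   /* that worker will stop */
        }
        MPI_Send(&reply, 1, MPI_LONG, st.MPI_SOURCE, TAG_REPLY,
                 MPI_COMM_WORLD);
    }
}

/* Worker: the chunk calculation is local; only the assignment is
 * synchronized through the coordinator. */
void worker_self_schedule(long N, long P, void (*run)(long, long)) {
    for (;;) {
        long i, start, k, dummy = 0;
        MPI_Send(&dummy, 1, MPI_LONG, COORD, TAG_GET_I, MPI_COMM_WORLD);
        MPI_Recv(&i, 1, MPI_LONG, COORD, TAG_REPLY, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        k = chunk_of_step(i, N, P);   /* distributed chunk calculation */
        MPI_Send(&k, 1, MPI_LONG, COORD, TAG_GET_LP, MPI_COMM_WORLD);
        MPI_Recv(&start, 1, MPI_LONG, COORD, TAG_REPLY, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (start >= N) break;        /* all iterations assigned */
        if (start + k > N) k = N - start;
        run(start, k);
    }
}

In such a variant, the coordinator could also participate in the computation by periodically probing for pending requests, similar in spirit to the breakAfter mechanism of the LB tool.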
Design of Experiments and Results

Two computationally-intensive parallel applications are considered in this study to assess the performance potential of the proposed DCA. The first application, called PSIA [32], uses a parallel version of the well-known spin-image algorithm (SIA) [33]. SIA converts a 3D object into a set of 2D images. The generated 2D images can be used as descriptive features of the 3D object. As shown in Listing 2, a single loop dominates the performance of PSIA.

The second application calculates the Mandelbrot set [20]. The Mandelbrot set is used to represent geometric shapes that have the self-similarity property at various scales. Studying such shapes is important and of interest in different domains, such as biology, medicine, and chemistry [34]. As shown in Listings 2 and 3, both applications contain a single large parallel loop that dominates their execution times. Dynamic and static distributions of the most time-consuming parallel loop across all processing elements may enhance the applications' performance. Table 3 summarizes the characteristics of the main loops of both applications.

The target experimental system is called miniHPC (https://hpc.dmi.unibas.ch/HPC/miniHPC.html). It consists of 26 compute nodes that are actively used for research and educational purposes. In the present work, we use sixteen dual-socket nodes. Each node has two sockets with Intel Xeon E5-2640 processors and 10 cores per socket.
Listing 2: Parallel spin-image calculations. The main (parallelized) loop is the outer loop over i.

spinImagesKernel(W, B, S, OP, M)
Inputs: W: image width, B: bin size, S: support angle, OP: list of 3D points, M: number of spin-images
Output: R: list of generated spin-images

for i = 0 → M do
    P = OP[i];
    tempSpinImage[W, W];
    for j = 0 → length(OP) do
        X = OP[j];
        np_i = getNormalVector(P);
        np_j = getNormalVector(X);
        if acos(np_i · np_j) ≤ S then
            k = ⌈(W/2 − np_i · (X − P)) / B⌉;
            l = ⌈√(||X − P||² − (np_i · (X − P))²) / B⌉;
            if 0 ≤ k < W and 0 ≤ l < W then
                tempSpinImage[k, l]++;
    R.append(tempSpinImage);
Listing 3: Mandelbrot set calculations. The main (parallelized) loop is the outer loop over counter.

mandelbrotSetCalculations(W, CT)
Inputs: W: image width, CT: conversion threshold
Output: V: visual representation of the Mandelbrot set calculations

for counter = 0 → W² do
    x = counter / W;
    y = counter mod W;
    c = complex(x_min + x/W * (x_max − x_min), y_min + y/W * (y_max − y_min));
    z = complex(0, 0);
    for k = 0 → CT while |z| < 2.0 do
        z = z² + c;
    if k = CT then set V(x, y) to black;
    else set V(x, y) to blue;

Table 3: Characteristics of the selected applications' main loops

Application characteristic | PSIA | Mandelbrot
Number of loop iterations | 262,144 | 262,144
Maximum iteration execution time (s) | 0.190161 | 0.06237
Minimum iteration execution time (s) | 0.0345 | 0.000001
Average iteration execution time (s) | 0.07298 | 0.01025
Standard deviation (s) | 0.00885 | 0.0187
Coefficient of variation (c.o.v.) | 0.256 | 1.824

Figure 4a shows the performance of both CCA and DCA for PSIA without an injected delay. T_loop^par is 73.41 seconds with STATIC, while the best T_loop^par is 69.37 seconds with FAC. With FAC, the performance of PSIA is enhanced by 5.5%. Other techniques achieve comparable performance. For instance, T_loop^par is 69.53 seconds with PLS. In contrast, other techniques degrade the performance of PSIA. GSS and RND degrade the PSIA performance by 2.7% and 61.2%, respectively, compared to STATIC. For the DCA, one can make the same observations regarding the best and the worst techniques. The CCA and DCA versions of all techniques are comparable to each other, i.e., the difference in performance ranges from 2% to 3%.

Figures 4b and 4c show the performance of both CCA and DCA with different techniques for PSIA when the injected delay is 10 and 100 microseconds, respectively. In Figure 4b, one can notice that when the injected delay is 10 microseconds, the performance differences between CCA and DCA with all techniques are in the range of 2% to 3%. Considering the variation in T_loop^par of the 20 repetitions of each experiment, one observes that both approaches still have a comparable performance. For the largest injected delay, the DLS techniques implemented with CCA are more sensitive than the DLS techniques implemented with DCA (see Figure 4c).

Table 4: Design of factorial experiments

Factor | Values
Application | PSIA, Mandelbrot
Chunk calculation approach | CCA, DCA
Scheduling technique | STATIC, FSC, GSS, TAP, TSS, FAC, TFSS, FISS, VISS, RND, AF, PLS
Injected delay per chunk calculation | none, 10 microseconds, 100 microseconds
Repetitions per experiment | 20
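The slowdown scenarios are emulated by delaying the chunk calculation. A minimal sketch of how such a constant delay can be injected is shown below (our illustration; the exact instrumentation inside LB4MPI may differ).

#include <unistd.h>  /* usleep */

/* Wrap the chunk calculation with an artificial CPU slowdown.
 * delay_us = 0, 10, or 100 microseconds in the three scenarios. */
static long delayed_chunk_calculation(long i, long N, long P,
                                      unsigned int delay_us,
                                      long (*formula)(long, long, long)) {
    if (delay_us > 0)
        usleep(delay_us);        /* emulate a slowed-down PE */
    return formula(i, N, P);     /* e.g., K'_i of GSS (Eq. 14) */
}

Under CCA, such a delay is paid by the central master for every chunk of every worker, whereas under DCA each PE pays only for its own chunk calculations, which is consistent with the sensitivity difference reported here.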
For Mandelbrot, one can notice the same behavior, i.e., when there is no injected delay or when the injected delay is 10 microseconds, the performance differences between CCA and DCA with all techniques are minor (see Figures 5a and 5b). In contrast, Figure 5c shows that the DCA versions of all the DLS techniques are more capable of maintaining their performance than the CCA versions.

Another interesting observation is the extremely poor performance of AF with CCA (see Figure 5c). AF is an adaptive technique, and it accounts for all sources of load imbalance that affect applications during the execution. However, AF only considers μ_{p_j} and σ_{p_j}. Since we inject the delay in the chunk calculation function, AF cannot account for such a delay, and it works similarly to the case of no injected delay. Considering the characteristics of the Mandelbrot application, the majority of the AF chunks are equal to one loop iteration. This fine chunk size leads to an increased number of chunks, i.e., the performance significantly decreases because the injected delay is proportional to the total number of chunks. For PSIA, the corresponding AF implementation (with CCA) does not exhibit the same extremely poor performance (see Figure 4c) because the AF chunk sizes in the case of PSIA are larger than those in the case of Mandelbrot.

[Figure 4 panels: (a) without an injected delay; (b) with low injected delay (10 microseconds); (c) with severe injected delay (100 microseconds).]

Figure 4: Parallel application execution time of PSIA in the three slowdown scenarios.

[Figure 5 panels: (a) without an injected delay; (b) with low injected delay (10 microseconds); (c) with severe injected delay (100 microseconds).]

Figure 5: Parallel application execution time of Mandelbrot in the three slowdown scenarios.

Conclusions and Future Work

In the present work, we studied how the distributed chunk calculation approach (DCA) [11] can be applied to different categories of DLS techniques, including DLS techniques that have fixed, decreasing, increasing, and irregular chunk size patterns. The mathematical formula of the chunk calculation of any DLS technique can either be straightforward or recursive. The DCA requires that the mathematical formula of the chunk calculation be straightforward. When one of the selected DLS techniques employs a recursive chunk calculation formula, we showed the mathematical transformations required to convert it into a straightforward formula.

By implementing the DCA in an MPI-based library called LB4MPI [14, 23] using the two-sided MPI communication that is supported by all existing MPI runtime libraries, the present work answered the question:
Is DCA limited to specific MPI features?

The present work showed the performance of CCA and DCA in three different slowdown scenarios. In the first scenario, no delay was injected during the chunk calculation. In the other two scenarios, a constant delay (small and large, respectively) was injected during the chunk calculation. These scenarios represent cases in which a slowdown affects the CPU and results in slowing down the chunk calculation. For these two scenarios, the injected delay was 0.00001 and 0.0001 seconds, respectively. For the large injected delay, the results showed that the DLS techniques implemented using the DCA were only slightly affected by the injected delay. This confirms the performance potential of the DCA [11]. In a highly uncertain execution environment, when a slowdown affects the computational power of the coordinator (master), DCA is a better alternative to CCA.

DCA incurs more communication messages than CCA, specifically the messages required to exchange scheduling data between the coordinator and the workers. This increased number of messages could make DCA underperform CCA if the delay were injected during the chunk assignment rather than the chunk calculation. Therefore, we plan to assess the performance of DCA in various communication slowdown scenarios. Another future extension of LB4MPI is to enable the dynamic selection of the scheduling approach (DCA or CCA) that minimizes the applications' execution time.
Acknowledgment
This work has been supported by the Swiss National Science Foundation in the context of the "Multi-level Scheduling in Large Scale High Performance Computers" (MLS) grant, number 169123, and by the Swiss Platform for Advanced Scientific Computing (PASC) project "SPH-EXA: Optimizing Smooth Particle Hydrodynamics for Exascale Computing".
References

[1] Z. Fang, P. Tang, P.-C. Yew, C.-Q. Zhu, Dynamic Processor Self-scheduling for General Parallel Nested Loops, IEEE Transactions on Computers 39 (7) (1990) 919–929.
[2] C. D. Polychronopoulos, D. J. Kuck, Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers 100 (12) (1987) 1425–1439.
[3] S. Flynn Hummel, E. Schonberg, L. E. Flynn, Factoring: A Method for Scheduling Parallel Loops, Communications of the ACM 35 (8) (1992) 90–101.
[4] D. J. Becker, T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawak, C. V. Packer, BEOWULF: A Parallel Workstation for Scientific Computation, in: Proceedings of the International Conference on Parallel Processing, Vol. 95, 1995, pp. 11–14.
[5] K. Castagnera, D. Cheng, R. Fatoohi, E. Hook, B. Kramer, C. Manning, J. Musch, C. Niggley, W. Saphir, D. Sheppard, et al., Clustered Workstations and Their Potential Role as High Speed Compute Processors, NAS Computational Services Technical Report RNS-94-003, NAS Systems Division, NASA Ames Research Center.
[6] A. T. Chronopoulos, R. Andonie, M. Benche, D. Grosu, A Class of Loop Self-scheduling for Heterogeneous Clusters, in: Proceedings of the International Conference on Cluster Computing, 2001, pp. 282–291.
[7] T. Philip, C. R. Das, Evaluation of Loop Scheduling Algorithms on Distributed Memory Systems, in: Proceedings of the International Conference on Parallel and Distributed Computing Systems, 1997, pp. 76–94.
[8] A. T. Chronopoulos, S. Penmatsa, N. Yu, D. Yu, Scalable Loop Self-scheduling Schemes for Heterogeneous Clusters, International Journal of Computational Science and Engineering 1 (2-4) (2005) 110–117.
[9] I. Banicescu, V. Velusamy, J. Devaprasad, On the Scalability of Dynamic Scheduling Scientific Applications with Adaptive Weighted Factoring, Journal of Cluster Computing 6 (3) (2003) 215–226.
[10] R. L. Cariño, I. Banicescu, Dynamic Load Balancing With Adaptive Factoring Methods in Scientific Applications, The Journal of Supercomputing 44 (1) (2008) 41–63.
[11] A. Eleliemy, F. M. Ciorba, Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access, in: Proceedings of the 27th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2019, pp. 75–82.
[12] A. Eleliemy, F. M. Ciorba, Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach, in: Proceedings of the International Parallel and Distributed Processing Symposium Workshops, 2019, pp. 689–697.
[13] R. L. Cariño, I. Banicescu, A Load Balancing Tool for Distributed Parallel Loops, Journal of Cluster Computing 8 (4) (2005) 313–321.
[14] A. Mohammed, A. Eleliemy, F. M. Ciorba, F. Kasielke, I. Banicescu, An Approach for Realistically Simulating the Performance of Scientific Applications on High Performance Computing Systems, Future Generation Computer Systems 111 (2020) 617–633.
[15] C. P. Kruskal, A. Weiss, Allocating Independent Subtasks on Parallel Processors, IEEE Transactions on Software Engineering SE-11 (10) (1985) 1001–1016.
[16] T. H. Tzen, L. M. Ni, Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers, IEEE Transactions on Parallel and Distributed Systems 4 (1) (1993) 87–98.
[17] S. Lucco, A Dynamic Scheduling Method for Irregular Parallel Programs, in: Proceedings of the ACM Conference on Programming Language Design and Implementation, 1992, pp. 200–211.
[18] W. Shih, C. Yang, S. Tseng, A Performance-based Parallel Loop Scheduling on Grid Environments, The Journal of Supercomputing 41 (3) (2007) 247–267.
[19] I. Banicescu, Z. Liu, Adaptive Factoring: A Dynamic Scheduling Method Tuned to the Rate of Weight Changes, in: Proceedings of the High Performance Computing Symposium, 2000, pp. 122–129.
[20] B. B. Mandelbrot, Fractal Aspects of the Iteration of z → λz(1−z) for Complex λ and z, Annals of the New York Academy of Sciences 357 (1) (1980) 249–259.