Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access
Ahmed Eleliemy and Florina M. Ciorba
Department of Mathematics and Computer Science, University of Basel, Switzerland
January 10, 2019

Abstract

Scientific applications often contain large computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load-balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to the master-level contention. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it considers the novel features in the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups have been considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model.

Keywords
Dynamic loop scheduling; Distributed-memory systems; Master-worker execution model; MPI; Passive-target remote memory access.

1 Introduction
Over the past decade, the increasing demand for computational power of scientific applications played a significant role in the development of modern high performance computing (HPC) systems. The advancements of modern HPC systems, at both the hardware and software levels, raise questions regarding whether successful algorithms and techniques proposed in the past can still benefit from these advancements. Algorithms and techniques may, therefore, need to be revisited and re-evaluated to fully leverage the capabilities of modern HPC systems. Dynamic loop scheduling (DLS) techniques are important examples of successful scheduling techniques proposed over the years. The DLS techniques are critical for scheduling parallel loops, which are the main source of parallelism in scientific applications [1]. A large body of work on DLS was introduced between the late 1980's and the early 2000's [2-8]. These DLS techniques were used in several scientific applications to achieve a load-balanced execution of loop iterations. Several factors can hinder such a load-balanced execution and, consequently, degrade applications' performance. Specifically, problem characteristics, non-uniform input data sets, as well as algorithmic and systemic variations lead to different execution times of each loop iteration. The DLS techniques are designed to mitigate load imbalance due to the aforementioned factors.

Dynamic loop self-scheduling-based techniques, such as self-scheduling (SS) [2], guided self-scheduling (GSS) [3], trapezoid self-scheduling (TSS) [4], factoring (FAC) [5], and weighted factoring (WF) [6], constitute an important category of DLS techniques. The distinguishing aspect of loop self-scheduling is that whenever a processing element becomes available and requests work, it obtains a collection of loop iterations (called a chunk) from a central work queue. Each DLS technique uses a certain function to calculate chunk sizes.

Several implementations of self-scheduling-based techniques [8-12] employ the master-worker execution model and use the classical two-sided MPI communication model. In the master-worker execution model, a processing element (called master) holds all the information required to calculate the chunks and serves work requests from other processing elements (called workers). Workers request new chunks once they become available, according to the self-scheduling principle. The master-worker execution model highlights an important performance-relevant detail concerning the implementation of self-scheduling-based techniques: it centralizes the chunk-calculation at the master. This centralization renders the master process a performance bottleneck in three scenarios: (1) the master process has decreased processing capabilities; this may happen in heterogeneous systems; (2) the master process receives a large number of concurrent work requests; this may happen in large-scale distributed-memory systems; and (3) a combination of (1) and (2), which may be the case when executing on large-scale heterogeneous systems.

The current work proposes an approach to address the first execution scenario described above. Intuitively, this scenario can be avoided by mapping the master process to the processing element with the highest processing capabilities. However, such a mapping may not always be guaranteed or feasible.
For instance, system variations during applications' execution may adversely affect the computation or the communication capabilities of the master, leading to performance degradation.

The MPI-2 standard [13] introduced remote memory access (RMA) semantics that were seldom adopted in scientific applications because they had several issues [14]. For instance, the unrestricted allocation of exposed-memory regions makes the efficient implementation of one-sided functions extremely difficult. The MPI RMA model was significantly revised in the MPI-3 standard, and more performance-capable RMA semantics were introduced [15, 16]. For instance,
MPI_Win_allocate was introduced to restrict the memory allocation and to allow more efficient MPI one-sided implementations. The performance assessment of MPI RMA-based approaches has recently gained increased attention for different scientific applications [17-19].

The present work proposes a novel approach for developing DLS techniques to execute scientific applications on distributed-memory systems. The proposed approach distributes the chunk-calculation of the DLS techniques among all processing elements. Moreover, the proposed approach is implemented using the recent features offered by the MPI-3 standard, such as passive-target synchronization, shared-memory window creation, and atomic read-modify-write operations. The present work is significant for improving the performance of various scientific applications, such as N-body simulations [20], Monte-Carlo simulations, and computational fluid dynamics [7], that employ DLS techniques on heterogeneous and large-scale modern HPC systems. Moreover, the present work provides insights for improving existing DLS libraries [10, 11] such that they take advantage of modern HPC systems by exploiting the remote direct memory access (RDMA) capabilities of modern interconnection networks.

The main contributions of this work are: (1) the proposal and implementation of five DLS techniques with distributed chunk-calculation for distributed-memory systems; and (2) the evaluation of the benefit of using MPI one-sided communication and the passive-target synchronization mode to implement the DLS techniques SS [2], GSS [3], TSS [4], FAC [5], and WF [6] against other existing approaches [10, 11, 21].

The remainder of this work is organized as follows. Section 2 contains a review of the selected DLS techniques, as well as of the relevant research efforts on the different implementation approaches of DLS techniques for distributed-memory systems. The background on the MPI RMA model is also presented in Section 2. The proposed distributed chunk-calculation approach and its execution model are introduced in Section 3. The design of experiments and the experimental results are discussed in Sections 4 and 5, respectively. The conclusions and the potential future work are outlined in Section 6.
2 Background and Related Work
Loop Scheduling
Loops are the richest source of parallelism in scientific applications [1]. Computationally-intensive scientific applications spend most of their execution time in large loops. Therefore, efficient scheduling of loop iterations benefits scientific applications' performance. In the current work, five well-known DLS techniques are considered: self-scheduling (SS) [2], guided self-scheduling (GSS) [3], trapezoid self-scheduling (TSS) [4], factoring (FAC) [5], and weighted factoring (WF) [6]. These techniques are considered due to their competitive performance in different applications, and due to the fact that they are at the basis of other DLS techniques. For instance, trapezoid factoring self-scheduling (TFSS) [22] is based on TSS and FAC, while adaptive weighted factoring (AWF) [7] and its variants [23] are derived from FAC.

The common aspect of all selected DLS techniques is that new chunks of iterations are assigned to processing elements once they become available and request work. However, each of these DLS techniques employs a different function to calculate the size of the chunk to be assigned. The notation used in the present work is given in Table 1.

Table 1: Notation used in the present work

Symbol     Description
N          Total number of loop iterations
P          Total number of processing elements
S          Total number of scheduling steps
B          Total number of scheduling batches
i          Index of the current scheduling step, 0 <= i <= S-1
b          Index of the currently scheduled batch, 0 <= b <= B-1
R_i        Remaining loop iterations after the i-th scheduling step
S_i        Scheduled loop iterations after the i-th scheduling step, S_i + R_i = N
lp_start   Index of the currently executed loop iteration, 0 <= lp_start <= N-1
K_0        Size of the largest chunk
K_{S-1}    Size of the smallest chunk
K_i        Chunk size calculated at scheduling step i
p_j        Processing element j, 0 <= j <= P-1
W_{p_j}    Relative weight of processing element j, 0 <= j <= P-1, with sum over j of W_{p_j} = P
σ          Standard deviation of the loop iterations' execution times
µ          Mean of the loop iterations' execution times
T_p        Parallel execution time of the entire application
T_p^loop   Parallel execution time of the application's parallelized loops

In the SS [2] technique, the assigned chunk, K_i, is always one loop iteration. Due to the fine-grained partitioning of the loop iterations, SS can achieve a highly load-balanced execution. However, it incurs a high scheduling overhead. GSS [3], TSS [4], and FAC [5] outperform SS in terms of scheduling overhead by assigning chunks of decreasing size. In each scheduling step i, GSS uses a non-linear function to calculate the chunk size: it divides the remaining loop iterations, R_i, by the total number of processing elements, P.

TSS [4] employs a simple linear function to calculate the decreasing chunk sizes using a fixed decrement. This linearity results in a low scheduling overhead in each scheduling step.

FAC [5] schedules the loop iterations in batches of P equally-sized chunks. The initial chunk size of FAC is smaller than the initial chunk size of GSS. If more time-consuming loop iterations exist at the beginning of the loop, GSS may not balance their execution as efficiently as FAC. Unlike GSS and TSS, which calculate the chunks deterministically, the chunk-calculation in FAC evolved from comprehensive probabilistic analyses. To calculate an optimal chunk size, FAC requires prior knowledge of the standard deviation, σ, of the loop iterations' execution times and of their mean execution time, µ. A practical implementation of FAC, denoted FAC2, does not require µ and σ, and assigns half of the remaining work in every batch [5]. FAC2 evolved from the probabilistic analysis that gave birth to FAC, and is considered in the current work.

WF [6] uses the FAC function to calculate the batch size. However, the processing elements execute variably-sized chunks of this batch according to their relative weights. The processor weights, W_{p_j}, are determined prior to the application's execution and do not change during the execution. The chunk-calculation function of each technique is shown in Table 2.

Table 2: Chunk-calculation per loop self-scheduling technique

Technique   Chunk-calculation
STATIC      K_i = ⌈N/P⌉
SS          K_i = 1
GSS         K_i = ⌈R_i/P⌉, with R_0 = N
TSS         K_i = K_{i-1} − ⌊(K_0 − K_{S-1})/(S−1)⌋, where S = ⌈2N/(K_0 + K_{S-1})⌉, K_0 = ⌈N/(2P)⌉, and K_{S-1} = 1
FAC2        K_i = ⌈R_i/(2P)⌉ if i mod P = 0, K_{i-1} otherwise
WF          K_i = ⌈W_{p_j} × K_i^{FAC2}⌉
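For illustration, the chunk-size functions in Table 2 can be sketched in C as follows (a minimal sketch using the notation of Table 1; the function names and integer/floating-point choices are ours and not taken from any existing DLS library):

    #include <math.h>

    /* Illustrative chunk-size functions from Table 2. N: total iterations,
       P: number of processing elements, R: remaining iterations R_i,
       i: scheduling step index, prev_k: previous chunk size K_{i-1}. */

    long gss_chunk(long R, long P) {
        return (long)ceil((double)R / P);          /* K_i = ceil(R_i / P) */
    }

    long tss_chunk(long prev_k, long N, long P) {
        long K0   = (long)ceil(N / (2.0 * P));     /* largest chunk K_0   */
        long KS_1 = 1;                             /* smallest chunk      */
        long S    = (long)ceil(2.0 * N / (K0 + KS_1)); /* scheduling steps */
        return prev_k - (K0 - KS_1) / (S - 1);     /* fixed decrement; integer division acts as floor */
    }

    long fac2_chunk(long R, long P, long i, long prev_k) {
        if (i % P == 0)                            /* a new batch starts every P steps */
            return (long)ceil((double)R / (2.0 * P));
        return prev_k;                             /* same size within a batch */
    }

Note how GSS and FAC2 need the global value R_i, while TSS needs K_{i-1}; this dependence on shared, per-step state is precisely what the transformation in Section 3 removes.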
Related Work

Chronopoulos et al. introduced a distributed approach for implementing self-scheduling techniques (DSS) [24]. The goal was to improve the performance of the self-scheduling techniques on heterogeneous and distributed-memory resources. The proposed scheme was based on the master-worker execution model, similar to the one illustrated in Figure 1a, and was implemented using the classical two-sided MPI communication. The main idea was to enable the master to consider the speed of the processing elements and their loads when assigning new chunks. Chronopoulos et al. later enhanced the performance of the DSS scheme using a hierarchical master-worker model, and proposed the hierarchical distributed self-scheduling (HDSS) [22], which was similar to the one illustrated in Figure 1b. DSS and HDSS assume a dedicated-master configuration in which the master processing element is reserved for handling the worker requests. Such a configuration may enhance the scalability of the proposed self-scheduling schemes. However, it results in a low CPU utilization of the master. HDSS [22] suggested deploying the global master and the local master on one physical computing node that has multiple processing elements to overcome the low CPU utilization of the master (cf. Figure 1b).

Cariño et al. proposed a load balancing (LB) tool that integrated several DLS techniques [11]. At the conceptual level, the LB tool is based on a single-level master-worker execution model (cf. Figure 1a). However, it did not assume a dedicated master. It introduced the breakAfter parameter, which is user-defined and indicates how many iterations the master should execute before serving pending worker requests. This parameter is required for dividing the time of the master between computation and servicing of worker requests. The optimal value of this parameter is application- and system-dependent. The LB tool also employed the classical two-sided MPI communication.

Banicescu et al. proposed a dynamic load balancing library (DLBL) for cluster computing [10]. The DLBL is based on a parallel runtime environment for multicomputer applications (PREMA) [25]. Even though the DLBL was the first library to utilize MPI one-sided communication, the active message synchronization offered by PREMA required a master-worker model. In DLBL, the master expects work requests. Then, it calculates the size of the chunk to be assigned and, subsequently, calls a handler function on the worker side. The worker is responsible for obtaining the data of the new chunk from the master without any further involvement from the master.

Hierarchical loop scheduling (HLS) [12] was one of the earliest efforts to utilize a hybrid MPI and OpenMP programming model to implement DLS techniques. HLS employed a hierarchical master-worker execution model, and was implemented using the classical two-sided MPI communication and OpenMP threads. Unlike HDSS [22], HLS distributes the local masters across all physical computing nodes (cf. Figure 1c). The local masters communicate with the global master to request a new chunk when they have no more iterations to distribute between their workers. The main limitation of HLS is that the OpenMP worker threads distribute the work using an OpenMP worksharing region without the explicit use of the nowait clause.
This incurs implicit local synchronization at the OpenMP level on the local masters.

[Figure 1: Variants of the master-worker execution model as reported in the literature: (a) conventional master-worker execution model; (b) global and local masters located on a single physical compute node; (c) local masters distributed across multiple physical compute nodes. Replication of certain processing elements only indicates their double role, where the master participates in the computation as a worker.]

MPI RMA Model
In MPI, the memory of each process is by default private, and other processes cannot directly access it. The MPI RMA model allows MPI processes to publicly expose different regions of their memory, called windows. One MPI process (the origin) can directly access a memory window without any involvement of the other (target) process that owns the window. MPI RMA has two synchronization modes: passive-target and active-target. In active-target synchronization, the target process determines the time boundaries, called epochs, during which its window can be accessed. In passive-target synchronization, the target process places no time limits on when its window can be accessed. The present work focuses on passive-target synchronization because it requires a minimal amount of synchronization and efficiently allows the greatest overlap of computation and communication. Moreover, it renders the development of DLS techniques for distributed-memory systems very similar to their original implementations for shared-memory systems.
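As a brief illustration of the passive-target mode, consider the following minimal sketch (ours, not from any library discussed above), in which an origin process reads a value from the window of rank 0 without any action by rank 0:

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long *buf;
        MPI_Win win;
        /* Expose one long per process; MPI_Win_allocate lets MPI place the
           memory, enabling more efficient one-sided implementations. */
        MPI_Win_allocate(sizeof(long), sizeof(long), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);
        *buf = (long)rank;
        MPI_Barrier(MPI_COMM_WORLD);   /* windows initialized everywhere */

        if (rank != 0) {
            long value;
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);  /* passive target */
            MPI_Get(&value, 1, MPI_LONG, 0, 0, 1, MPI_LONG, win);
            MPI_Win_unlock(0, win);    /* epoch ends; value is now valid */
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

Rank 0 makes no matching call during the access, which is what distinguishes the passive-target mode from both two-sided communication and active-target synchronization.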
3 The Proposed Distributed Chunk-Calculation Approach

The main challenge in designing a distributed chunk-calculation approach is associated with the chunk-calculation functions of the DLS techniques. To calculate the current chunk to be assigned, these functions (except for SS) require either the value of the remaining loop iterations, R_i, or the value of the previous chunk, K_{i-1} (cf. Table 2). Therefore, the chunks have to be calculated in a sequential manner, i.e., two chunks cannot be calculated simultaneously, because the values of R_i and K_{i-1} change after each chunk-calculation. This serialization perfectly fits any master-worker-based execution approach, because the master serves one request at a time, and consequently, the chunk-calculation can be performed in a centralized and sequential fashion.

The approach proposed in this work introduces certain transformations of the respective chunk-calculation functions from Table 2, such that the chunk-calculation depends only on the index of the last scheduling step, i. These transformations are shown below in Equations 1-3.

    GSS:   K′_i = ⌈((P−1)/P)^i · N/P⌉                                (1)
    TSS:   K′_i = K_0 − i · ⌊(K_0 − K_{S-1})/(S−1)⌋                  (2)
    FAC2:  K′_i = ⌈(1/2)^{i_new} · N/P⌉,  i_new = ⌊i/P⌋ + 1          (3)

For GSS and FAC, the transformations were already introduced in the literature [5]. For TSS, the mathematical derivation of the transformation is as follows. Given that S, K_0, and K_{S-1} are constants, the TSS equation in Table 2 can be represented as follows:

    K_i = K_{i-1} − C, where C is a constant value,                  (4)
    C = ⌊(K_0 − K_{S-1})/(S−1)⌋.                                     (5)

Calculating K_1, K_2, K_3, ..., K_i using Equation 4:

    K_1 = K_0 − C                                                    (6)
    K_2 = K_1 − C = (K_0 − C) − C = K_0 − 2·C                        (7)
    K_3 = K_2 − C = (K_0 − 2·C) − C = K_0 − 3·C                      (8)
    K_i = K_0 − i·C                                                  (9)
    K_i = K_0 − i·⌊(K_0 − K_{S-1})/(S−1)⌋ = K′_i                     (10)

WF uses the chunk-calculation function of FAC and can naturally inherit the transformed FAC function.
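The transformed functions can be evaluated by any processing element from the scheduling step index alone. A minimal C sketch of Equations 1-3 follows (function names are illustrative, not from the authors' library):

    #include <math.h>

    long gss_chunk_closed(long i, long N, long P) {       /* Equation 1 */
        return (long)ceil(pow((P - 1.0) / P, (double)i) * N / P);
    }

    long tss_chunk_closed(long i, long N, long P) {       /* Equation 2 */
        long K0   = (long)ceil(N / (2.0 * P));
        long KS_1 = 1;
        long S    = (long)ceil(2.0 * N / (K0 + KS_1));
        return K0 - i * ((K0 - KS_1) / (S - 1));
    }

    long fac2_chunk_closed(long i, long N, long P) {      /* Equation 3 */
        long i_new = i / P + 1;   /* floor(i/P) + 1 via integer division */
        return (long)ceil(pow(0.5, (double)i_new) * N / P);
    }

Because no shared state beyond i is read, two processing elements holding different values of i can calculate their chunks simultaneously.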
The proposed approach uses Equations 1-3 to distribute the chunk-calculation across all processing elements. Thus, only one processing element (called the coordinator) stores the index of the last scheduling step, i, and the index of the last scheduled loop iteration, lp_start.

Figure 2 illustrates the main steps of the proposed distributed chunk-calculation approach:

Step 1. The processing element p_j obtains a copy of the last scheduling step index, i, and atomically increments the global i by one.

Step 2. p_j uses only its local copy of i (before the increment) to calculate K_i with the function of the selected DLS technique (Equations 1-3).

Step 3. p_j obtains a copy of the last loop index start, lp_start, and atomically accumulates the size of the calculated chunk, K_i, into it.

Finally, p_j executes the loop iterations between lp_start (before the accumulation) and min(lp_start + K_i, N). The atomic operations in Steps 1 and 3 guarantee exclusive access to i and lp_start.

The MPI RMA model provides the necessary function calls that can be used in the implementation of the proposed approach. For instance, the coordinator MPI process can use MPI_Win_create to expose the shared variables, such as i and lp_start, to all other MPI processes. The passive-target synchronization mode (MPI_Win_lock(MPI_LOCK_SHARED)) can be used with certain MPI atomic operations, such as MPI_Get_accumulate, to grant exclusive access to i and lp_start by all MPI processes. For more information regarding the implementation, the reader is referred to the code, which is developed under the LGPL license and available online [26].

[Figure 2: The proposed distributed chunk-calculation approach using MPI RMA and passive-target synchronization. The coordinator owns the memory region holding the last scheduling step index i and the last start loop index lp_start; each worker (1) atomically gets a copy of i and increments the original by one, (2) locally calculates K_i, and (3) atomically gets a copy of lp_start and increments the original by K_i.]

Figure 3 illustrates the DLS execution using the proposed distributed chunk-calculation approach. The calculation of chunks K_0 and K_1 is distributed between processors p_0 and p_1. The time required to calculate K_0 overlaps with the time taken to calculate K_1. In the traditional master-worker execution model, there is no such overlap, since all the chunk calculations are centralized and performed by the master in sequence. The time required to serve the first work request (including chunk-calculation and chunk-assignment) delays the second work request. Moreover, the time required to serve the work requests is proportional to the processing capabilities of the master processor, which may result in additional delays, as discussed in Section 1.

The proposed distributed chunk-calculation approach may result in a different ordering of assigning and executing loop iterations compared to the traditional master-worker execution model. For instance, when GSS is the chosen scheduling technique in Figure 3 and N = 10, p_0 obtains a local copy of the last scheduling index, i = 0, at t_0. Also, p_1 obtains at t_1 a local copy of the last scheduling index, i = 1. Both p_0 and p_1 use their copies of i and calculate K_0 = 5 and K_1 = 3, respectively. The proposed approach does not guarantee that p_0 and p_1 will execute the loop iterations from lp_start = 0 to lp_start = 4 and from lp_start = 5 to lp_start = 7, respectively. Figure 3 shows the case when the chunk-calculation on p_0 takes longer than on p_1, which results in assigning p_1 the loop iterations between lp_start = 0 and lp_start = 2, while p_0 is assigned the loop iterations between lp_start = 3 and lp_start = 7. Given that the DLS techniques address, by design, independent loop iterations with no restrictions on the monotonicity of the loop execution, the proposed approach does not affect the correctness of the loop execution.

[Figure 3: Schematic execution of the proposed distributed chunk-calculation approach on two processors that calculate one chunk each.]
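The following sketch shows how Steps 1-3 can be expressed with these calls (a minimal sketch under stated assumptions: the coordinator is rank 0; its window holds the two long counters i and lp_start and was created with disp_unit = sizeof(long); MPI_Fetch_and_op, the single-element variant of MPI_Get_accumulate, performs the atomic read-modify-write; the name dls_get_chunk is ours, and the authors' actual implementation is available online [26]):

    #include <mpi.h>

    #define IDX_I        0   /* displacement of i in the coordinator window */
    #define IDX_LPSTART  1   /* displacement of lp_start                    */

    /* Returns the chunk size and the iteration range [first, last). */
    long dls_get_chunk(MPI_Win win, long N, long P,
                       long (*chunk_of_step)(long i, long N, long P),
                       long *first, long *last) {
        long one = 1, i, k, start;

        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);

        /* Step 1: get a copy of i and atomically increment the original. */
        MPI_Fetch_and_op(&one, &i, MPI_LONG, 0, IDX_I, MPI_SUM, win);
        MPI_Win_flush(0, win);        /* i must be valid before Step 2 */

        /* Step 2: purely local chunk-calculation (Equations 1-3). */
        k = chunk_of_step(i, N, P);

        /* Step 3: get a copy of lp_start and atomically add K_i to it. */
        MPI_Fetch_and_op(&k, &start, MPI_LONG, 0, IDX_LPSTART, MPI_SUM, win);

        MPI_Win_unlock(0, win);

        *first = start;                           /* value before accumulation */
        *last  = (start + k < N) ? start + k : N; /* min(lp_start + K_i, N)    */
        return (*first < N) ? k : 0;              /* 0: no iterations left     */
    }

A worker repeatedly calls dls_get_chunk and executes the iterations in [first, last) until the function returns 0; the coordinator participates as an ordinary worker, since its only special role is owning the window.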
4 Design of Experiments

In the present work, the performance of two different implementations of DLS techniques is assessed. The first implementation, denoted One Sided DLS, employs the proposed distributed chunk-calculation approach and uses one-sided MPI communication in the passive-target synchronization mode. The second implementation, denoted Two Sided DLS, employs a master-worker model and uses two-sided MPI communication. Both implementations assume a non-dedicated coordinator (or a non-dedicated master) processing element.
Selected Applications
Two computationally-intensive parallel applications are considered in this study. The first application, called PSIA [27], uses a parallel version of the well-known spin-image algorithm (SIA) [28]. SIA converts a 3D object into a set of 2D images. The generated 2D images can be used as descriptive features of the 3D object. The second application calculates the Mandelbrot set [29]. The Mandelbrot set is used to represent geometric shapes that have the self-similarity property at various scales. Studying such shapes is important and of interest in different domains, such as biology, medicine, and chemistry [30].

Both applications contain a single large parallel loop that dominates their execution times. Dynamic and static distributions of the most time-consuming parallel loop across all processing elements may enhance the applications' performance. The pseudocodes of both applications are listed in Algorithms 1 and 2. Table 3 summarizes the execution parameters used for both selected applications. These parameters were selected empirically to guarantee a reasonable average iteration execution time that is larger than 0.2 seconds.
ALGORITHM 1: Parallel spin-image calculations. The main loop is the outer loop over M.

    spinImagesKernel(W, B, S, OP, M)
    Inputs:  W: image width, B: bin size, S: support angle,
             OP: list of 3D points, M: number of spin-images
    Output:  R: list of generated spin-images

    for i = 0 → M do                    // main parallel loop
        P = OP[i]
        tempSpinImage[W, W]
        for j = 0 → length(OP) do
            X = OP[j]
            np_i = getNormalVector(P)
            np_j = getNormalVector(X)
            if acos(np_i · np_j) ≤ S then
                k = ⌈(W/2 − np_i · (X − P)) / B⌉
                l = ⌈sqrt(‖X − P‖² − (np_i · (X − P))²) / B⌉
                if 0 ≤ k < W and 0 ≤ l < W then
                    tempSpinImage[k, l]++
                end
            end
        end
        R.append(tempSpinImage)
    end

ALGORITHM 2: Mandelbrot set calculations. The main loop is the outer loop over W.

    mandelbrotSetCalculations(W, CT)
    Inputs:  W: image width, CT: conversion threshold
    Output:  V: visual representation of the Mandelbrot set calculations

    for counter = 0 → W do              // main parallel loop
        x = counter / W
        y = counter mod W
        c = complex(x_min + x/W*(x_max − x_min), y_min + y/W*(y_max − y_min))
        z = complex(0, 0)
        for k = 0 → CT while |z| < 2.0 do
            z = z⁴ + c
        end
        if k = CT then set V(x, y) to black
        else set V(x, y) to blue
        end
    end

Table 3: Execution parameters of both selected applications

Application  Input size              Output size     Other parameters [21, 30]
PSIA         800,000 3D points [31]  288,000 images  5x5 2D image, bin-size = 0.01, support-angle = 2
Mandelbrot   No input data           One image       image-width = 1152x1152, number of iterations = 1000, Z exponent = 4
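For concreteness, the escape-time iteration of Algorithm 2 may be written in C as follows (a sketch following the execution parameters in Table 3, i.e., Z exponent 4 and divergence threshold 2.0; the function name is ours):

    #include <complex.h>

    int mandelbrot_escape(double complex c, int conversion_threshold) {
        double complex z = 0.0;
        int k = 0;
        while (k < conversion_threshold && cabs(z) < 2.0) {
            z = z*z*z*z + c;   /* z = z^4 + c (Z exponent = 4, cf. Table 3) */
            k++;
        }
        return k;   /* k == conversion_threshold: the point did not diverge */
    }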
Hardware Platform Specifications

Two types of computing resources are used in this work. The first type, denoted KNL, refers to standalone Intel Xeon Phi 7210 manycore processors with 64 cores, 96 GB RAM (flat mode configuration), and 1.3 GHz CPU frequency. The second type, denoted Xeon, refers to two-socket Intel Xeon E5-2640 processors with 20 cores, 64 GB RAM, and 2.4 GHz CPU frequency.

These platform types are part of a fully-controlled computing cluster that consists of 26 nodes: 22 KNL nodes and 4 Xeon nodes. All nodes are interconnected in a non-blocking fat-tree topology. The network characteristics are: Intel Omni-Path fabric, 100 GBit/s link bandwidth, and 100 ns network latency. Each KNL node has one Intel Omni-Path host fabric interface adapter. Each Xeon node has two Intel Omni-Path host fabric interface adapters. All host fabric adapters use a single PCIe x16 100 Gbps port. As this computing cluster is actively used for research and educational purposes, only 40% of the cluster could be dedicated to the present work at the time of writing, specifically 288 cores out of the total 696 available cores.

In the present work, the total number of cores is fixed to 288, whereas the ratio between the KNL and the Xeon cores is varied. Two ratios have been considered: 2:1 represents the case when the KNL cores are the dominant type of computing resources, and 1:2 represents the complementary case where the Xeon cores are the dominant computing resources. Table 4 illustrates these two ratios. Also, 48 KNL cores and 16 Xeon cores per node are used, while the remaining cores on each node were left for other system-level processes.
Mapping of the Coordinator (or the Master) Process to a Certain Core
Two mapping scenarios are considered for the assessment of the proposed One Sided DLS approach vs. the Two Sided DLS approach. In the first mapping scenario, the process that plays the role of the coordinator for One Sided DLS or the role of the master for Two Sided DLS is mapped to a KNL core. The CPU frequency of a single KNL core is 1.3 GHz, while the CPU frequency of a single Xeon core is 2.4 GHz. Therefore, this mapping represents a case when the coordinator (or the master) process is mapped to one of the cores with the lowest processing capabilities. In the second mapping scenario, the process that plays the role of the coordinator (or the master) is mapped to a Xeon core, which is the most powerful processing element in the considered system. Comparing the results of both scenarios shows the adverse impact of reduced processing capabilities of the master on the performance of the DLS techniques using Two Sided DLS. On the contrary, the same mapping for the coordinator process did not affect the performance of the DLS techniques using One Sided DLS.

Table 4: Ratios between the KNL and Xeon cores

Ratio   KNL cores   Xeon cores   Total cores
2:1     192         96           288
1:2     96          192          288

5 Experimental Results

The straightforward parallelization (STATIC) is used as a baseline to assess the performance of the selected DLS techniques on the target heterogeneous computing platform. STATIC assigns ⌈N/P⌉ loop iterations to each processing element. The considered implementation of STATIC follows the self-scheduling execution model, where every worker obtains a single chunk of size ⌈N/P⌉ loop iterations at the beginning of the application execution. By employing STATIC, the percentage of the parallel execution time of the selected applications' main loops, T_p^loop, is 98% and 99.4% of the parallel execution time for PSIA and Mandelbrot, respectively. Such high percentages show that the performance of both applications is dominated by the execution time of the main loop. Hence, for the remaining results in this section, the analysis concentrates on the parallel loop execution time, T_p^loop. All experiments were repeated 20 times, and the median results are reported in all figures.

For the PSIA application, Figure 4a shows that SS, GSS, and TSS implemented with One Sided DLS outperformed their respective versions using
Two Sided DLS. For instance, when the ratio of the KNL cores to the Xeon cores was 2:1, the parallel loop execution time, T_p^loop, of SS required 109 and 233 seconds with One Sided DLS and Two Sided DLS, respectively. Similarly, when the ratio was 2:1, the parallel loop execution time, T_p^loop, of GSS and TSS increased from 185 and 125 seconds to 236 and 136 seconds, respectively.

When the ratio was 1:2, the total processing capabilities of the system increased because the number of Xeon cores increased. However, the parallel loop execution time, T_p^loop, of SS, GSS, and TSS implemented using Two Sided DLS did not take advantage of the increased total number of Xeon cores. In contrast, using One Sided DLS, changing the ratio from 2:1 to 1:2 reduced the T_p^loop of SS from 109 to 68.5 seconds. FAC and WF behaved similarly using both One Sided DLS and Two Sided DLS.

The performance degradation of the DLS techniques with Two Sided DLS is due to mapping the master to a KNL core, which has the lowest processing capabilities (cf. Section 4). Recall that in Two Sided DLS, the master is responsible for serving work requests, and therefore, it has to divide its time between serving the work requests and performing its own chunks. Therefore, if the master has lower processing capabilities than the other processes, it becomes a performance bottleneck. Also, recall that One Sided DLS is designed to address this scenario (Sections 1 and 3). The coordinator process executes its own chunks, and is not responsible for the calculation and the allocation of the chunks to the other processes.

Figure 4b shows that the DLS techniques with One Sided DLS perform comparably to their versions with Two Sided DLS. For instance, using the ratio 2:1, the One Sided DLS implementation of SS, GSS, TSS, FAC2, and WF required 108, 177, 125, 125, and 110 seconds, respectively. The Two Sided DLS implementation of the same techniques required 105, 175, 135.6, 125, and 106.45 seconds, respectively. Also, using the ratio 2:1, the DLS techniques behaved similarly regardless of their implementation approach.

For the Mandelbrot application, Figure 5 confirms the same performance advantages of the proposed approach as for the PSIA application. The DLS techniques implemented with One Sided DLS performed equally well whether the coordinator was mapped to a KNL core or to a Xeon core. The performance of certain DLS techniques with
Two Sided DLS degraded when the master was mapped to a KNL core compared to their performance when the master was mapped to a Xeon core.

[Figure 4: Performance of the proposed approach vs. the existing master-worker-based approach for the PSIA, for SS, GSS, TSS, FAC2, and WF with One Sided DLS (proposed), Two Sided DLS (default), and STATIC (baseline): (a) the coordinator | master is mapped to a KNL core; (b) the coordinator | master is mapped to a Xeon core. The x-axis represents the two ratios between the KNL cores and the Xeon cores; the y-axis represents the time in seconds.]

Overall, Figures 4 and 5 highlight two important observations. First observation:
The performance variation for executing a certain experiment using the One Sided DLS approach is higher than the performance variation when executing the same experiment using the Two Sided DLS approach. The reason behind such variation is the manner in which concurrent messages are implemented at the MPI layer in One Sided DLS and Two Sided DLS. In the current work, Intel MPI is used to implement both approaches, One Sided DLS and Two Sided DLS. Intel MPI uses the Lock Polling strategy to implement MPI_Win_lock, in which the origin process repeatedly issues lock-attempt messages to the coordinator process until the lock is granted [16]. On the contrary, Two Sided DLS uses the MPI_Send, MPI_Recv, and MPI_Iprobe functions. For Intel MPI, in the case of simultaneous sends of multiple work requests to the master process, the master checks the outstanding work requests using MPI_Iprobe, and serves them by giving priority to the request of the process with the smallest MPI rank. One Sided DLS has a high probability of granting the lock to different MPI processes at each trial, whereas Two Sided DLS always prioritizes requests from the process with the smallest MPI rank. GSS has the largest non-linear decrement among the decrements of the selected DLS techniques. Therefore, GSS is highly sensitive to the chunk-assignment order.
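For contrast, a minimal sketch of the master's service loop in a Two Sided DLS-style implementation (our illustration, not the authors' code) shows where this rank-order prioritization enters:

    #include <mpi.h>

    /* Illustrative master service loop for a two-sided master-worker DLS.
       The master interleaves this with executing its own iterations. */
    void serve_pending_requests(long *i, long *lp_start, long N, long P,
                                long (*chunk_of_step)(long, long, long)) {
        int flag;
        MPI_Status status;

        /* Probe for an outstanding work request from any worker; with several
           simultaneous requests, Intel MPI matches the smallest rank first
           (cf. the discussion above). */
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
        while (flag) {
            long dummy, reply[2];
            MPI_Recv(&dummy, 1, MPI_LONG, status.MPI_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* Centralized, sequential chunk-calculation at the master. */
            long k   = chunk_of_step(*i, N, P);
            reply[0] = *lp_start;   /* chunk start */
            reply[1] = k;           /* chunk size  */
            (*i)++;
            *lp_start += k;
            MPI_Send(reply, 2, MPI_LONG, status.MPI_SOURCE, 1, MPI_COMM_WORLD);
            MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &status);
        }
    }

Every request that arrives while the master is computing must wait; under One Sided DLS, the same request can be resolved via RDMA without involving the coordinator's CPU.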
[Figure 5: Performance of the proposed approach vs. the existing master-worker-based approach for the Mandelbrot set, for SS, GSS, TSS, FAC2, and WF with One Sided DLS (proposed), Two Sided DLS (default), and STATIC (baseline): (a) the coordinator | master is mapped to a KNL core; (b) the coordinator | master is mapped to a Xeon core. The x-axis represents the two ratios between the KNL cores and the Xeon cores; the y-axis represents the time in seconds.]

Second observation: FAC and WF exhibit a reduced sensitivity to mapping the master to a KNL or to a Xeon core. This low sensitivity could be due to the factoring-based nature of these techniques. Among all the assessed DLS techniques, FAC2 and WF assign chunks in batches, which increases the possibility for the master to have chunks of the same size as the other processing elements. However, further analysis is needed to better understand such reduced sensitivity.

6 Conclusion and Future Work

A number of DLS techniques have been revisited and re-evaluated in light of, and to enable them to benefit from, the significant advancements in modern HPC systems, both at the hardware and software levels. A distributed chunk-calculation approach (
One Sided DLS) has been proposed herein and is implemented using MPI RMA and atomic read-modify-write operations with the passive-target synchronization mode. The One Sided DLS approach performs competitively against existing approaches, such as Two Sided DLS, which uses MPI two-sided communication and employs the conventional master-worker execution model. One Sided DLS has the potential to alleviate the master-worker level contention of Two Sided DLS in large-scale HPC systems.

The present work revealed interesting aspects, planned as future work. The performance of the two approaches considered herein, One Sided DLS and Two Sided DLS, is planned to be assessed with additional applications, specifically applications that require the return of intermediate results upon the execution of each chunk of work. These applications will help to assess the impact of the data distribution on the One Sided DLS approach. The scalability of the proposed One Sided DLS approach also requires further study and analysis.
Acknowledgment
This work has been supported by the Swiss National Science Foundation in the context of the Multi-level Scheduling in Large Scale High Performance Computers (MLS) project, grant number 169123, and by the Swiss Platform for Advanced Scientific Computing (PASC) project SPH-EXA: Optimizing Smooth Particle Hydrodynamics for Exascale Computing.

References

[1] Z. Fang, P. Tang, P.-C. Yew, and C.-Q. Zhu, "Dynamic processor self-scheduling for general parallel nested loops," IEEE Transactions on Computers, vol. 39, no. 7, pp. 919-929, 1990.

[2] T. Peiyi and Y. Pen-Chung, "Processor Self-Scheduling for Multiple-Nested Parallel Loops," in Proceedings of the International Conference on Parallel Processing, August 1986, pp. 528-535.

[3] C. D. Polychronopoulos and D. J. Kuck, "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," IEEE Transactions on Computers, vol. 100, no. 12, pp. 1425-1439, 1987.

[4] T. H. Tzen and L. M. Ni, "Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 1, pp. 87-98, 1993.

[5] S. Flynn Hummel, E. Schonberg, and L. E. Flynn, "Factoring: A method for scheduling parallel loops," Communications of the ACM, vol. 35, no. 8, pp. 90-101, 1992.

[6] S. Flynn Hummel, J. Schmidt, R. Uma, and J. Wein, "Load-sharing in heterogeneous systems via weighted factoring," in Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 318-328.

[7] I. Banicescu, V. Velusamy, and J. Devaprasad, "On the scalability of dynamic scheduling scientific applications with adaptive weighted factoring," Journal of Cluster Computing, vol. 6, no. 3, pp. 215-226, 2003.

[8] W.-C. Shih, C.-T. Yang, and S.-S. Tseng, "A performance-based parallel loop scheduling on Grid environments," The Journal of Supercomputing, vol. 41, no. 3, pp. 247-267, 2007.

[9] A. T. Chronopoulos, S. Penmatsa, N. Yu, and D. Yu, "Scalable loop self-scheduling schemes for heterogeneous clusters," International Journal of Computational Science and Engineering, vol. 1, no. 2-4, pp. 110-117, 2005.

[10] I. Banicescu, R. L. Cariño, J. P. Pabico, and M. Balasubramaniam, "Design and implementation of a novel dynamic load balancing library for cluster computing," Journal of Parallel Computing, vol. 31, no. 7, pp. 736-756, 2005.

[11] R. L. Cariño and I. Banicescu, "A load balancing tool for distributed parallel loops," Journal of Cluster Computing, vol. 8, no. 4, pp. 313-321, 2005.

[12] C.-C. Wu, C.-T. Yang, K.-C. Lai, and P.-H. Chiu, "Designing parallel loop self-scheduling schemes using the hybrid MPI and OpenMP programming model for multi-core grid systems," The Journal of Supercomputing.

[13] Message Passing Interface Forum, "MPI-2: Extensions to the Message-Passing Interface," 1997.

[14] International Journal of High Performance Computing and Networking, vol. 1, no. 1-3, pp. 91-99, 2004.

[15] T. Hoefler, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, and K. Underwood, "Remote memory access programming in MPI-3," ACM Transactions on Parallel Computing, vol. 2, no. 2, p. 9, 2015.

[16] X. Zhao, P. Balaji, and W. Gropp, "Scalability Challenges in Current MPI One-Sided Implementations," in International Symposium on Parallel and Distributed Computing, 2016, pp. 38-47.

[17] H. Zhou and J. Gracia, "Asynchronous progress design for a MPI-based PGAS one-sided communication system," in The International Conference on Parallel and Distributed Systems, 2016, pp. 999-1006.

[18] J. R. Hammond, S. Ghosh, and B. M. Chapman, "Implementing OpenSHMEM using MPI-3 one-sided communication," in Workshop on OpenSHMEM and Related Technologies, 2014, pp. 44-58.

[19] H. Shan, S. Williams, Y. Zheng, W. Zhang, B. Wang, S. Ethier, and Z. Zhao, "Experiences of Applying One-sided Communication to Nearest-neighbor Communication," in Proceedings of the First Workshop on PGAS Applications, 2016, pp. 17-24.

[20] I. Banicescu and S. Flynn Hummel, "Balancing Processor Loads and Exploiting Data Locality in N-body Simulations," in Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, December 1995, pp. 43-43.

[21] A. Eleliemy, A. Mohammed, and F. M. Ciorba, "Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling," in Proceedings of the 8th International Workshop on Multicore and Multithreaded Architectures and Algorithms of the 19th IEEE International Conference for High Performance Computing and Communications, December 2017, p. 8.

[22] A. T. Chronopoulos, S. Penmatsa, N. Yu, and D. Yu, "Scalable Loop Self-Scheduling Schemes for Heterogeneous Clusters," International Journal of Computational Science and Engineering, vol. 1, no. 2-4, pp. 110-117, 2005.

[23] R. L. Cariño and I. Banicescu, "Dynamic load balancing with adaptive factoring methods in scientific applications," The Journal of Supercomputing, vol. 44, no. 1, pp. 41-63, 2008.

[24] A. T. Chronopoulos, R. Andonie, M. Benche, and D. Grosu, "A class of loop self-scheduling for heterogeneous clusters," in Proceedings of the International Conference on Cluster Computing, 2001, pp. 282-291.

[25] K. Barker, A. Chernikov, N. Chrisochoides, and K. Pingali, "A load balancing framework for adaptive and asynchronous applications," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 2, pp. 183-192, 2004.

[26] A. Eleliemy, "The distributed-chunk calculation approach of dynamic loop scheduling techniques," https://c4science.ch/source/dls_MPI_RMA/, [Online; accessed 10 December 2018].

[27] A. Eleliemy, M. Fayze, R. Mehmood, I. Katib, and N. Aljohani, "Loadbalancing on Parallel Heterogeneous Architectures: Spin-image Algorithm on CPU and MIC," in Proceedings of the 9th EUROSIM Congress on Modelling and Simulation, September 2016, pp. 623-628.

[28] A. E. Johnson, "Spin-Images: A Representation for 3-D Surface Matching," Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, August 1997.

[29] B. B. Mandelbrot, "Fractal aspects of the iteration of z → λz(1 − z) for complex λ and z," Annals of the New York Academy of Sciences, vol. 357, no. 1, pp. 249-259, 1980.

[30] P. Jovanovic, M. Tuba, and D. Simian, "A new visualization algorithm for the Mandelbrot set," in Proceedings of the 10th WSEAS International Conference on Mathematics and Computers in Biology and Chemistry, 2009, pp. 162-166.

[31] K. Wang, G. Lavoué, F. Denis, A. Baskurt, and X. He, "A benchmark for 3D mesh watermarking," in