An Adaptive Self-Scheduling Loop Scheduler
Auto Adaptive Irregular OpenMP Loops
Joshua Dennis Booth, University of Alabama in Huntsville, Huntsville, AL 35899, USA. [email protected]
Abstract.
OpenMP is a standard for parallelization due to the ease of programming parallel-for loops in a fork-join manner. Many shared-memory applications are implemented using this model despite it not being ideal for applications with high load imbalance, such as those that make irregular memory accesses. One parameter, i.e., chunk size, is made available to users in order to mitigate performance loss. However, this parameter depends on architecture, system load, application, and input, making it difficult to tune. We present an OpenMP scheduler that adaptively tunes chunk size for unbalanced applications that make irregular memory accesses. In particular, this method (iCh) uses work-stealing for imbalance and adapts chunk size using a force-feedback model that approximates the variance of task length in a chunk. This scheduler has low overhead and allows for active load balancing while the applications are running. We demonstrate this using both sparse matrix-vector multiplication (spmv) and Betweenness Centrality (BC) and show that iCh can achieve average speedups close (i.e., within 1.× for spmv and 1.× for BC) to those of OpenMP loops scheduled with dynamic or work-stealing methods that had chunk size tuned offline.

The traditional fork-join model of programming has remained popular due to the ease of expressing loops that are rich with parallelism in scientific applications. OpenMP is a popular programming interface due to making use of the fork-join model via programming pragmas, making the parallelization of loops effortless [7]. However, the interface has limited options for the assignment of tasks to threads when the tasks vary widely in time and resources. The only option that the runtime user is given access to is a shared static chunk size, in which the user can define the number of tasks each thread should process before requesting more from a centralized queue.
In order to combat the limitations of this scheduling method, many modern programs are being converted to use dynamic task queues in a tasking model, and additional support for this model has been added to OpenMP [17,18]. However, dynamic task queues come with a certain amount of overhead and are not as well supported by legacy applications. In this work, we provide an OpenMP parallel-for schedule with an independent, auto-tuned chunk size per thread that works with work-stealing, allowing users to continue to use the favored fork-join model on applications whose task lengths vary throughout execution or when they cannot pay the overhead of tuning for chunk size. We call this method iCh (irregular Chunk), and it provides a middle ground between traditional fork-join OpenMP scheduling with OpenMP's built-in dynamic scheduler that uses a static chunk size and dynamic task queues. In particular, this method is targeted towards applications that have tasks that are unbalanced and make many irregular accesses. Some of the common applications in this area are sparse linear algebra kernels (e.g., sparse matrix-vector multiplication and sparse triangular solve) and graph algorithms (e.g., Betweenness Centrality). Despite these kinds of codes being predominant in high-performance computing, most schedules are not designed with them in mind.

Currently, tuning the parameters of modern many-core systems for optimal performance can be difficult and not portable between machines; see Section 5. Making the chunk size smaller allows for more flexibility when the runtimes of individual tasks in the chunk vary greatly, but at the cost of making more requests from the centralized queue. Making the chunk size larger reduces the time spent making requests from a centralized queue, but can result in more load imbalance.
Additionally, a single application may have multiple phases, i.e., subsections of code that have their own unique performance and energy concerns [3], and these phases may each need their own chunk size to have good performance. Therefore, the choice of best chunk size is dependent on the implemented algorithm (kernel), the hardware microarchitecture (arch), input (input), and current system load (system), i.e., chunk size = f(kernel, arch, input, system). Tuning such a chunk size may be impractical and not reflective of the true system under load. As such, a light-weight auto-tuning method like iCh provides one solution.
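For concreteness, the knob described above is the chunk-size argument of OpenMP's dynamic schedule. The sketch below is our own illustration (not code from the paper); compile with -fopenmp to enable the pragma:

```c
#define N 100000

/* Sum with artificially irregular per-iteration work. Under
 * schedule(dynamic, chunk), each thread grabs `chunk` iterations at a
 * time from a shared counter: a small chunk adds contention on that
 * counter, while a large chunk risks load imbalance. */
double sum_irregular(int chunk) {
    double s = 0.0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:s)
    for (int i = 0; i < N; i++) {
        int work = (i % 97) + 1;   /* stand-in for an irregular task length */
        for (int j = 0; j < work; j++)
            s += 1.0;
    }
    return s;
}
```

The result is independent of the chunk size; only the runtime changes, which is exactly why chunk size can be tuned without affecting correctness.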
This section describes the current issues and room for improvement related to the adaptive scheduling of tasks with irregular accesses and execution times.

Fig. 1. Representations of irregular inputs. (a) arabic-2005 in natural ordering; (b) arabic-2005 in RCM ordering; (c) number of rows binned together based on nonzero count in increments of 50 for arabic-2005 (y-axis in log scale).
Irregular kernels.
Many common applications require calls to kernels (i.e., important common implementations of key algorithms) such as those that deal with sparse matrices and graph algorithms. Examples include graph mining and web crawls. However, these kernels require a great deal of tuning based on both the computer system and the algorithm input to perform optimally. Many different programming models are used to implement these kernels, but one of the most common is the fork-join model. Additionally, many of these kernels are memory-bound even when optimally programmed [23]. This means that many memory requests are already waiting to be fulfilled, and additional requests will have high latency on an already busy memory system.
Sparse Matrix-Vector Multiplication (spmv). spmv is a highly studied and optimized kernel due to its importance in many applications [12,16,19,21]. However, the irregular structure of the sparse coefficient matrix makes this difficult. If a one-dimensional layout is applied, the smallest task of work is the multiplication of all nonzeros in a matrix row by the corresponding vector entries, summed together at the end. Figure 1a presents the nonzero structure (i.e., blue representing nonzero entries and white representing zero entries) of the input matrix arabic-2005 in its natural order. The natural order is the one provided by the input file, and many times this ordering has some relationship to how elements are normally processed or laid out on the system. From afar, a static assignment of rows may seem like a logical choice. To investigate, we bin rows based on nonzero counts in increments of 50, such that the first bin counts the rows with 1-50 nonzeros and the second bin counts the rows with 51-100 nonzeros. In Fig. 1c, we provide the tally of the number of rows in each bin (in log scale) for the first 50 bins. We note how much variation (σ = 3.×) and imbalance of work exists, such as in the last two dots representing bins 49 and 50. Additionally, matrices are often preordered based on the application to provide some structure, such as permuting nonzeros towards the diagonal or into a block structure. One such common permutation is RCM [6]. This little structure can provide some benefits to hand-tuned codes [2,12,19] that can use the newly found structure to better load balance work. However, Fig. 1b shows that this could make balancing even harder if rows were assigned linearly. Though orderings like RCM may improve execution time [6,10,12], these orderings may make tuning for chunk size more important.

Betweenness Centrality (BC). The BC metric captures the importance of nodes in a graph and concerns the ratio of all shortest paths to those that pass through a given node. State-of-the-art implementations of BC are normally implemented as multiple parallel breadth-first searches [4,5,10,14]. Therefore, the work of a task will depend on the number of neighboring nodes and the nodes currently queued in the search front. However, relatively good speedups can be achieved with input reorderings and a smart chunk size.

Irregular kernel insight.
Despite the irregular nature based on input, there normally exists some local structure. For example, rows or subblocks within a matrix that have a large number of nonzeros could be grouped based on some ordering. The same can also be applied to graph algorithms like BC. Even if the given input does not come with this structure, the input can be permuted to have it. This type of permutation or reordering is commonly done to improve performance [10,12,16,19]. Therefore, a thread could, in fact, adapt its own chunk size to fit the local task length. Moreover, if a thread is finished with its own work, it could be intelligent in stealing work based on the workload of others. This does require some computational overhead, such as keeping track of workload and communicating this workload to nearby neighbors. Since most of these applications are memory-bound, there does exist a certain amount of availability of computational resources and time during computing.
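The one-dimensional row decomposition described for spmv above can be sketched in a few lines; the per-row loop bound ptr[i+1] − ptr[i] is exactly the task length that varies from row to row. The struct and names are our own illustration:

```c
/* Compressed Sparse Row (CSR) matrix-vector multiply, y = A*x.
 * With a one-dimensional layout, the smallest task is one row, so a
 * task's length is proportional to that row's nonzero count -- the
 * source of the imbalance shown in Fig. 1c. */
typedef struct {
    int n;             /* number of rows                 */
    const int *ptr;    /* row pointers, length n+1       */
    const int *idx;    /* column indices of the nonzeros */
    const double *val; /* values of the nonzeros         */
} csr_t;

void spmv(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {          /* one row = one task */
        double s = 0.0;
        for (int k = A->ptr[i]; k < A->ptr[i + 1]; k++)
            s += A->val[k] * x[A->idx[k]];    /* irregular access into x */
        y[i] = s;
    }
}
```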
The following steps are considered to construct an adaptive schedule.
Initialization.
Standard methods like dynamic scheduling in libgomp use a centralized queue and a single chunk size for all threads, but do not scale well with the number of tasks and threads needed to service many-core systems. Therefore, local queues are constructed for each thread, denoted q_i where i ∈ {0, 1, . . . , p−1} is the thread id for p threads. A local structure that is memory aligned and allocated using a first-touch allocation policy contains a pointer to the local queue, a local counter (k_i), and a variable used to calculate chunk size (d_i). The tasks are evenly distributed to task queues such that |q_i| = n/p, where n is the total number of tasks. Additionally, k_i = 0 and d_i = p, such that the initial chunk size is n/p², i.e., chunk size = |q_i|/d_i. The rationale for this choice is that the scheduler wants to allow a chunk size small enough that the other p−1 threads could steal from it later. Moreover, the chunk size is smaller as p increases, allowing for the variation of tasks that comes with more threads.

Local adaption.
In traditional work-stealing methods, the chunk size is fixed, and any load imbalance is mitigated through work-stealing once all the tasks in the initial queue are already executed [9]. However, a thread can only steal work that is not already being actively processed, i.e., not in the active chunk. Therefore, making the chunk size too large to start will result in a load imbalance that the scheduler may not be able to recover from using work-stealing. Additionally, making the chunk size too small would result in added overhead and possibly more time to converge.

In iCh, work-stealing is still the workhorse for imbalance, as in the work-stealing methods seen in the next step. However, iCh tries to locally adapt chunk size to better fit the variation in execution time of tasks, not the load balance. This variation is very important in irregular applications, as tasks may vary greatly in the number of floating-point operations and memory requests. Additionally, a single core that is mapped to a local thread and its queue can vary in voltage, frequency, and memory bandwidth due to load on the system [3]. Because of all these variations, a static shared chunk size has limitations. Despite iCh's goal, it does have an implicit impact on load balancing, as it reflects the arguments related to chunk size in the previous paragraph.

This method tries to capture variation in three categories: high, normal, and low. If high, the task's work length varies more than if low, and a smaller chunk size will allow for more adaption and possibly work-stealing. The thought process is the opposite for low. The calculation of "true" variation is very expensive, as this requires accurate measurements of time, operations, and memory requests, in addition to a global view of the average. Therefore, a very rudimentary estimate is used as follows. The local variable k_i keeps track of the running total of the number of tasks completed, updated after a whole chunk is finished, to estimate task length and limit the number of writes. After completion of the assigned chunk, the local thread will determine its load relative to other threads using k_i. A thread classifies its load relative to other threads as:

low: k_i < (Σ_{j=0}^{p−1} k_j)/p − ε;  normal: (Σ_{j=0}^{p−1} k_j)/p − ε ≤ k_i ≤ (Σ_{j=0}^{p−1} k_j)/p + ε;  high: k_i > (Σ_{j=0}^{p−1} k_j)/p + ε.

In particular, this approximation simply compares the thread's workload to the average workload of the threads. We note that if iCh's goal for chunk size were load balance, then the high and low classifications would be flipped, as a thread that does fewer iterations than average would have heavier tasks. The parameter ε is added to allow for slight variation and to reduce the number of times chunk size is updated. Through trials, we show in Section 5 that an ε > (1/3)·(Σ_{j=0}^{p−1} k_j)/p (i.e., ε > 33% of the current average) is generally sufficient, and minor changes to ε make little change to runtime for our kernels. For simplicity, we reference ε in the remainder of the paper by only the percentage, e.g., ε = 33%. This observation allows iCh to be used across different applications, systems, and inputs without hand-tuning by the user to achieve "good" speedups. Moreover, k_i is a running total and ε is fixed. As a result of this relationship, iCh is more likely to adapt chunk size only in the beginning due to extremely large variance, and the possibility of adapting due to smaller variance increases with execution.

As noted previously, d_i is used to directly adjust the chunk size, i.e., chunk size = |q_i|/d_i. After classification, d_i is adjusted as follows. If the thread is under low variation, the number of tasks in a chunk is increased by setting d_i = d_i / 2; if under high variation, the chunk is shrunk by setting d_i = d_i × 2. These increases and decreases are in contrast to what most may expect. In particular, this update is because of the optimization goal. The optimization goal is not to have the chunk size converge to the same value for each thread. In contrast, the goal is for the local chunk size to be adapted to the variation.
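Putting the initialization and local-adaptation rules together, a sketch of the per-thread state might look as follows (the struct layout and helper names are ours; the paper's libgomp implementation is not shown):

```c
/* Per-thread iCh state, padded toward a cache line so first-touch
 * allocation keeps each record local to its owning thread. */
typedef struct {
    long head, tail;  /* bounds of the local queue q_i          */
    long k;           /* k_i: running count of completed tasks  */
    long d;           /* d_i: divisor; chunk size = |q_i| / d_i */
    char pad[64 - 4 * sizeof(long)];
} ich_state_t;

/* Distribute n tasks evenly over p queues with k_i = 0 and d_i = p,
 * so the initial chunk size is (n/p)/p = n/p^2: small enough that
 * the other p-1 threads can still steal useful work later. */
void ich_init(ich_state_t *st, long n, int p) {
    for (int i = 0; i < p; i++) {
        st[i].head = i * (n / p);
        st[i].tail = (i == p - 1) ? n : (i + 1) * (n / p);
        st[i].k = 0;
        st[i].d = p;
    }
}

long ich_chunk(const ich_state_t *s) {
    long c = (s->tail - s->head) / s->d;  /* |q_i| / d_i */
    return c > 0 ? c : 1;                 /* never hand out zero tasks */
}

/* After finishing a chunk: compare k_i with the average workload and
 * adjust d_i. eps is the tolerance as a fraction of the average
 * (e.g. 0.33 for the 33% setting discussed in the text). */
long ich_adapt(long ki, long sum_k, int p, double eps, long di) {
    double avg = (double)sum_k / p;
    if (ki < avg - eps * avg)            /* low variation  */
        return di > 1 ? di / 2 : 1;      /*   -> larger chunks  */
    if (ki > avg + eps * avg)            /* high variation */
        return di * 2;                   /*   -> smaller chunks */
    return di;                           /* normal: keep chunk size */
}
```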
Remote work-stealing.
At some point, the local queues will start to run out of work, and then work-stealing is used. Many implementations of work-stealing methods fix a chunk size and use the THE protocol [9,11] to try to steal and back off if a problem occurs, while trying to minimize the number of locks required. A victim is normally picked at random, and the stealing thread will normally try to steal half the remaining work from the victim.

The iCh method is very similar to the traditional method above. A victim is selected at random, and half of the victim's remaining tasks are stolen. Additionally, the stealing thread's d_i and k_i are updated based on the victim's d_j and k_j by taking the average of both, i.e., d_i = (d_i + d_j)/2 and k_i = (k_i + k_j)/2. The reasoning for this is as follows. The stealing thread knows some information from the victim. However, the stealing thread does not know the accuracy of that information and tries to average out the uncertainty with its own knowledge.
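A sketch of this stealing step, with the averaging of the victim's d_j and k_j (again our own illustration; the real implementation guards the queue bounds with the THE protocol's locks):

```c
typedef struct { long head, tail, k, d; } q_state_t;  /* simplified view */

/* Steal half of the victim's remaining tasks, then blend the victim's
 * d_j and k_j into the thief's state by averaging, treating the
 * victim's information as uncertain. Returns 0 if nothing to steal. */
int ich_steal(q_state_t *thief, q_state_t *victim) {
    long remaining = victim->tail - victim->head;
    if (remaining <= 0)
        return 0;
    long half = remaining / 2 > 0 ? remaining / 2 : 1;
    victim->tail -= half;              /* victim keeps the front portion */
    thief->head = victim->tail;        /* thief takes the back portion   */
    thief->tail = victim->tail + half;
    thief->d = (thief->d + victim->d) / 2;  /* d_i = (d_i + d_j) / 2 */
    thief->k = (thief->k + victim->k) / 2;  /* k_i = (k_i + k_j) / 2 */
    return 1;
}
```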
Test system.
Bridges-RM at the Pittsburgh Supercomputing Center [20] is used for testing. The system contains two Intel Xeon E5-2695 v3 (Haswell) processors, each with 14 cores, and 128GB DDR4-2133. Other microarchitectures, such as Intel Skylake, were also tested, but results did not vary much. We implement iCh inside of GNU libgomp. Codes on Haswell are compiled with GCC 4.8.5 (OpenMP 3.1). OpenMP threads are bound to cores with OMP_PROC_BIND=true and OMP_PLACES=cores.

Test inputs.
The same test suite of inputs is used for both spmv and BC. Table 1 contains the inputs taken from the SuiteSparse Collection [8], where the numbers of vertices and edges are reported in millions (i.e., 10^6). Inputs are picked due to their size, variation of density, and application areas. In particular, four application areas are of particular interest: Freescale, a collection from circuit simulation of semiconductors; DIMACS, a collection from the DIMACS challenge that is designed to further the development of large graph algorithms; LAW, a collection from the Laboratory for Web Algorithms of web crawls used to research data compression techniques; and GenBank, a collection of protein k-mer graphs. Furthermore, we report the average row density (x̄), the ratio of the maximal number of outgoing edges for a vertex over the minimal number of outgoing edges for a vertex (ratio), and the variance of the number of outgoing edges (σ) for each input. These numbers provide a sense of how sparse the inputs are and how unevenly work is distributed per vertex. Some inputs are very balanced, such as input I8 (hugebubbles). Others have more variance, like input I2 (uk-2005).

In this section, we observe the numerical results of using three different schedules for OpenMP for-loops: dynamic (Dyn), work-stealing (WS), and iCh. OpenMP guided and task models were also tested, but they did not provide any insight. The work-stealing method is the same used by iCh, but with a static chunk size. For both dynamic and work-stealing, we test with chunk size in a collection C of six values. The performance (i.e., time T(i, o, s, c, p), where i is the input, o is the ordering, s is the schedule, c is chunk size, and p is the number of cores) varies greatly with chunk size, application, and input. Therefore, we often speak of the best time over all tested chunk sizes as: BestT(i, o, s, p) = min_{c ∈ C} T(i, o, s, c, p). Likewise, we define WorstT as the max, and the second best (SecBT), which will be used throughout this section. Additionally, each timed experiment is repeated 10 times, and the time used in this section is the average of the 10 runs. The number of runs is important for two reasons. The first is that all runs have a small amount of fluctuation due to the system. Additionally, all the scheduling tests may change slightly from run to run, as victims are selected at random and read-write orders vary in dynamic.
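The summary statistics just defined are simply the minimum, maximum, and second-smallest entries of the measured times over the chunk-size collection C; a hypothetical helper (ours, not from the paper):

```c
#include <float.h>

/* Compute BestT, WorstT, and SecBT over the nc measured times t[],
 * i.e., the min, max, and second-smallest time over the chunk sizes. */
void time_stats(const double *t, int nc,
                double *best, double *worst, double *secb) {
    *best = *secb = DBL_MAX;
    *worst = -DBL_MAX;
    for (int c = 0; c < nc; c++) {
        if (t[c] > *worst) *worst = t[c];
        if (t[c] < *best) { *secb = *best; *best = t[c]; }
        else if (t[c] < *secb) *secb = t[c];
    }
}
```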
Input graphs. Vertex and edge counts in millions. x̄: average number of outgoing edges per vertex. ratio: maximal number of outgoing edges over minimal number of outgoing edges. σ: variance of the number of outgoing edges.

Input                Area       |V|    |E|    x̄     ratio   σ
I1:  FullChip        Freescale  2.9    26.6   8.9    1.1e6   3.2e6
I2:  circuit5M_dc    Freescale  3.5    14.8   4.2    12      1
I3:  wikipedia       Gleich     3.5    45     12.6   1.8e5   6.2e4
I4:  patents         Pajek      3.7    14.9   3.9    762     31.5
I5:  AS365           DIMACS     3.7    22.7   5.9    4.6     0.7
I6:  delaunay_n23    DIMACS     8.3    50.3   5.9    7       1.7
I7:  wb-edu          Gleich     9.8    57.1   5.8    2.5e4   2.0e3
I8:  hugebubbles-10  DIMACS     19.4   58.3   2.9    1       0
I9:  arabic-2005     LAW        22.7   639.9  28.1   5.7e5   3.0e5
I10: road_usa        DIMACS     23.9   57.7   2.4    4.5     0.8
I11: nlpkkt240       Schenk     27.9   760.6  27.1   4.6     4.8
I12: uk-2005         LAW        39.4   936.3  23.7   1.7e6   2.7e6
I13: kmer_P1a        GenBank    139.3  297.8  2.1    20      0.4
I14: kmer_A2a        GenBank    170.7  360.5  2.1    20      0.3
I15: kmer_V1r        GenBank    214    465.4  2.1    4       0.3
Ordering.
The execution time, energy usage, and scalability of most irregular applications are dependent on the input. For our two test applications, ordering is also important [6,10]. In order to demonstrate this for our runs, we consider both the RCM ordering and natural ordering (NAT). We define the percent relative error due to ordering as

REO(i, d, p) = |BestT(i, RCM, d, p) − BestT(i, NAT, d, p)| / BestT(i, NAT, d, p) × 100,

where d is the dynamic schedule. Figure 2a presents REO for spmv over the different numbers of cores, with each dot representing one matrix from the test suite. Note that REO can be under 10% for some inputs, but for the majority of the inputs it is larger. Figure 2b presents REO for BC over the different numbers of cores, with each dot representing one matrix from the test suite. Again, we notice a large error between RCM and NAT order.

Fig. 2. The percent relative difference between spmv and BC with the best chunk size, run on inputs ordered with RCM and NAT. (a) REO of spmv; (b) REO of BC.

Overall, both spmv and BC are always faster when the input is ordered with RCM. The difference is also seen when the schedule used is WS, in almost the same pattern as when dynamic is used (not shown). Additionally, the variation in performance based on chunk size is higher when ordered with RCM than with NAT. This variation is partly due to RCM-ordered inputs running faster, so overheads become visible. Additionally, many of the inputs with NAT orderings have a more uniformly random distribution of heavy and light tasks, and a chunk would more likely have both. Therefore, we will use inputs ordered with RCM for the remainder of this section.

Importance of chunk size.
Now we analyze the importance that chunk size has on the performance of our benchmarks. To do so, we use two metrics similar to those in the last subsection. The first analyzes the largest difference that exists due to the chunk size, fixing the input and schedule. We define

REW(i, s, p) = |WorstT(i, RCM, s, p) − BestT(i, RCM, s, p)| / WorstT(i, RCM, s, p) × 100,

and further define Max-REW(s, p) = max_i REW(i, s, p). Likewise, we define Min-REW(s, p) = min_i REW(i, s, p).

Fig. 3. The max and min percent relative difference between the application using the best chunk size and the second best chunk size. (a) REW; (b) REB.

Figure 3a presents Max-REW(s, p) and Min-REW(s, p) for both dynamic (i.e., Min-Dyn and Max-Dyn, respectively) and work-stealing for both spmv and BC. Note that we only run BC up to 24 cores throughout this paper due to issues with scaling. From the figure, we observe that the worst case for dynamic on both spmv and BC is around 50% or higher. This means that just selecting a chunk size without any tuning or thought can greatly influence the runtime of both these applications. On the other hand, for some inputs, the worst case is not as bad for BC as it is for spmv. For example, using 24 cores for BC, there is one input (I1: FullChip) whose percent relative error for time is only about 8%, but the next smallest percent relative error for 24 cores is 37.3% for I6: delaunay_n23.

Though the worst performance is a good argument for why chunk size needs to be tuned, it does not capture the difficulty of tuning. In our experiments, we use a relatively large search space for chunk size (i.e., C). The cost of generating such a large search space just for spmv and BC over the inputs for dynamic and work-stealing was on the order of a workweek of computing time for a fixed input ordering. As such, the search process would not scale for a larger search space. Additionally, one may argue that some intelligence, such as auto-tuning with a line-search algorithm, could be used to determine chunk size. However, this type of method is still expensive and may not provide an optimal chunk size. For example, consider spmv applied with work-stealing to input I9. In this case, the best chunk size on 28 cores is 128, but depending on how the line search is set up, the algorithm may never result in testing chunk size
128, as there are other suboptimal solutions between 128 and 512. Lastly, if this application and input pair is run only a few times, the cost of tuning would greatly outweigh the runtime with an untuned chunk size. To better demonstrate the difference in runtime of an application with even a semi-untuned chunk size, we define

REB(i, s, p) = |SecBT(i, RCM, s, p) − BestT(i, RCM, s, p)| / SecBT(i, RCM, s, p) × 100,

where SecBT is the second best runtime, and further define Max-REB(s, p) = max_i REB(i, s, p). Likewise, we define Min-REB(s, p) the same but with min.

Figure 3b presents these two terms for both dynamic (i.e., Min-Dyn and
Max-Dyn) and work-stealing for both spmv and BC. Even though our chunk size search space is large, we observe that the maximal relative error for time in the best case can be over 20% when dynamic is used as the scheduling method for spmv. For spmv, work-stealing has a small relative error in both the best and worst cases. However, this argument is flipped completely when comparing to BC. This flip further demonstrates how the optimal chunk size is dependent on many parameters.

iCh sensitivity. Similar to dynamic and work-stealing, iCh is sensitive to application and input ordering. However, the only simple parameter in the algorithm for iCh is ε. We experiment with four different values for ε, namely 25%, 33%, 50%, and 66%. In doing so, we notice that the best runtime across all inputs, applications, and core counts tends to occur when ε is either 33% or 50%. When ε = 25%, we observe the most cases with the worst runtimes. At 66%, we have fewer best runtimes than either 33% or 50%, but not as many worst cases as 25%, and the relative difference in time is better in these worst cases. Overall, we suggest using 33% ≤ ε ≤ 50%; 33% does slightly better for spmv, and 50% has one better case than 33% in BC. Comparing the relative error between 33% and 50%, we find the maximal value to be roughly 5%. Compared against REB from the previous subsection, this relative error is better than the Max-WS for larger core counts.
Max speedup.
Here, we evaluate the ability of iCh to speed up an application. For this subsection, we fix the chunk size for dynamic and work-stealing to 128. Though the "optimal" chunk size is dependent on many factors, for most inputs a chunk size of 128 produced an optimal time for dynamic and work-stealing. In particular, for spmv, the best runtime was found with a chunk size of 128 on 28 cores 33% of the time for dynamic and 60% of the time for work-stealing. For BC, the best runtime was found 33.33% of the time for dynamic and 53% of the time for work-stealing with a chunk size of 128 on 24 cores. Additionally, we remind the reader that the goal of iCh is not to be "optimal" or to improve the runtime past what can be achieved by work-stealing if it were tuned. The goal of iCh is to come close to the best performance without having to tune for a chunk size offline beforehand. We use ε = 50% for all these tests, and define Speedup(i, s, c, p) = T(i, RCM, dynamic, 128, 1) / T(i, RCM, s, c, p), where c is 128 when s is dynamic or work-stealing.

In Table 2a, the speedups are presented for spmv for the three different scheduling methods. We note that in most cases iCh provides a speedup about as good as or better than that from either dynamic or work-stealing. In several cases, such as I1 and I3, iCh has the best speedup. We believe that this behavior is an artifact of 128 not being the "optimal" chunk size for dynamic and work-stealing, despite offering the best speedup within the search space. For I1, iCh achieves a better speedup than work-stealing for all chunk sizes tested, but we only tested a finite set. This is opposed to I3, where the best chunk size discovered in the search space is 64 and results in a speedup of 20.×. Again, this speedup is still smaller than iCh's, but we believe this is due to not finding the best chunk size. For I2, we notice that work-stealing can achieve a speedup 1.× better than iCh, which is the largest difference when the chunk size is fixed at 128.

In Table 2b, we observe the speedups for BC. This application is more interesting, as there are more locks and updates that can stall the parallel execution than in spmv. Therefore, BC is expected not to scale well. Overall, iCh still does very well, and the speedup for iCh is smaller in only 4 cases. In these four cases, the difference is very small. However, we notice something interesting. In two cases, i.e., I1 and I5, iCh obtains its maximal speedup using 16 cores rather than 24 cores. In both of these cases, the speedup is worse at 24 cores, yet iCh has a better speedup at 16 cores than dynamic and work-stealing on any core count tested.
We believe that this is an artifact of iCh finding the best chunk size early and the application running out of parallelism at higher core counts. As the application runs out of parallel work, the overhead of iCh shows. However, the speedup for iCh on 24 cores is 2.98 for I1 and 10.3 for I5, which are both close to the best speedup found for work-stealing over the chunk size search space.

iCh optimal bound. Next, we want to bound how "bad" or how far the speedup of iCh is from the best-found speedup of either work-stealing or dynamic over all chunk sizes. In doing so, we fix ε = 50%. We fix this value because we present iCh as an auto-tuning algorithm that does not need user input, even though other values were tested and may provide a better speedup for iCh. We find that iCh is at most 1.× and on average 1.× from the best speedup of either dynamic or work-stealing on spmv. This means that iCh on average has about the same speedup as the best scheduling method tuned over our chunk size collection. For BC, we find that iCh is at most 1.× and on average 1.× from the best speedup of either dynamic or work-stealing. This worst case
This worst case uto Adaptive Irregular OpenMP Loops 11I Schedule p Speedup SBarI1 iCh -50% 28 11.10WS,128 28 10.93Dyn,128 28 8.68I2 iCh -50% 28 17.20WS,128 28 18.11Dyn,128 28 14.22I3 iCh -50% 28 25.13WS,128 24 8.2Dyn,128 28 17.3I4 iCh -50% 28 18.5WS,128 24 8.29Dyn,128 28 19.1I5 iCh -50% 28 20.6WS,128 28 19.75Dyn,128 28 15.99I6 iCh -50% 28 20.7WS,128 28 21.94Dyn,128 28 17.22I7 iCh -50% 28 20.6WS,128 28 8.3Dyn,128 28 16.28I8 iCh -50% 28 18.68WS,128 28 19.08Dyn,128 28 11.78I9 iCh -50% 28 19.26WS,128 28 20.37Dyn,128 28 18.88I10 iCh -50% 28 21.69WS,128 28 10.3Dyn,128 28 14.08I11 iCh -50% 28 21.74WS,128 28 16.1Dyn,128 24 16.34I12 iCh -50% 28 14.98WS,128 24 13.1Dyn,128 28 15.93I13 iCh -50% 28 22.93WS,128 28 22.53Dyn,128 28 14.75I14 iCh -50% 28 22.43WS,128 28 22.44Dyn,128 28 14.3I15 iCh -50% 28 21.33WS,128 28 23.78Dyn,128 28 19.09(a) Speedup spmv I Schedule p Speedup SBarI1 iCh -50% 16 4.1WS,128 24 2.91Dyn,128 24 2.54I2 iCh -50% 24 16.8WS,128 24 15.9Dyn,128 24 10.34I3 iCh -50% 28 17.3WS,128 24 14.6Dyn,128 24 12.15I4 iCh -50% 28 15.9WS,128 24 12.66Dyn,128 24 3.21I5 iCh -50% 16 12.2WS,128 24 9.63Dyn,128 24 7.06I6 iCh -50% 24 16.57WS,128 24 14.05Dyn,128 24 9.75I7 iCh -50% 24 14.3WS,128 24 14.6Dyn,128 24 12.15I8 iCh -50% 24 12.88WS,128 24 11.12Dyn,128 24 7.77I9 iCh -50% 24 20.32WS,128 24 16.23Dyn,128 24 7.55I10 iCh -50% 24 7.99WS,128 24 8.29Dyn,128 24 6.72I11 iCh -50% 24 20.1WS,128 24 21.4Dyn,128 24 14.56I12 iCh -50% 24 16.98WS,128 24 11.56Dyn,128 24 7.6I13 iCh -50% 24 23.11WS,128 24 20.73Dyn,128 24 14.36I14 iCh -50% 24 22.01WS,128 24 21.18Dyn,128 24 18.01I15 iCh -50% 24 20.1WS,128 24 21.75Dyn,128 24 18.01(b) Speedup BC Table 2.
Speedup with iCh , work-stealing (WS), and dynamic (Dyn). I is the input,p is the number of cores, and SBar is a bar graph of the speedup.2 J.D. Booth for BC is much more surprising than that from spmv . However, this number isdriven by one case in which dynamic does extremely well for a chunk size of32, and no other scheduling method and the chunk size can compare. Overall,Table 2b provides a much better average look at the performance for iCh on BC . Work by Yong, Jin, and Zhang [24] add a dynamic history to decide about load-balance on distributed shared memory systems. The adaptive chunk size in thelocal adaptive schedule in our algorithm is an extension of this work. However,we note that we are optimizing for variance and they are for load-balance, andas a result, the inequalities in their classifications are in the opposite direction as iCh . In particular, the older work considered how to keep and update a historyon distributed shared-memory systems, such as KSR-1 and Convex up to 16CPUs, that could have high delays and a different memory system of today’smodern systems. In [1], loops are scheduled in a distributed fashion with MPI.The chunk size is determined by a direct fraction of the cumulative numberof tasks complete and the processor speed. The KASS system [22] considersadaptively chunking together in the static or initialization phase. Chunks inthe second (dynamic phase) are reduced fixedly based on information from pastiteration runs but are not adapted within an iteration like iCh . Chunks are stolenif a queue runs out of its own chunks. A history-aware [13] studies chunk size from past iteration and the number of times the task will be ran using a muchmore complex “best-fit” approximation fit. This proves benefits for loops thatare repeated, but iCh does not consider this as i regular kernels may not repeatloops as in spmv . 
Lastly, BinLPT [15] schedules irregular tasks from a loop using an estimate of the work in each loop iteration and a maximal number of chunks provided by the user. This method is one of the newest and shows good performance in publication. In contrast, iCh aims to provide an easier method that requires neither estimates of loop work nor additional user input.
This work develops an adaptive OpenMP loop scheduler for work-imbalanced irregular applications by adaptively tuning chunk size and using work-stealing. The method uses a force-feedback control system that analyzes an approximation to the variance of the task lengths assigned in a chunk. Though rudimentary, this system has relatively low overhead and allows for performance comparable to fine-tuning over a large collection of chunk sizes for sparse matrix-vector multiplication (spmv) and Betweenness Centrality (BC). In particular, we demonstrate that iCh is on average within 1.× of the best speedup achieved by either the traditional dynamic or work-stealing schedule for spmv when these two schedules are tuned over a relatively large collection of chunk sizes. We also demonstrate that iCh is on average within 1.× of the best speedup achieved by either the dynamic or work-stealing schedule for BC. Additionally, we observe that iCh can reduce the variation in runtime that exists in a work-stealing method that randomly selects its victim.

References
1. Banicescu, I., Velusamy, V., Devaprasad, J.: On the scalability of dynamic scheduling scientific applications with adaptive weighted factoring. Cluster Computing (3), 215–226 (2003)
2. Booth, J.D., Ellingwood, N.D., Thornquist, H.K., Rajamanickam, S.: Basker: Parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. Parallel Comput., 17–31 (2017)
3. Booth, J.D., Kotra, J., Zhao, H., Kandemir, M., Raghavan, P.: Phase detection with hidden Markov models for DVFS on many-core processors. In: 2015 IEEE 35th International Conference on Distributed Computing Systems. IEEE (Jun 2015). https://doi.org/10.1109/icdcs.2015.27
4. Brandes, U.: A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology (2), 163–177 (Jun 2001). https://doi.org/10.1080/0022250x.2001.9990249
5. Brandes, U.: On variants of shortest-path betweenness centrality and their generic computation. Social Networks (2), 136–145 (May 2008). https://doi.org/10.1016/j.socnet.2007.11.001
6. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proceedings of the 1969 24th National Conference. pp. 157–172. ACM '69, Association for Computing Machinery, New York, NY, USA (1969). https://doi.org/10.1145/800195.805928
7. Dagum, L., Menon, R.: OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE (1), 46–55 (1998)
8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. (1), 1:1–1:25 (2011)
9. Durand, M., Broquedis, F., Gautier, T., Raffin, B.: An efficient OpenMP loop scheduler for irregular applications on large-scale NUMA machines. In: OpenMP in the Era of Low Power Devices and Accelerators, pp. 141–155. Springer Berlin Heidelberg (2013). https://doi.org/10.1007/978-3-642-40698-0_11
10. Frasca, M., Madduri, K., Raghavan, P.: NUMA-aware graph mining techniques for performance and energy efficiency. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE (Nov 2012). https://doi.org/10.1109/sc.2012.81
11. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation - PLDI '98. ACM Press (1998). https://doi.org/10.1145/277650.277725
12. Kabir, H., Booth, J.D., Raghavan, P.: A multilevel compressed sparse row format for efficient sparse computations on multicore processors. In: 2014 21st International Conference on High Performance Computing (HiPC). IEEE (Dec 2014). https://doi.org/10.1109/hipc.2014.7116882
13. Kejariwal, A., Nicolau, A., Polychronopoulos, C.D.: History-aware self-scheduling. In: 2006 International Conference on Parallel Processing (ICPP'06). pp. 185–192 (2006)
14. Madduri, K., Ediger, D., Jiang, K., Bader, D.A., Chavarria-Miranda, D.: A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In: 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE (May 2009). https://doi.org/10.1109/ipdps.2009.5161100
15. Penna, P.H., Gomes, A.T.A., Castro, M., Plentz, P.D.M., Freitas, H.C., Broquedis, F., Méhaut, J.F.: A comprehensive performance evaluation of the BinLPT workload-aware loop scheduler. Concurrency and Computation: Practice and Experience (18), e5170 (2019)
16. Pinar, A., Heath, M.T.: Improving performance of sparse matrix-vector multiplication. In: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing '99. ACM Press (1999). https://doi.org/10.1145/331532.331562
17. Schmidl, D., Cramer, T., Wienke, S., Terboven, C., Müller, M.S.: Assessing the performance of OpenMP programs on the Intel Xeon Phi. In: Proceedings of the 19th International Conference on Parallel Processing. pp. 547–558. Euro-Par'13, Springer-Verlag, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40047-6_56
18. Schmidl, D., Philippen, P., Lorenz, D., Rössel, C., Geimer, M., an Mey, D., Mohr, B., Wolf, F.: Performance analysis techniques for task-based OpenMP applications. In: Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World. pp. 196–209. IWOMP'12, Springer-Verlag, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30961-8_15
19. Toledo, S.: Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development (6), 711–725 (Nov 1997). https://doi.org/10.1147/rd.416.0711
20. Towns, J., Cockerill, T., Dahan, M., Foster, I., Gaither, K., Grimshaw, A., Hazlewood, V., Lathrop, S., Lifka, D., Peterson, G.D., Roskies, R., Scott, J.R., Wilkins-Diehr, N.: XSEDE: Accelerating scientific discovery. Computing in Science & Engineering (5), 62–74 (Sept-Oct 2014)
21. Vuduc, R.W., Moon, H.: Fast sparse matrix-vector multiplication by exploiting variable block structure. Tech. rep. (Jul 2005). https://doi.org/10.2172/891708
22. Wang, Y., Ji, W., Shi, F., Zuo, Q., Deng, N.: Knowledge-based adaptive self-scheduling. In: Park, J.J., Zomaya, A.Y., Yeo, S., Sahni, S. (eds.) Network and Parallel Computing, 9th IFIP International Conference, NPC 2012, Gwangju, Korea, September 6-8, 2012. Proceedings. Lecture Notes in Computer Science, vol. 7513, pp. 22–32. Springer (2012). https://doi.org/10.1007/978-3-642-35606-3_3
23. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for multicore architectures. Commun. ACM (4), 65–76 (Apr 2009). https://doi.org/10.1145/1498765.1498785
24. Yan, Y., Jin, C., Zhang, X.: Adaptively scheduling parallel loops in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems 8