Supporting OpenMP 5.0 Tasks in hpxMP -- A study of an OpenMP implementation within Task Based Runtime Systems
Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser
Center for Computation & Technology, Louisiana State University, Baton Rouge, LA
{tzhan18,sshirz1,bwagle3}@lsu.edu, {aserio,pdiehl,hkaiser}@cct.lsu.edu

February 20, 2020

Abstract
OpenMP has been the de facto standard for single-node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing applications. One of the major challenges of this new paradigm is the incompatibility of the OpenMP thread model with other AMTs. Highly optimized OpenMP-based libraries do not perform well when coupled with AMTs because the threading of both libraries competes for resources. This paper is a follow-up to the paper on the fundamental implementation of hpxMP, an implementation of the OpenMP standard which utilizes the C++ standard library for Parallelism and Concurrency (HPX) to schedule and manage tasks [1]. In this paper, we present the implementation of task features of the OpenMP 5.0 standard, e.g. taskgroup, task depend, and task_reduction, and the optimization of the associated pragmas. We use the daxpy benchmark, the Barcelona OpenMP Tasks Suite, the Parallel Research Kernels, and OpenBLAS benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP. We conclude that hpxMP is one possibility to overcome the competition for resources between the different thread models by providing a subset of the OpenMP features using HPX threads. However, the overall performance of hpxMP is not yet comparable with legacy libraries, which are highly optimized for a different programming paradigm and have been optimized over a decade by many contributors and compiler vendors.

Keywords: OpenMP, hpxMP, Asynchronous Many-task Systems, C++, clang, gcc, HPX
Introduction

Asynchronous many-task (AMT) systems have emerged as a new programming paradigm in the high performance computing (HPC) community [2]. Many of these applications would benefit from the highly optimized OpenMP-based linear algebra libraries currently available, e.g. Eigen, Blaze, and Intel MKL. However, there is a gap between OpenMP and AMT systems, since the user-level threads of the AMT systems interfere with the system threads of OpenMP, preventing efficient execution of the application. To close this gap, this paper introduces hpxMP, an implementation of the OpenMP standard which utilizes a C++ standard library for parallelism and concurrency (HPX) [3] to schedule and manage tasks. hpxMP implements standard-conforming OpenMP pragmas [1]. Furthermore, the OpenMP standard has, since OpenMP 3.0 [4], begun to introduce task-based concepts such as dependent tasks and task groups (OpenMP 4.0 [5]), task loops (OpenMP 4.5 [6]), and detached tasks (OpenMP 5.0 [7]). This work extends hpxMP with the task-related pragmas to provide the concept of task-based programming and validates its implementation against the Barcelona OpenMP Tasks Suite. Next, hpxMP is validated against the daxpy benchmark, OpenBLAS, and the Parallel Research Kernels benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP.

This paper is structured as follows: Section 2 covers the related work. Section 3 briefly introduces the features of HPX utilized for the implementation of tasks within hpxMP. Section 4 emphasizes the implementation of OpenMP tasks within the HPX framework and shows the subset of OpenMP standard features implemented within hpxMP. Section 5 shows the comparison of the different OpenMP implementations and, finally, Section 6 concludes the work.
Related Work

Multi-threading solutions
For the exploitation of shared-memory parallelism on multi-core processors, many solutions are available and have been studied intensively. The most language-independent one is the POSIX thread execution model [8], which exposes fine-grained parallelism. In addition, there are more abstract library solutions like Intel's Threading Building Blocks (TBB) [9], Microsoft's Parallel Pattern Library (PPL) [10], and Kokkos [11]. TBB provides task parallelism using a C++ template library. PPL provides, in addition, features like parallel algorithms and containers. Kokkos provides a common C++ interface for parallel programming models, like CUDA and pthreads. Programming languages such as Chapel [12] provide parallel data and task abstractions. Cilk [13] extends the C/C++ language with parallel loop and fork-join constructs to provide single-node parallelism. For a very detailed review we refer to [2].

With the OpenMP 3.0 standard [4] the concept of task-based programming was added. The OpenMP 3.1 standard [14] introduced task optimizations. Dependent tasks and task groups, provided by the OpenMP 4.0 standard [5], improved the synchronization of tasks. The OpenMP 4.5 standard [6] allows task-based loops, and the OpenMP 5.0 standard [7] allows detaching of tasks.
Integration of Multi-threading solutions within distributed programming models.
Some major research has been done in the area of MPI+X [15–18], where the Message Passing Interface (MPI) is used as the distributed programming model and OpenMP as the multi-threaded shared-memory programming model. However, less research has been done for AMT+X, where asynchronous many-task systems are used as the distributed programming model and OpenMP as the shared-memory programming model. Charm++ integrated OpenMP's shared-memory parallelism into its distributed programming model to improve load balancing [19]. Kstar, a research C/C++ OpenMP compiler [20], was utilized to generate code compatible with StarPU [21] and Kaapi [22]. Only Kaapi implements the full set of the OpenMP specification [23], such as the capability to create tasks in the context of a task region. The successor XKaapi [24] provides a C++ task-based interface for both multi-core and multi-GPU systems, and only Kaapi provides multi-cluster support.
HPX

HPX is an open source C++ standard-conforming library for parallelism and concurrency for applications of any scale. One specialty of HPX is that its API offers a syntactically and semantically equivalent interface for local and remote function calls. HPX incorporates well-known concepts, e.g. static and dynamic dataflows, fine-grained futures-based synchronization, and continuation-style programming [25]. For more details we refer to [25–31].

After this brief overview, let us look into the relevant parts of HPX in the context of this work. HPX is an implementation of the ParalleX execution model [32] and generates hundreds of millions of so-called light-weight HPX threads (tasks). These light-weight threads provide fast context switching [3] and low per-thread overheads, which makes it feasible to schedule a large number of tasks with negligible overhead [33]. However, these light-weight HPX threads are not compatible with the user threads utilized by the current OpenMP implementations. For more details we refer to our previous paper [1]. This work adds support for the OpenMP 5.0 task features to hpxMP. A light-weight, HPX-thread-based implementation of the OpenMP standard enables applications that already use HPX for local and distributed computations to integrate highly optimized OpenMP-based libraries in the future.

Implementation

hpxMP is an implementation of the OpenMP standard which utilizes a C++ standard library for parallelism and concurrency (HPX) [3] to schedule and manage threads and tasks. We have described the fundamental implementation of hpxMP in previous work [1]. This section addresses the implementation of a few important classes in hpxMP and of the task features of the OpenMP 5.0 standard [7], such as taskgroup, task depend, and task_reduction, together with the optimizations for thread and task synchronization, which are the new contributions of this work.

An instance of the omp_task_data class is associated with each HPX thread by calling hpx::threads::set_thread_data. Instances of omp_task_data are passed as a raw pointer which is reinterpret_cast to size_t. For better memory management, a smart pointer, boost::intrusive_ptr, is introduced to wrap omp_task_data. The class omp_task_data holds the information describing a thread, such as a pointer to the current team, a taskLatch for synchronization, and whether the task is in a taskgroup. The omp_task_data can be retrieved by calling hpx::threads::get_thread_data when needed, which plays an important role in the hpxMP runtime. Another important class is parallel_region, containing the information of a team, such as the teamTaskLatch for task synchronization, the number of threads requested under the parallel region, and the depth of the current team.

Explicit tasks are created using the task construct in hpxMP. hpxMP implements the most recent OpenMP 5.0 tasking features and synchronization constructs, like task, taskyield, and taskwait. The supported clauses associated with the task construct are reduction, untied, private, firstprivate, shared, and depend. HPX threads are created for the task directives, and the tasks run on these newly created HPX threads. __kmpc_omp_task_alloc allocates and initializes a task and then returns the generated task to the runtime. __kmpc_omp_task is called with the generated task as a parameter, which is passed on to hpx_runtime::create_task. The tasks then run as normal-priority HPX threads created by calling the function hpx::applier::register_thread_nullary, see Listing 1. Synchronization in the tasking implementation of hpxMP is handled with the HPX latch, which will be discussed later in Section 4.3.

Task dependencies were introduced with OpenMP 4.0. With the depend clause, users specify dependencies that must be satisfied between tasks. In the implementation, futures in HPX are employed: hpx::future allows the separation of the initiation of an operation from the waiting for its result. A list of the tasks that the current task depends on is stored in a vector.
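The mapping of the depend clause onto hpx::future can be illustrated with a minimal sketch. This is not the hpxMP source: the map future_map, the helper schedule_task_with_deps, the restriction to in/out dependencies, and the missing locking are simplifications made purely for illustration.

    #include <hpx/include/lcos.hpp>   // hpx::future, hpx::shared_future, hpx::when_all
    #include <functional>
    #include <map>
    #include <vector>

    // Last producer of each address; later tasks depending on that address wait on
    // the stored future. Locking and the inout case are omitted in this sketch.
    std::map<void*, hpx::shared_future<void>> future_map;

    hpx::shared_future<void> schedule_task_with_deps(std::function<void()> task_body,
        std::vector<void*> const& in_deps, void* out_dep)
    {
        // Collect the futures of all tasks the new task depends on.
        std::vector<hpx::shared_future<void>> deps;
        for (void* addr : in_deps)
        {
            auto it = future_map.find(addr);
            if (it != future_map.end())
                deps.push_back(it->second);
        }

        // Run the task body only after all collected dependencies are ready.
        hpx::shared_future<void> done =
            hpx::when_all(deps)
                .then([task_body](hpx::future<std::vector<hpx::shared_future<void>>>) {
                    task_body();
                })
                .share();

        if (out_dep != nullptr)        // record this task as the last producer of out_dep
            future_map[out_dep] = done;
        return done;
    }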
Listing 1: Implementation of task scheduling in hpxMP

    kmp_task_t *__kmpc_omp_task_alloc(...)
    {
        kmp_task_t *task = (kmp_task_t *) new char[task_size + sizeof_shareds];
        // lots of initialization goes on here
        return task;
    }

    void hpx_runtime::create_task(kmp_routine_entry_t task_func, int gtid, intrusive_ptr ...
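Based on the description above, the following sketch shows one way create_task can register a task as an HPX thread: the omp_task_data of the creating thread is recovered via hpx::threads::get_thread_data, the team's latch is counted up before the task is spawned and counted down when it finishes, and hpx::applier::register_thread_nullary runs the task on an HPX thread (normal priority by default). The stand-in struct definitions and the exact parameter list are assumptions for illustration, not the actual hpxMP code.

    #include <hpx/hpx.hpp>

    // Minimal stand-ins for the hpxMP classes described in the text (assumption).
    struct parallel_region { hpx::lcos::local::latch teamTaskLatch{1}; };
    struct omp_task_data   { parallel_region* team = nullptr; };
    using kmp_routine_entry_t = int (*)(int gtid, void* task);

    void create_task(kmp_routine_entry_t task_func, int gtid, void* task)
    {
        // omp_task_data was attached to the creating HPX thread as a pointer
        // reinterpret_cast'ed to size_t (see the description above).
        auto* parent = reinterpret_cast<omp_task_data*>(
            hpx::threads::get_thread_data(hpx::threads::get_self_id()));

        // Account for the new task before spawning it, so that waiting constructs
        // know there is outstanding work.
        parent->team->teamTaskLatch.count_up(1);

        // Run the task on a newly created HPX thread (normal priority by default).
        hpx::applier::register_thread_nullary(
            [=]() {
                task_func(gtid, task);                      // execute the user task
                parent->team->teamTaskLatch.count_down(1);  // signal completion
            },
            "omp_explicit_task");
    }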
This section summarizes the previously presented features of the OpenMP standard implemented within hpxMP. Table 1 shows the directives provided by the program layer, which correspond to the main part of the presented library. Table 2 shows the runtime library functions of the OpenMP standard provided by hpxMP. Of course, the pragmas and runtime library functions implemented so far cover only a subset of the OpenMP standard.
Listing 3: Definition of the Latch class in HPX

    class Latch
    {
    public:
        void count_down_and_wait();
        void count_down(std::ptrdiff_t n);
        bool is_ready() const noexcept;
        void wait() const;
        void count_up(std::ptrdiff_t n);
        void reset(std::ptrdiff_t n);

    protected:
        mutable util::cache_line_data ...
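To make the use of this interface concrete, the sketch below joins the HPX threads of a parallel region with the Latch from Listing 3: the latch is reset to the number of spawned threads plus one for the master, each worker performs a single count_down when it finishes, and the master calls count_down_and_wait. Only the Latch interface is taken from the listing; the surrounding function, its parameters, and the thread body are illustrative assumptions.

    // Illustrative join of the HPX threads of a parallel region (not hpxMP code);
    // the latch is owned by the caller, e.g. as a member of the team object.
    void join_parallel_region(int num_threads, Latch& barrier,
                              std::function<void(int)> body)
    {
        barrier.reset(num_threads + 1);            // one count per worker plus the master

        for (int i = 0; i < num_threads; ++i)
        {
            hpx::applier::register_thread_nullary(
                [i, &barrier, body]() {
                    body(i);                       // work of the parallel region
                    barrier.count_down(1);         // a single atomic decrement per HPX thread
                },
                "omp_implicit_task");
        }

        barrier.count_down_and_wait();             // master waits in user space for all workers
    }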
omp_get_dynamic       omp_get_max_threads
omp_get_num_procs     omp_get_num_threads
omp_get_thread_num    omp_get_wtick
omp_get_wtime         omp_in_parallel
omp_init_lock         omp_init_nest_lock
omp_set_dynamic       omp_set_lock
omp_set_nest_lock     omp_set_num_threads
omp_test_lock         omp_test_nest_lock
omp_unset_lock        omp_unset_nest_lock
Table 2: Runtime library functions of the OpenMP standard implemented in hpxMP.
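As an illustration of this runtime layer, the small program below only uses pragmas and functions listed above; it is intended to behave identically whether it is linked against llvm-OpenMP, GOMP, or hpxMP (the build and link mechanics for hpxMP are not shown here).

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_num_threads(4);          /* request four threads */
        double t0 = omp_get_wtime();

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("threads: %d, in parallel: %d\n",
                       omp_get_num_threads(), omp_in_parallel());
        }

        printf("elapsed: %f s (timer tick: %g s)\n",
               omp_get_wtime() - t0, omp_get_wtick());
        return 0;
    }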
Benchmarks

Daxpy

In order to compare the performance of the parallel for pragma, a fundamental pragma in OpenMP, the daxpy benchmark is used in this measurement. Daxpy multiplies a scalar c with a dense vector a of 32-bit floating point numbers, adds the result to another dense vector b (also 32-bit floats), and stores the result in the same vector b, i.e. b = c · a + b, where c ∈ R and a, b ∈ R^n. The daxpy benchmark reports its performance in Mega Floating Point Operations Per Second (MFLOP/s). We determine the speedup of the application by scaling our results to the single-threaded run of the benchmark using hpxMP.

Figure 1 shows the speedup ratio for different numbers of threads. Our first experiment compared the performance of the OpenMP implementations for the vector size of Figure 1d: llvm-OpenMP runs the fastest, followed by GOMP and hpxMP. Figure 1c shows that, for its vector size, GOMP and llvm-OpenMP are still able to exhibit some scaling while hpxMP struggles to scale beyond a moderate number of threads. For the two very large vector sizes, the three implementations perform almost identically. hpxMP is able to scale in these scenarios because there is sufficient work in each task to amortize the cost of the task management overheads.

DGEMM

We chose the DGEMM benchmark from the Parallel Research Kernels [34] (https://github.com/ParRes/Kernels) to test our implementation. The purpose of the DGEMM program is to test the performance of a dense matrix multiplication. The DGEMM benchmark reports its performance as execution time in seconds. Figure 2 shows the execution time for different numbers of threads. Figure 2a shows that hpxMP and llvm-OpenMP perform similarly, while both outperform GOMP. Figure 2b shows that, for its matrix size, GOMP and llvm-OpenMP are still able to exhibit some scaling while hpxMP struggles to scale and is slower than GOMP beyond 8 threads.

Barcelona OpenMP Tasks Suite

We chose the fast parallel sorting variation of the ordinary mergesort [35] from the Barcelona OpenMP Tasks Suite to test our implementation of tasks. We sorted a random array for a range of cut off values. The cut off value determines when to perform a serial quicksort instead of recursively dividing the array into portions for which new tasks are created. Higher cut off values create larger tasks and, therefore, fewer tasks. To simplify the experiment, the parallel merge is disabled and the threshold for insertion sort is fixed in this benchmark. For each cut off value, the execution time of hpxMP using one thread is selected as the base point to calculate the speedup values. Figure 3 shows the speedup ratio for different numbers of threads.

For the cut off value of Figure 3a, the array is divided into four sections and four tasks in total are created. The speedup curve increases rapidly for small numbers of threads, but no significant additional speedup is achieved with more threads in any of the three implementations. hpxMP and llvm-OpenMP show comparable performance while GOMP is slower. The cut off value of Figure 3b increases the number of tasks generated. In this case, llvm-OpenMP has a performance advantage while hpxMP and GOMP show comparable performance.
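The role of the cut off value can be illustrated with a simplified task-based sort. This is not the benchmark source: the actual benchmark splits the array into four parts and uses its own serial quicksort and merge, whereas the binary split, std::sort, and std::inplace_merge below are simplifications for illustration.

    #include <algorithm>
    #include <cstddef>

    // Simplified task-based sort: below the cut off the range is sorted serially,
    // otherwise two tasks are created for the halves and merged after a taskwait.
    // Call from inside "#pragma omp parallel" followed by "#pragma omp single".
    void task_sort(float* data, std::size_t n, std::size_t cut_off)
    {
        if (n <= cut_off)
        {
            std::sort(data, data + n);    // serial fallback: fewer but larger tasks
            return;
        }
        std::size_t half = n / 2;
        #pragma omp task
        task_sort(data, half, cut_off);
        #pragma omp task
        task_sort(data + half, n - half, cut_off);
        #pragma omp taskwait
        std::inplace_merge(data, data + half, data + n);   // serial merge
    }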
Figure 1: Scaling plots for the daxpy benchmark running with different vector sizes: (a), (b), (c), and (d). Larger vector sizes mean larger tasks are created. The speedup is calculated by scaling the execution time of a run by the execution time of the single-threaded run of hpxMP. A larger speedup factor means a smaller execution time of the sample.

Figure 2: Scaling plots for the Parallel Research Kernels DGEMM benchmark run with vector sizes of (a) 1000 and (b) 100, plotting the execution time (s) against the number of threads for hpxMP, llvm-OpenMP, and GOMP. A smaller execution time means better performance of the sample.
Figure 3: Scaling plots for the Barcelona OpenMP Tasks Suite sort benchmark run with four different cut off values: (a), (b), (c), and (d). Higher cut off values indicate that a smaller number of larger tasks is created. The speedup is calculated by scaling the execution time of a run by the execution time of the single-threaded run of hpxMP.

For the cut off value of Figure 3c, llvm-OpenMP shows a distinct performance advantage over hpxMP and GOMP. Nevertheless, hpxMP still scales across all the threads, while GOMP ceases to scale beyond a certain number of threads. For the cut off value of Figure 3d, a significant number of tasks is created and the work for each task is considerably small. Here, hpxMP does not scale due to the large amount of overhead associated with the creation of many user tasks. Because each task performs little work, the overhead that it creates is not amortized by the increase in concurrency.

For a global view, the speedup ratio r is shown in Figure 4, where the larger the heatmap value, the better the performance the OpenMP implementation achieved in comparison to hpxMP. Values below one mean that hpxMP outperforms the OpenMP implementation. As shown in the heatmap, llvm-OpenMP works best when the task granularity is small and the number of tasks created is high. GOMP is slower than the other two implementations in most cases. For large task sizes, hpxMP is comparable with llvm-OpenMP (Figure 4a). This result demonstrates that when the grain size of the tasks is chosen well, hpxMP does not incur a performance penalty. Here, more research has to be done on how hpxMP can handle task granularity and limit the overhead in task management for small grain sizes. Some related work can be found in [23, 24, 36–38].

Blaze

In this section, the dmatdmatadd benchmark from Blaze's benchmark suite is rerun to demonstrate the recent improvements in performance when compared to the authors' previous work [1]. Blaze [39] is a high performance C++ linear algebra library which can use different backends for parallelization. It also provides a benchmark suite, called Blazemark (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks), for comparing the performance of several linear algebra libraries, as well as of the different backends used by Blaze, for a selection of arithmetic operations. The results obtained from dmatdmatadd are presented, and graphs are shown for four specific numbers of cores. The series in the graphs are obtained by running the benchmark with llvm-OpenMP, an older version of hpxMP, and the current state of hpxMP [1].
Figure 4: Speedup ratio r of the Barcelona OpenMP Tasks Suite sort benchmark over several thread counts and cut off values for (a) llvm-OpenMP/hpxMP and (b) GOMP/hpxMP. Values greater than one mean that the OpenMP implementation achieved better performance than hpxMP; values below one indicate that hpxMP outperformed the OpenMP implementation.

Figure 5: Scaling plots (MFLOP/s against matrix size n) for the dmatdmatadd benchmark compared to llvm-OpenMP and the previous work of the authors [1] for different numbers of threads: (a), (b), (c), and (d).
The dense matrix addition benchmark (dmatdmatadd) adds two dense matrices A and B, where A, B ∈ R^{n×n}, and writes the result into the matrix C ∈ R^{n×n}. The operation can be written as C[i,j] = A[i,j] + B[i,j]. Blaze uses a threshold on the number of elements before parallelizing the operation; matrices with fewer elements than this threshold are added sequentially. Figure 5 demonstrates the new scaling results for the dmatdmatadd benchmark for the four different core counts. We observe a notable improvement in performance between the previous version of hpxMP and the current version. The performance now more closely mimics that of llvm-OpenMP.

With the results presented above, we showed that hpxMP has similar performance to llvm-OpenMP for larger input sizes. For some specific cases hpxMP was even faster than llvm-OpenMP. This occurs because the operation of joining HPX threads at the end of a parallel region introduces less overhead than the corresponding operation in llvm-OpenMP. Joining the HPX threads is now done with a latch which is executed in user space; the cost of the operation amounts to a single atomic decrement per spawned HPX thread. llvm-OpenMP, however, uses kernel threads and therefore must wait for the operating system to join the participating threads.

For smaller input sizes, however, hpxMP is less performant, as the overheads introduced by the HPX scheduler are more significant compared to the actual workload. HPX threads require their own stack segment because HPX threads are allowed to be suspended. OpenMP does not incur this overhead as launched tasks cannot be suspended. In this way, the llvm-OpenMP implementation produces fewer scheduling overheads.
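For reference, the dmatdmatadd operation measured in this section corresponds to the following Blaze expression. This is a sketch of the operation itself, not the Blazemark driver, and the initial values are arbitrary.

    #include <blaze/Math.h>
    #include <cstddef>

    // C = A + B on n-by-n dense matrices; Blaze parallelizes the assignment with its
    // configured shared-memory backend once the operands exceed its size threshold.
    void dmatdmatadd(std::size_t n)
    {
        blaze::DynamicMatrix<double> A(n, n, 1.0), B(n, n, 2.0), C(n, n);
        C = A + B;
    }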
Conclusion and Outlook

This work extended hpxMP, an implementation of the OpenMP standard utilizing the light-weight user-level threads of HPX, with a subset of the task features of the OpenMP 5.0 standard. This contribution is one step towards compatibility between OpenMP and AMT systems, as it demonstrates a technique that enables AMT systems to leverage highly optimized OpenMP libraries. For the Barcelona OpenMP Tasks Suite benchmark, hpxMP exhibited similar performance when compared to the other OpenMP runtimes for large task sizes. However, it was not able to compete with these runtimes when faced with small task sizes. This performance decrement arises from the more general-purpose threads created in HPX. For the parallel for pragma, hpxMP has similar performance for larger input sizes; by using the HPX latch, the performance could be improved. These results show that hpxMP provides a way of bridging the compatibility gap between OpenMP and AMTs with acceptable performance for larger input sizes or larger task sizes.

In the future, we plan to improve the performance for smaller input sizes by adding non-suspending threads to HPX, which do not require a stack and thus reduce the overhead of thread creation and management. Additionally, we plan to test the performance of HPX applications which use legacy OpenMP libraries, e.g. Intel MKL. However, for this, more of the OpenMP specification needs to be implemented within hpxMP. These experiments will serve as further validation of the techniques introduced in this paper.
Acknowledgment
The work on hpxMP is funded by the National Science Foundation (award ) and the Department of Defense (DTIC Contract FA8075-14-D-0002/0007). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
A Source code
The source code of hpxMP [40] and of HPX is available on GitHub (https://github.com/STEllAR-GROUP/hpxMP, https://github.com/STEllAR-GROUP/hpx), released under the Boost Software License (BSL-1.0).

References

[1] T. Zhang, S. Shirzad, P. Diehl, R. Tohid, W. Wei, and H. Kaiser, “An introduction to hpxMP: A modern OpenMP implementation leveraging HPX, an asynchronous many-task system,” in Proceedings of the International Workshop on OpenCL, ser. IWOCL'19. New York, NY, USA: ACM, 2019, pp. 13:1–13:10. [Online]. Available: http://doi.acm.org/10.1145/3318170.3318191
[2] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan et al., “A taxonomy of task-based parallel programming technologies for high-performance computing,”
The Journal of Supercomputing
Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1
Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014, Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
[12] B. L. Chamberlain, D. Callahan, and H. P. Zima, “Parallel Programmability and the Chapel Language,” International Journal of High Performance Computing Applications (IJHPCA), vol. 21, no. 3, pp. 291–312, 2007, https://dx.doi.org/10.1177/1094342007078442.
[13] C. E. Leiserson, “The Cilk++ concurrency platform,” in DAC '09: Proceedings of the 46th Annual Design Automation Conference
Computer, vol. 49, no. 8, p. 10, 2016.
[16] R. F. Barrett, D. T. Stark, C. T. Vaughan, R. E. Grant, S. L. Olivier, and K. T. Pedretti, “Toward an evolutionary task parallel integrated MPI+X programming model,” in Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 2015, pp. 30–39.
[17] R. Rabenseifner, G. Hager, and G. Jost, “Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes,” in . IEEE, 2009, pp. 427–436.
[18] L. Smith and M. Bull, “Development of mixed mode MPI/OpenMP applications,” Scientific Programming, vol. 9, no. 2-3, pp. 83–98, 2001.
[19] S. Bak, H. Menon, S. White, M. Diener, and L. Kale, “Integrating OpenMP into the Charm++ programming model,” in Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware. ACM, 2017, p. 4.
[20] E. Agullo, O. Aumage, B. Bramas, O. Coulaud, and S. Pitoiset, “Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method,”
IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 10, pp. 2794–2807, 2017.
[21] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: a unified platform for task scheduling on heterogeneous multicore architectures,” Concurrency and Computation: Practice and Experience, vol. 23, no. 2, pp. 187–198, 2011.
[22] T. Gautier, X. Besseron, and L. Pigeon, “Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors,” in Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, 2007, pp. 15–23.
[23] F. Broquedis, T. Gautier, and V. Danjean, “Libkomp, an efficient OpenMP runtime system for both fork-join and data flow paradigms,” in International Workshop on OpenMP. Springer, 2012, pp. 102–115.
[24] T. Gautier, J. V. Lima, N. Maillard, and B. Raffin, “XKaapi: A runtime system for data-flow task programming on heterogeneous architectures,” in . IEEE, 2013, pp. 1299–1308.
[25] H. Kaiser, T. Heller, B. A. Lelbach, A. Serio, and D. Fey, “HPX: A Task Based Programming Model in a Global Address Space,” in
Proceedings of the International Conference on Partitioned Global Address Space Programming Models (PGAS), ser. art. id 6, 2014, https://stellar.cct.lsu.edu/pubs/pgas14.pdf.
[26] T. Heller, H. Kaiser, and K. Iglberger, “Application of the ParalleX Execution Model to Stencil-Based Problems,” Computer Science - Research and Development, vol. 28, no. 2-3, pp. 253–261, 2012, https://stellar.cct.lsu.edu/pubs/isc2012.pdf.
[27] T. Heller, H. Kaiser, A. Schäfer, and D. Fey, “Using HPX and LibGeoDecomp for Scaling HPC Applications on Heterogeneous Supercomputers,” in Proceedings of the ACM/IEEE Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA, SC Workshop), ser. art. id 1, 2013, https://stellar.cct.lsu.edu/pubs/scala13.pdf.
[28] H. Kaiser, T. Heller, D. Bourgeois, and D. Fey, “Higher-level Parallelization for Local and Distributed Asynchronous Task-based Programming,” in Proceedings of the ACM/IEEE International Workshop on Extreme Scale Programming Models and Middleware (ESPM, SC Workshop), 2015, pp. 29–37, https://stellar.cct.lsu.edu/pubs/executors espm2 2015.pdf.
[29] H. Kaiser, B. A. L. aka wash, T. Heller, A. Berg, J. Biddiscombe, and M. S. et al., “STEllAR-GROUP/hpx: HPX V1.2.0: The C++ Standards Library for Parallelism and Concurrency,” Nov. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1484427
[30] T. Heller, H. Kaiser, P. Diehl, D. Fey, and M. A. Schweitzer, “Closing the performance gap with modern C++,” in High Performance Computing, ser. Lecture Notes in Computer Science, M. Taufer, B. Mohr, and J. M. Kunkel, Eds., vol. 9945. Springer International Publishing, 2016, pp. 18–31.
[31] B. Wagle, S. Kellar, A. Serio, and H. Kaiser, “Methodology for adaptive active message coalescing in task based runtime systems,” in . IEEE, 2018, pp. 1133–1140.
[32] H. Kaiser, M. Brodowicz, and T. Sterling, “ParalleX: An advanced parallel execution model for scaling-impaired applications,” in
Parallel Processing Workshops. Los Alamitos, CA, USA: IEEE Computer Society, 2009, pp. 394–401.
[33] B. Wagle, M. A. H. Monil, K. Huck, A. D. Malony, A. Serio, and H. Kaiser, “Runtime adaptive task inlining on asynchronous multitasking runtime systems,” in Proceedings of the 48th International Conference on Parallel Processing, ser. ICPP 2019. New York, NY, USA: ACM, 2019, pp. 76:1–76:10. [Online]. Available: http://doi.acm.org/10.1145/3337821.3337915
[34] R. F. Van der Wijngaart and T. G. Mattson, “The parallel research kernels,” in , Sep. 2014, pp. 1–6.
[35] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without memory conflicts,” IEEE Transactions on Computers, vol. 100, no. 11, pp. 1367–1369, 1987.
[36] T. Gautier, C. Pérez, and J. Richard, “On the impact of OpenMP task granularity,” in International Workshop on OpenMP. Springer, 2018, pp. 205–221.
[37] H. Vandierendonck, G. Tzenakis, and D. S. Nikolopoulos, “A unified scheduler for recursive and task dataflow parallelism,” in . IEEE, 2011, pp. 1–11.
[38] J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. M. Badia, E. Ayguade, and J. Labarta, “Productive cluster programming with OmpSs,” in