Supporting OpenMP 5.0 Tasks in hpxMP -- A study of an OpenMP implementation within Task Based Runtime Systems
Tianyi Zhang, Shahrzad Shirzad, Bibek Wagle, Adrian S. Lemoine, Patrick Diehl, Hartmut Kaiser
Center for Computation & Technology, Louisiana State University, Baton Rouge, LA
{tzhan18,sshirz1,bwagle3}@lsu.edu, {aserio,pdiehl,hkaiser}@cct.lsu.edu

February 20, 2020

Abstract
OpenMP has been the de facto standard for single-node parallelism for more than a decade. Recently, asynchronous many-task runtime (AMT) systems have increased in popularity as a new programming paradigm for high performance computing applications. One of the major challenges of this new paradigm is the incompatibility of the OpenMP thread model with other AMTs. Highly optimized OpenMP-based libraries do not perform well when coupled with AMTs because the threading of both libraries competes for resources. This paper is a follow-up to the paper on the fundamental implementation of hpxMP, an implementation of the OpenMP standard which utilizes the C++ standard library for Parallelism and Concurrency (HPX) to schedule and manage tasks [1]. In this paper, we present the implementation of task features of the OpenMP 5.0 standard, e.g. taskgroup, task depend, and task_reduction, and the optimization of the associated pragmas. We use the daxpy benchmark, the Barcelona OpenMP Tasks Suite, the Parallel Research Kernels, and OpenBLAS benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP. We conclude that hpxMP is one possibility to overcome the competition for resources between the different thread models by providing a subset of the OpenMP features using HPX threads. However, the overall performance of hpxMP is not yet comparable with legacy libraries, which are highly optimized for a different programming paradigm and have been optimized over a decade by many contributors and compiler vendors.

Keywords: OpenMP, hpxMP, Asynchronous Many-task Systems, C++, clang, gcc, HPX
Introduction

Asynchronous many-task (AMT) systems have emerged as a new programming paradigm in the high performance computing (HPC) community [2]. Many of these applications would benefit from the highly optimized OpenMP-based linear algebra libraries currently available, e.g. Eigen, Blaze, and Intel MKL. However, there is a gap between OpenMP and AMT systems, since the user-level threads of the AMT systems interfere with the system threads of OpenMP, preventing efficient execution of the application. To close this gap, this paper introduces hpxMP, an implementation of the OpenMP standard which utilizes a C++ standard library for parallelism and concurrency (HPX) [3] to schedule and manage tasks. hpxMP implements standard-conforming OpenMP pragmas [1]. Furthermore, the OpenMP standard has, since OpenMP 3.0 [4], begun to introduce task-based concepts such as dependent tasks and task groups (OpenMP 4.0 [5]), task loops (OpenMP 4.5 [6]), and detached tasks (OpenMP 5.0 [7]). This work extends hpxMP with the task-related pragmas to provide the concept of task-based programming and validates its implementation against the Barcelona OpenMP Tasks Suite. Next, hpxMP is validated against the daxpy benchmark, OpenBLAS, and the Parallel Research Kernels benchmarks to compare the different OpenMP implementations: hpxMP, llvm-OpenMP, and GOMP.

This paper is structured as follows: Section 2 covers the related work. Section 3 briefly introduces the features of HPX utilized for the implementation of tasks within hpxMP. Section 4 emphasizes the implementation of OpenMP tasks within the HPX framework and shows the subset of OpenMP standard features implemented within hpxMP. Section 5 shows the comparison of the different OpenMP implementations and, finally, Section 6 concludes the work.
Related Work

Multi-threading solutions
For the exploitation of shared-memory parallelism on multi-core processors, many solutions are available and have been studied intensively. The most language-independent one is the POSIX thread execution model [8], which exposes fine-grained parallelism. In addition, there are more abstract library solutions like Intel's Threading Building Blocks (TBB) [9], Microsoft's Parallel Pattern Library (PPL) [10], and Kokkos [11]. TBB provides task parallelism using a C++ template library. PPL provides, in addition, features like parallel algorithms and containers. Kokkos provides a common C++ interface for parallel programming models, like CUDA and pthreads. Programming languages such as Chapel [12] provide parallel data and task abstractions. Cilk [13] extends the C/C++ language with parallel loop and fork-join constructs to provide single-node parallelism. For a very detailed review we refer to [2].

With the OpenMP 3.0 standard [4] the concept of task-based programming was added. The OpenMP 3.1 standard [14] introduced task optimizations. Dependent tasks and task groups, provided by the OpenMP 4.0 standard [5], improved the synchronization of tasks. The OpenMP 4.5 standard [6] allows task-based loops, and the OpenMP 5.0 standard [7] allows detaching of tasks.
Integration of Multi-threading solutions within distributed programming models.
Some major research has been done in the area of MPI+X [15–18], where the Message Passing Interface (MPI) is used as the distributed programming model and OpenMP as the multi-threaded shared-memory programming model. However, less research has been done for AMT+X, where asynchronous many-task systems are used as the distributed programming model and OpenMP as the shared-memory programming model. Charm++ integrated OpenMP's shared-memory parallelism into its distributed programming model to improve load balancing [19]. Kstar, a research C/C++ OpenMP compiler [20], was utilized to generate code compatible with StarPU [21] and Kaapi [22]. Only Kaapi implements the full set of the OpenMP specification [23], such as the capability to create tasks in the context of a task region. The successor XKaapi [24] provides a C++ task-based interface for both multi-core and multi-GPU systems, and only Kaapi provides multi-cluster support.
HPX

HPX is an open source C++ standard-conforming library for parallelism and concurrency for applications of any scale. One specialty of HPX is that its API offers a syntactically and semantically equivalent interface for local and remote function calls. HPX incorporates well-known concepts, e.g. static and dynamic dataflows, fine-grained futures-based synchronization, and continuation-style programming [25]. For more details we refer to [25–31].

After this brief overview, let us look into the relevant parts of HPX in the context of this work. HPX is an implementation of the ParalleX execution model [32] and generates hundreds of millions of so-called light-weight HPX threads (tasks). These light-weight threads provide fast context switching [3] and low per-thread overheads, which makes it feasible to schedule a large number of tasks with negligible overhead [33]. However, these light-weight HPX threads are not compatible with the user threads utilized by the current OpenMP implementations. For more details we refer to our previous paper [1]. This work adds support for the OpenMP 5.0 task features to hpxMP. A light-weight, HPX-thread-based implementation of the OpenMP standard enables applications that already use HPX for local and distributed computations to integrate highly optimized OpenMP-based libraries in the future.

Implementation

hpxMP is an implementation of the OpenMP standard which utilizes a C++ standard library for parallelism and concurrency (HPX) [3] to schedule and manage threads and tasks. We have described the fundamental implementation of hpxMP in previous work [1]. This section addresses the implementation of a few important classes in hpxMP and of the task features of the OpenMP 5.0 standard [7], such as taskgroup, task depend, and task_reduction, together with the optimizations for thread and task synchronization, which are the new contributions of this work.

An instance of the omp_task_data class is associated with each HPX thread by calling hpx::threads::set_thread_data. Instances of omp_task_data are passed as a raw pointer which is reinterpret_cast to size_t. For better memory management, a smart pointer, boost::intrusive_ptr, is introduced to wrap omp_task_data. The class omp_task_data holds the information describing a thread, such as a pointer to the current team, a taskLatch for synchronization, and whether the task is in a taskgroup. The omp_task_data can be retrieved by calling hpx::threads::get_thread_data when needed, which plays an important role in the hpxMP runtime. Another important class is parallel_region, containing the information of a team, such as the teamTaskLatch for task synchronization, the number of threads requested under the parallel region, and the depth of the current team.

Explicit tasks are created using the task construct in hpxMP. hpxMP implements the most recent OpenMP 5.0 tasking features and synchronization constructs, like task, taskyield, and taskwait. The supported clauses associated with the task construct are reduction, untied, private, firstprivate, shared, and depend. HPX threads are created for the task directives, and the tasks run on these newly created HPX threads. __kmpc_omp_task_alloc allocates and initializes a task and then returns the generated task to the runtime. __kmpc_omp_task is called with the generated task as a parameter, which is passed on to hpx_runtime::create_task. The tasks then run as normal-priority HPX threads created by calling the function hpx::applier::register_thread_nullary, see Listing 1. Synchronization in the tasking implementation of hpxMP is handled with the HPX latch, which will be discussed later in Section 4.3.

Task dependencies were introduced with OpenMP 4.0. With the depend clause, users specify dependencies that must be satisfied between tasks. In the implementation, futures in HPX are employed: hpx::future allows the separation of the initiation of an operation from the waiting for its result. A list of the tasks that the current task depends on is stored in a vector.
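The mapping of the depend clause onto hpx::future can be illustrated with a minimal sketch. This is not the hpxMP source: the map future_map, the helper schedule_task_with_deps, the restriction to in/out dependencies, and the missing locking are simplifications made purely for illustration.

    #include <hpx/include/lcos.hpp>   // hpx::future, hpx::shared_future, hpx::when_all
    #include <functional>
    #include <map>
    #include <vector>

    // Last producer of each address; later tasks depending on that address wait on
    // the stored future. Locking and the inout case are omitted in this sketch.
    std::map<void*, hpx::shared_future<void>> future_map;

    hpx::shared_future<void> schedule_task_with_deps(std::function<void()> task_body,
        std::vector<void*> const& in_deps, void* out_dep)
    {
        // Collect the futures of all tasks the new task depends on.
        std::vector<hpx::shared_future<void>> deps;
        for (void* addr : in_deps)
        {
            auto it = future_map.find(addr);
            if (it != future_map.end())
                deps.push_back(it->second);
        }

        // Run the task body only after all collected dependencies are ready.
        hpx::shared_future<void> done =
            hpx::when_all(deps)
                .then([task_body](hpx::future<std::vector<hpx::shared_future<void>>>) {
                    task_body();
                })
                .share();

        if (out_dep != nullptr)        // record this task as the last producer of out_dep
            future_map[out_dep] = done;
        return done;
    }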
Listing 1: Implementation of task scheduling in hpxMP

    kmp_task_t *__kmpc_omp_task_alloc(...)
    {
        kmp_task_t *task = (kmp_task_t *) new char[task_size + sizeof_shareds];
        // lots of initialization goes on here
        return task;
    }

    void hpx_runtime::create_task(kmp_routine_entry_t task_func, int gtid, intrusive_ptr ...
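Based on the description above, the following sketch shows one way create_task can register a task as an HPX thread: the omp_task_data of the creating thread is recovered via hpx::threads::get_thread_data, the team's latch is counted up before the task is spawned and counted down when it finishes, and hpx::applier::register_thread_nullary runs the task on an HPX thread (normal priority by default). The stand-in struct definitions and the exact parameter list are assumptions for illustration, not the actual hpxMP code.

    #include <hpx/hpx.hpp>

    // Minimal stand-ins for the hpxMP classes described in the text (assumption).
    struct parallel_region { hpx::lcos::local::latch teamTaskLatch{1}; };
    struct omp_task_data   { parallel_region* team = nullptr; };
    using kmp_routine_entry_t = int (*)(int gtid, void* task);

    void create_task(kmp_routine_entry_t task_func, int gtid, void* task)
    {
        // omp_task_data was attached to the creating HPX thread as a pointer
        // reinterpret_cast'ed to size_t (see the description above).
        auto* parent = reinterpret_cast<omp_task_data*>(
            hpx::threads::get_thread_data(hpx::threads::get_self_id()));

        // Account for the new task before spawning it, so that waiting constructs
        // know there is outstanding work.
        parent->team->teamTaskLatch.count_up(1);

        // Run the task on a newly created HPX thread (normal priority by default).
        hpx::applier::register_thread_nullary(
            [=]() {
                task_func(gtid, task);                      // execute the user task
                parent->team->teamTaskLatch.count_down(1);  // signal completion
            },
            "omp_explicit_task");
    }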
This section summarizes the previously presented features of the OpenMP standard implemented within hpxMP. Table 1 shows the directives provided by the program layer, which correspond to the main part of the presented library. Table 2 shows the runtime library functions of the OpenMP standard provided by hpxMP. Of course, the pragmas and runtime library functions implemented so far cover only a subset of the OpenMP standard.
Listing 3: Definition of the Latch class in HPX

    class Latch
    {
    public:
        void count_down_and_wait();
        void count_down(std::ptrdiff_t n);
        bool is_ready() const noexcept;
        void wait() const;
        void count_up(std::ptrdiff_t n);
        void reset(std::ptrdiff_t n);

    protected:
        mutable util::cache_line_data ...
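To make the use of this interface concrete, the sketch below joins the HPX threads of a parallel region with the Latch from Listing 3: the latch is reset to the number of spawned threads plus one for the master, each worker performs a single count_down when it finishes, and the master calls count_down_and_wait. Only the Latch interface is taken from the listing; the surrounding function, its parameters, and the thread body are illustrative assumptions.

    // Illustrative join of the HPX threads of a parallel region (not hpxMP code);
    // the latch is owned by the caller, e.g. as a member of the team object.
    void join_parallel_region(int num_threads, Latch& barrier,
                              std::function<void(int)> body)
    {
        barrier.reset(num_threads + 1);            // one count per worker plus the master

        for (int i = 0; i < num_threads; ++i)
        {
            hpx::applier::register_thread_nullary(
                [i, &barrier, body]() {
                    body(i);                       // work of the parallel region
                    barrier.count_down(1);         // a single atomic decrement per HPX thread
                },
                "omp_implicit_task");
        }

        barrier.count_down_and_wait();             // master waits in user space for all workers
    }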
omp_get_dynamic       omp_get_max_threads
omp_get_num_procs     omp_get_num_threads
omp_get_thread_num    omp_get_wtick
omp_get_wtime         omp_in_parallel
omp_init_lock         omp_init_nest_lock
omp_set_dynamic       omp_set_lock
omp_set_nest_lock     omp_set_num_threads
omp_test_lock         omp_test_nest_lock
omp_unset_lock        omp_unset_nest_lock
Table 2: Runtime library functions of the OpenMP standard implemented in hpxMP.
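As an illustration of this runtime layer, the small program below only uses pragmas and functions listed above; it is intended to behave identically whether it is linked against llvm-OpenMP, GOMP, or hpxMP (the build and link mechanics for hpxMP are not shown here).

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_num_threads(4);          /* request four threads */
        double t0 = omp_get_wtime();

        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("threads: %d, in parallel: %d\n",
                       omp_get_num_threads(), omp_in_parallel());
        }

        printf("elapsed: %f s (timer tick: %g s)\n",
               omp_get_wtime() - t0, omp_get_wtick());
        return 0;
    }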
Benchmarks

Daxpy

In order to compare the performance of the parallel for pragma, a fundamental pragma in OpenMP, the daxpy benchmark is used in this measurement. Daxpy multiplies a scalar c with a dense vector a of 32-bit floating point numbers, adds the result to another dense vector b (also 32-bit floats), and stores the result in the same vector b, i.e. b = c · a + b, where c ∈ R and a, b ∈ R^n. The daxpy benchmark reports its performance in Mega Floating Point Operations Per Second (MFLOP/s). We determine the speedup of the application by scaling our results to the single-threaded run of the benchmark using hpxMP.

Figure 1 shows the speedup ratio for different numbers of threads. Our first experiment compared the performance of the OpenMP implementations for the vector size of Figure 1d: llvm-OpenMP runs the fastest, followed by GOMP and hpxMP. Figure 1c shows that, for its vector size, GOMP and llvm-OpenMP are still able to exhibit some scaling while hpxMP struggles to scale beyond a moderate number of threads. For the two very large vector sizes, the three implementations perform almost identically. hpxMP is able to scale in these scenarios because there is sufficient work in each task to amortize the cost of the task management overheads.

DGEMM

We chose the DGEMM benchmark from the Parallel Research Kernels [34] (https://github.com/ParRes/Kernels) to test our implementation. The purpose of the DGEMM program is to test the performance of a dense matrix multiplication. The DGEMM benchmark reports its performance as execution time in seconds. Figure 2 shows the execution time for different numbers of threads. Figure 2a shows that hpxMP and llvm-OpenMP perform similarly, while both outperform GOMP. Figure 2b shows that, for its matrix size, GOMP and llvm-OpenMP are still able to exhibit some scaling while hpxMP struggles to scale and is slower than GOMP beyond 8 threads.

Barcelona OpenMP Tasks Suite

We chose the fast parallel sorting variation of the ordinary mergesort [35] from the Barcelona OpenMP Tasks Suite to test our implementation of tasks. We sorted a random array for a range of cut off values. The cut off value determines when to perform a serial quicksort instead of recursively dividing the array into portions for which new tasks are created. Higher cut off values create larger tasks and, therefore, fewer tasks. To simplify the experiment, the parallel merge is disabled and the threshold for insertion sort is fixed in this benchmark. For each cut off value, the execution time of hpxMP using one thread is selected as the base point to calculate the speedup values. Figure 3 shows the speedup ratio for different numbers of threads.

For the cut off value of Figure 3a, the array is divided into four sections and four tasks in total are created. The speedup curve increases rapidly for small numbers of threads, but no significant additional speedup is achieved with more threads in any of the three implementations. hpxMP and llvm-OpenMP show comparable performance while GOMP is slower. The cut off value of Figure 3b increases the number of tasks generated. In this case, llvm-OpenMP has a performance advantage while hpxMP and GOMP show comparable performance.
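The role of the cut off value can be illustrated with a simplified task-based sort. This is not the benchmark source: the actual benchmark splits the array into four parts and uses its own serial quicksort and merge, whereas the binary split, std::sort, and std::inplace_merge below are simplifications for illustration.

    #include <algorithm>
    #include <cstddef>

    // Simplified task-based sort: below the cut off the range is sorted serially,
    // otherwise two tasks are created for the halves and merged after a taskwait.
    // Call from inside "#pragma omp parallel" followed by "#pragma omp single".
    void task_sort(float* data, std::size_t n, std::size_t cut_off)
    {
        if (n <= cut_off)
        {
            std::sort(data, data + n);    // serial fallback: fewer but larger tasks
            return;
        }
        std::size_t half = n / 2;
        #pragma omp task
        task_sort(data, half, cut_off);
        #pragma omp task
        task_sort(data + half, n - half, cut_off);
        #pragma omp taskwait
        std::inplace_merge(data, data + half, data + n);   // serial merge
    }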
Figure 1: Scaling plots for the daxpy benchmark running with different vector sizes: (a), (b), (c), and (d). Larger vector sizes mean larger tasks are created. The speedup is calculated by scaling the execution time of a run by the execution time of the single-threaded run of hpxMP. A larger speedup factor means a smaller execution time of the sample.

Figure 2: Scaling plots for the Parallel Research Kernels DGEMM benchmark run with vector sizes of (a) 1000 and (b) 100, plotting the execution time (s) against the number of threads for hpxMP, llvm-OpenMP, and GOMP. A smaller execution time means better performance of the sample.
Figure 3: Scaling plots for the Barcelona OpenMP Tasks Suite sort benchmark run with four different cut off values: (a), (b), (c), and (d). Higher cut off values indicate that a smaller number of larger tasks is created. The speedup is calculated by scaling the execution time of a run by the execution time of the single-threaded run of hpxMP.

For the cut off value of Figure 3c, llvm-OpenMP shows a distinct performance advantage over hpxMP and GOMP. Nevertheless, hpxMP still scales across all the threads, while GOMP ceases to scale beyond a certain number of threads. For the cut off value of Figure 3d, a significant number of tasks is created and the work for each task is considerably small. Here, hpxMP does not scale due to the large amount of overhead associated with the creation of many user tasks. Because each task performs little work, the overhead that it creates is not amortized by the increase in concurrency.

For a global view, the speedup ratio r is shown in Figure 4, where the larger the heatmap value, the better the performance the OpenMP implementation achieved in comparison to hpxMP. Values below one mean that hpxMP outperforms the OpenMP implementation. As shown in the heatmap, llvm-OpenMP works best when the task granularity is small and the number of tasks created is high. GOMP is slower than the other two implementations in most cases. For large task sizes, hpxMP is comparable with llvm-OpenMP (Figure 4a). This result demonstrates that when the grain size of the tasks is chosen well, hpxMP does not incur a performance penalty. Here, more research has to be done on how hpxMP can handle task granularity and limit the overhead in task management for small grain sizes. Some related work can be found in [23, 24, 36–38].

Blaze

In this section, the dmatdmatadd benchmark from Blaze's benchmark suite is rerun to demonstrate the recent improvements in performance when compared to the authors' previous work [1]. Blaze [39] is a high performance C++ linear algebra library which can use different backends for parallelization. It also provides a benchmark suite, called Blazemark (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks), for comparing the performance of several linear algebra libraries, as well as of the different backends used by Blaze, for a selection of arithmetic operations. The results obtained from dmatdmatadd are presented, and graphs are shown for four specific numbers of cores. The series in the graphs are obtained by running the benchmark with llvm-OpenMP, an older version of hpxMP, and the current state of hpxMP [1].
Figure 4: Speedup ratio r of the Barcelona OpenMP Tasks Suite sort benchmark over several thread counts and cut off values for (a) llvm-OpenMP/hpxMP and (b) GOMP/hpxMP. Values greater than one mean that the OpenMP implementation achieved better performance than hpxMP; values below one indicate that hpxMP outperformed the OpenMP implementation.

Figure 5: Scaling plots (MFLOP/s against matrix size n) for the dmatdmatadd benchmark compared to llvm-OpenMP and the previous work of the authors [1] for different numbers of threads: (a), (b), (c), and (d).
The dense matrix addition benchmark (dmatdmatadd) adds two dense matrices A and B, where A, B ∈ R^{n×n}, and writes the result into the matrix C ∈ R^{n×n}. The operation can be written as C[i,j] = A[i,j] + B[i,j]. Blaze uses a threshold on the number of elements before parallelizing the operation; matrices with fewer elements than this threshold are added sequentially. Figure 5 demonstrates the new scaling results for the dmatdmatadd benchmark for the four different core counts. We observe a notable improvement in performance between the previous version of hpxMP and the current version. The performance now more closely mimics that of llvm-OpenMP.

With the results presented above, we showed that hpxMP has similar performance to llvm-OpenMP for larger input sizes. For some specific cases hpxMP was even faster than llvm-OpenMP. This occurs because the operation of joining HPX threads at the end of a parallel region introduces less overhead than the corresponding operation in llvm-OpenMP. Joining the HPX threads is now done with a latch which is executed in user space; the cost of the operation amounts to a single atomic decrement per spawned HPX thread. llvm-OpenMP, however, uses kernel threads and therefore must wait for the operating system to join the participating threads.

For smaller input sizes, however, hpxMP is less performant, as the overheads introduced by the HPX scheduler are more significant compared to the actual workload. HPX threads require their own stack segment because HPX threads are allowed to be suspended. OpenMP does not incur this overhead as launched tasks cannot be suspended. In this way, the llvm-OpenMP implementation produces fewer scheduling overheads.
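For reference, the dmatdmatadd operation measured in this section corresponds to the following Blaze expression. This is a sketch of the operation itself, not the Blazemark driver, and the initial values are arbitrary.

    #include <blaze/Math.h>
    #include <cstddef>

    // C = A + B on n-by-n dense matrices; Blaze parallelizes the assignment with its
    // configured shared-memory backend once the operands exceed its size threshold.
    void dmatdmatadd(std::size_t n)
    {
        blaze::DynamicMatrix<double> A(n, n, 1.0), B(n, n, 2.0), C(n, n);
        C = A + B;
    }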
Conclusion and Outlook

This work extended hpxMP, an implementation of the OpenMP standard utilizing the light-weight user-level threads of HPX, with a subset of the task features of the OpenMP 5.0 standard. This contribution is one step towards compatibility between OpenMP and AMT systems, as it demonstrates a technique that enables AMT systems to leverage highly optimized OpenMP libraries. For the Barcelona OpenMP Tasks Suite benchmark, hpxMP exhibited similar performance when compared to the other OpenMP runtimes for large task sizes. However, it was not able to compete with these runtimes when faced with small task sizes. This performance decrement arises from the more general-purpose threads created in HPX. For the parallel for pragma, hpxMP has similar performance for larger input sizes; by using the HPX latch, the performance could be improved. These results show that hpxMP provides a way of bridging the compatibility gap between OpenMP and AMTs with acceptable performance for larger input sizes or larger task sizes.

In the future, we plan to improve the performance for smaller input sizes by adding non-suspending threads to HPX, which do not require a stack and thus reduce the overhead of thread creation and management. Additionally, we plan to test the performance of HPX applications which use legacy OpenMP libraries, e.g. Intel MKL. However, for this, more of the OpenMP specification needs to be implemented within hpxMP. These experiments will serve as further validation of the techniques introduced in this paper.
Acknowledgment
The work on hpxMP is funded by the National Science Foundation (award ) and the Department of Defense (DTIC Contract FA8075-14-D-0002/0007). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the Department of Defense.
A Source code
The source code of hpxMP [40] and of HPX is available on GitHub (https://github.com/STEllAR-GROUP/hpxMP, https://github.com/STEllAR-GROUP/hpx), released under the Boost Software License (BSL-1.0).

References

[1] T. Zhang, S. Shirzad, P. Diehl, R. Tohid, W. Wei, and H. Kaiser, “An introduction to hpxMP: A modern OpenMP implementation leveraging HPX, an asynchronous many-task system,” in Proceedings of the International Workshop on OpenCL, ser. IWOCL'19. New York, NY, USA: ACM, 2019, pp. 13:1–13:10. [Online]. Available: http://doi.acm.org/10.1145/3318170.3318191
[2] P. Thoman, K. Dichev, T. Heller, R. Iakymchuk, X. Aguilar, K. Hasanov, P. Gschwandtner, P. Lemarinier, S. Markidis, H. Jordan et al., “A taxonomy of task-based parallel programming technologies for high-performance computing,”
The Journal of Supercomputing
Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1
Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014, Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
[12] B. L. Chamberlain, D. Callahan, and H. P. Zima, “Parallel Programmability and the Chapel Language,” International Journal of High Performance Computing Applications (IJHPCA), vol. 21, no. 3, pp. 291–312, 2007, https://dx.doi.org/10.1177/1094342007078442.
[13] C. E. Leiserson, “The Cilk++ concurrency platform,” in DAC '09: Proceedings of the 46th Annual Design Automation Conference
Computer, vol. 49, no. 8, p. 10, 2016.
[16] R. F. Barrett, D. T. Stark, C. T. Vaughan, R. E. Grant, S. L. Olivier, and K. T. Pedretti, “Toward an evolutionary task parallel integrated MPI+X programming model,” in Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 2015, pp. 30–39.
[17] R. Rabenseifner, G. Hager, and G. Jost, “Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes,” in . IEEE, 2009, pp. 427–436.
[18] L. Smith and M. Bull, “Development of mixed mode MPI/OpenMP applications,” Scientific Programming, vol. 9, no. 2-3, pp. 83–98, 2001.
[19] S. Bak, H. Menon, S. White, M. Diener, and L. Kale, “Integrating OpenMP into the Charm++ programming model,” in Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware. ACM, 2017, p. 4.
[20] E. Agullo, O. Aumage, B. Bramas, O. Coulaud, and S. Pitoiset, “Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method,”
IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 10, pp. 2794–2807, 2017.
[21] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: a unified platform for task scheduling on heterogeneous multicore architectures,” Concurrency and Computation: Practice and Experience, vol. 23, no. 2, pp. 187–198, 2011.
[22] T. Gautier, X. Besseron, and L. Pigeon, “Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors,” in Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, 2007, pp. 15–23.
[23] F. Broquedis, T. Gautier, and V. Danjean, “Libkomp, an efficient OpenMP runtime system for both fork-join and data flow paradigms,” in International Workshop on OpenMP. Springer, 2012, pp. 102–115.
[24] T. Gautier, J. V. Lima, N. Maillard, and B. Raffin, “XKaapi: A runtime system for data-flow task programming on heterogeneous architectures,” in . IEEE, 2013, pp. 1299–1308.
[25] H. Kaiser, T. Heller, B. A. Lelbach, A. Serio, and D. Fey, “HPX: A Task Based Programming Model in a Global Address Space,” in
Proceedings of the International Conference on Partitioned Global Address Space Programming Models (PGAS), ser. art. id 6, 2014, https://stellar.cct.lsu.edu/pubs/pgas14.pdf.
[26] T. Heller, H. Kaiser, and K. Iglberger, “Application of the ParalleX Execution Model to Stencil-Based Problems,” Computer Science - Research and Development, vol. 28, no. 2-3, pp. 253–261, 2012, https://stellar.cct.lsu.edu/pubs/isc2012.pdf.
[27] T. Heller, H. Kaiser, A. Schäfer, and D. Fey, “Using HPX and LibGeoDecomp for Scaling HPC Applications on Heterogeneous Supercomputers,” in Proceedings of the ACM/IEEE Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA, SC Workshop), ser. art. id 1, 2013, https://stellar.cct.lsu.edu/pubs/scala13.pdf.
[28] H. Kaiser, T. Heller, D. Bourgeois, and D. Fey, “Higher-level Parallelization for Local and Distributed Asynchronous Task-based Programming,” in Proceedings of the ACM/IEEE International Workshop on Extreme Scale Programming Models and Middleware (ESPM, SC Workshop), 2015, pp. 29–37, https://stellar.cct.lsu.edu/pubs/executors espm2 2015.pdf.
[29] H. Kaiser, B. A. L. aka wash, T. Heller, A. Berg, J. Biddiscombe, and M. S. et al., “STEllAR-GROUP/hpx: HPX V1.2.0: The C++ Standards Library for Parallelism and Concurrency,” Nov. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1484427
[30] T. Heller, H. Kaiser, P. Diehl, D. Fey, and M. A. Schweitzer, “Closing the performance gap with modern C++,” in High Performance Computing, ser. Lecture Notes in Computer Science, M. Taufer, B. Mohr, and J. M. Kunkel, Eds., vol. 9945. Springer International Publishing, 2016, pp. 18–31.
[31] B. Wagle, S. Kellar, A. Serio, and H. Kaiser, “Methodology for adaptive active message coalescing in task based runtime systems,” in . IEEE, 2018, pp. 1133–1140.
[32] H. Kaiser, M. Brodowicz, and T. Sterling, “ParalleX: An advanced parallel execution model for scaling-impaired applications,” in
Parallel Processing Workshops. Los Alamitos, CA, USA: IEEE Computer Society, 2009, pp. 394–401.
[33] B. Wagle, M. A. H. Monil, K. Huck, A. D. Malony, A. Serio, and H. Kaiser, “Runtime adaptive task inlining on asynchronous multitasking runtime systems,” in Proceedings of the 48th International Conference on Parallel Processing, ser. ICPP 2019. New York, NY, USA: ACM, 2019, pp. 76:1–76:10. [Online]. Available: http://doi.acm.org/10.1145/3337821.3337915
[34] R. F. Van der Wijngaart and T. G. Mattson, “The parallel research kernels,” in , Sep. 2014, pp. 1–6.
[35] S. G. Akl and N. Santoro, “Optimal parallel merging and sorting without memory conflicts,” IEEE Transactions on Computers, vol. 100, no. 11, pp. 1367–1369, 1987.
[36] T. Gautier, C. Pérez, and J. Richard, “On the impact of OpenMP task granularity,” in International Workshop on OpenMP. Springer, 2018, pp. 205–221.
[37] H. Vandierendonck, G. Tzenakis, and D. S. Nikolopoulos, “A unified scheduler for recursive and task dataflow parallelism,” in . IEEE, 2011, pp. 1–11.
[38] J. Bueno, L. Martinell, A. Duran, M. Farreras, X. Martorell, R. M. Badia, E. Ayguade, and J. Labarta, “Productive cluster programming with OmpSs,” in