Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Samuel Thibault is active.

Publication


Featured research published by Samuel Thibault.


European Conference on Parallel Processing | 2011

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Cédric Augonnet; Samuel Thibault; Raymond Namyst; Pierre-André Wacrenier

In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a major challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way.
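
The codelet/task model at the heart of StarPU can be sketched in a few lines of C. The program below is a minimal sketch assuming a recent StarPU release (entry points such as starpu_task_insert and the STARPU_MAIN_RAM constant have changed names across versions): it registers a vector with the data management library and submits one task that the runtime is free to place on any suitable processing unit.

```c
#include <stdint.h>
#include <starpu.h>

/* CPU implementation of the task: scale a vector in place. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

/* A codelet bundles the implementations available for one kind of task;
 * adding a .cuda_funcs entry would let the scheduler run it on a GPU too. */
static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float v[1024];
    float factor = 2.0f;
    for (unsigned i = 0; i < 1024; i++)
        v[i] = (float)i;

    if (starpu_init(NULL) != 0)
        return 1;

    /* Register the vector so the data management library can migrate it
     * between memory nodes as the scheduler sees fit. */
    starpu_data_handle_t h;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)v,
                                1024, sizeof(float));

    /* Submit one asynchronous task; the active scheduling policy picks
     * the processing unit that executes it. */
    starpu_task_insert(&scal_cl,
                       STARPU_RW, h,
                       STARPU_VALUE, &factor, sizeof(factor),
                       0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_shutdown();
    return 0;
}
```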


Parallel, Distributed and Network-Based Processing | 2010

hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications

François Broquedis; Jérôme Clet-Ortega; Stéphanie Moreaud; Nathalie Furmento; Brice Goglin; Guillaume Mercier; Samuel Thibault; Raymond Namyst

The increasing number of cores, shared caches and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior according to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly help performance by letting runtime systems place their tasks or adapt their communication strategies depending on hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities or placing MPI processes according to their communication patterns shows interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be dynamically chosen according to the location of the communicating processes in the machine and its hardware characteristics.
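
The hwloc C API itself is small. A minimal sketch using its long-standing core entry points discovers the machine topology, counts cores, and binds the calling thread to the first core's cpuset:

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    /* Discover the machine: packages, caches, NUMA nodes, cores, PUs. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    printf("machine has %d cores\n", ncores);

    /* Bind the current thread to the first core. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (core && hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD) < 0)
        perror("hwloc_set_cpubind");

    hwloc_topology_destroy(topo);
    return 0;
}
```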


International Parallel and Distributed Processing Symposium | 2011

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

Emmanuel Agullo; Cédric Augonnet; Jack J. Dongarra; Mathieu Faverge; Hatem Ltaief; Samuel Thibault; Stanimire Tomov

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method proceeds in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each aimed at execution on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact on performance of these GPU kernels. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, respectively based on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitations to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system. We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
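
To make the first step concrete, here is a schematic of the classical tile QR task graph in C. The submit function is a hypothetical print-only stand-in for a runtime's task-insertion call; a runtime such as StarPU would infer the dependencies between these tasks from the read/write mode of each tile argument.

```c
#include <stdio.h>

/* Hypothetical stand-in for a runtime's task-insertion call: just prints
 * the kernel name and the two tiles it touches. */
static void submit(const char *kernel, int i, int j, int m, int n)
{
    printf("%-5s on tiles (%d,%d) and (%d,%d)\n", kernel, i, j, m, n);
}

/* Task graph of the classical tile QR factorization on a T x T tile grid. */
void tile_qr(int T)
{
    for (int k = 0; k < T; k++) {
        submit("GEQRT", k, k, k, k);           /* factor diagonal tile (RW) */
        for (int n = k + 1; n < T; n++)
            submit("ORMQR", k, k, k, n);       /* apply Q^T along row k */
        for (int m = k + 1; m < T; m++) {
            submit("TSQRT", k, k, m, k);       /* couple tile (m,k) with the panel */
            for (int n = k + 1; n < T; n++)
                submit("TSMQR", m, k, m, n);   /* update trailing tile (m,n) */
        }
    }
}

int main(void)
{
    tile_qr(3);   /* 3 x 3 tiles: enough to exercise every kernel type */
    return 0;
}
```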


International Conference on Parallel and Distributed Systems | 2010

Data-Aware Task Scheduling on Multi-accelerator Based Platforms

Cédric Augonnet; Jérôme Clet-Ortega; Samuel Thibault; Raymond Namyst

To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded on accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a high-level programming interface and automates data transfers between processing units so as to enable dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
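
The core idea of transfer-aware scheduling can be reduced to a toy cost model: instead of choosing the worker with the lowest predicted execution time alone, choose the one with the lowest predicted execution time plus predicted transfer time for the inputs it does not yet hold. The C sketch below is illustrative only (array sizes and names are not StarPU's API):

```c
#define NWORKERS 4

/* Toy data-aware scheduling decision: exec[w] is the predicted execution
 * time of a task on worker w, xfer[w] the predicted time to move the
 * inputs that worker w's memory node does not yet hold (0 if the data is
 * already resident). */
static int pick_worker(const double exec[NWORKERS], const double xfer[NWORKERS])
{
    int best = 0;
    double best_cost = exec[0] + xfer[0];
    for (int w = 1; w < NWORKERS; w++) {
        double cost = exec[w] + xfer[w];
        if (cost < best_cost) {
            best_cost = cost;
            best = w;
        }
    }
    return best;   /* a fast GPU with cold data can lose to a warm CPU core */
}
```

Policies in this spirit also fold in each worker's expected availability; the transfer term is what makes data locality visible to the scheduler and what prefetching works to drive toward zero.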


International Parallel and Distributed Processing Symposium | 2010

Structuring the execution of OpenMP applications for multicore architectures

François Broquedis; Olivier Aumage; Brice Goglin; Samuel Thibault; Pierre-André Wacrenier; Raymond Namyst

The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other side. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads onto multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs. It collects relationship information about threads and data as well. This information is used together with scheduling hints and hardware counter feedback by the scheduler to select the most appropriate threads and data distribution. ForestGOMP features a high-level platform for developing and tuning portable threads schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
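
The structured trees mentioned above arise naturally from nested OpenMP parallel regions. The small C program below is standard OpenMP, nothing ForestGOMP-specific; it shows the shape of input such a runtime works with: an outer team that can be spread across NUMA nodes, and inner teams that the scheduler can keep close to their shared data.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_nested(1);   /* allow inner regions to create their own teams */

    /* Outer team: e.g. one thread per NUMA node. */
    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        /* Inner team: threads that share data and should stay together. */
        #pragma omp parallel num_threads(4)
        printf("outer %d, inner %d\n", outer, omp_get_thread_num());
    }
    return 0;
}
```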


International Conference on Parallel Processing | 2009

Automatic calibration of performance models on heterogeneous multicore architectures

Cédric Augonnet; Samuel Thibault; Raymond Namyst

Multicore architectures featuring specialized accelerators are getting an increasing amount of attention, and this success will probably influence the design of future High Performance Computing hardware. Unfortunately, programmers are actually having a hard time trying to exploit all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations on available accelerators. Recently, some runtime systems have been designed that exploit the idea of scheduling - as opposed to offloading - parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most adequate computing unit [2]. A deep knowledge of the application is usually required to derive per-task performance models, based on the algorithmic complexity of the underlying numeric kernel. We present an alternate, auto-tuning performance prediction approach based on performance history tables dynamically built during the application run. This approach does not require the programmer to provide any specific information. We show that, thanks to the use of a carefully chosen hash function, our approach quickly achieves accurate performance estimations automatically. Our approach even outperforms regular algorithmic performance models with several linear algebra numerical kernels.
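
A history-based predictor of this kind can be sketched in a few lines of C. Everything below is illustrative rather than StarPU's internals (the hash, table layout, and names are assumptions): a footprint hash over the task's data dimensions indexes a table of running averages, which is updated after each execution and consulted before each scheduling decision.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024

/* One bucket of a performance-history table: a footprint summarizing the
 * task's input dimensions, plus a running mean of measured durations. */
struct perf_entry {
    uint32_t footprint;
    double   mean_us;
    unsigned nsamples;
};

static struct perf_entry table[TABLE_SIZE];

/* FNV-1a hash over the task's size parameters: tasks with the same data
 * shape land in the same bucket and share a prediction. */
static uint32_t footprint(const size_t *dims, int ndims)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < ndims; i++)
        for (size_t b = 0; b < sizeof dims[i]; b++) {
            h ^= ((const unsigned char *)&dims[i])[b];
            h *= 16777619u;
        }
    return h;
}

/* Called after each task execution; the stored mean becomes the
 * prediction the scheduler consults before the next execution. */
void record_sample(const size_t *dims, int ndims, double measured_us)
{
    uint32_t fp = footprint(dims, ndims);
    struct perf_entry *e = &table[fp % TABLE_SIZE];
    e->footprint = fp;
    e->nsamples++;
    e->mean_us += (measured_us - e->mean_us) / e->nsamples;
}
```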


EuroMPI'12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

StarPU-MPI: task programming over clusters of machines enhanced with accelerators

Cédric Augonnet; Olivier Aumage; Nathalie Furmento; Raymond Namyst; Samuel Thibault

GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as this requires two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current approaches usually either limit themselves to offloading part of the computation and leaving CPUs idle, or require static CPU/GPU work partitioning. We thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.
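
A sketch of what the extended API looks like, assuming a recent StarPU-MPI release (the insertion call was historically named starpu_mpi_insert_task, and scal_cl here stands for any codelet, such as the vector-scaling one sketched earlier): each data handle is assigned an owner rank, after which task insertion mirrors the single-node call and the MPI transfers implied by the task graph are generated automatically.

```c
#include <starpu_mpi.h>

/* Any codelet, e.g. the vector-scaling one sketched above (assumption). */
extern struct starpu_codelet scal_cl;

void distributed_scal(starpu_data_handle_t h, int owner_rank, float *factor)
{
    /* Declare which rank owns the data and the MPI tag identifying it. */
    starpu_mpi_data_register(h, /* tag */ 42, owner_rank);

    /* Called collectively on every rank; the runtime executes the task
     * where the written data lives and posts the sends/receives that the
     * dependencies require. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl,
                           STARPU_RW, h,
                           STARPU_VALUE, factor, sizeof(*factor),
                           0);
}
```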


Design, Automation, and Test in Europe | 2012

Programmability and performance portability aspects of heterogeneous multi-/manycore systems

Christoph W. Kessler; Usman Dastgeer; Samuel Thibault; Raymond Namyst; Andrew Richards; Uwe Dolinsky; Siegfried Benkner; Jesper Larsson Träff; Sabri Pllana

We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and Offload compiler to generate platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrated programming framework for heterogeneous multicore systems.


International Workshop on OpenMP | 2014

Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

Philippe Virouleau; Pierrick Brunet; François Broquedis; Nathalie Furmento; Samuel Thibault; Olivier Aumage; Thierry Gautier

The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. We eventually evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
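
The OpenMP 4.0 construct being evaluated looks as follows in C: the depend clauses declare each task's reads and writes, and the runtime orders the tasks accordingly, with no taskwait between producer and consumers.

```c
#include <stdio.h>

int main(void)
{
    int x = 0, y = 0, z = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)
        x = 42;                        /* producer */

        #pragma omp task depend(in: x) depend(out: y)
        y = x + 1;                     /* waits for the producer */

        #pragma omp task depend(in: x) depend(out: z)
        z = x * 2;                     /* may run concurrently with y's task */
    }   /* the barrier here waits for all tasks */

    printf("y=%d z=%d\n", y, z);       /* prints y=43 z=84 */
    return 0;
}
```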


International Workshop on OpenMP | 2008

Scheduling dynamic OpenMP applications over multicore architectures

François Broquedis; François Diakhaté; Samuel Thibault; Olivier Aumage; Raymond Namyst; Pierre-André Wacrenier

Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel region, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform [TBG+07]. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
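
The kind of irregular recursive parallelism targeted here can be written with nested parallel regions, as in the standard OpenMP sketch below (illustrative, not the surface reconstruction code): each recursion level opens a new team, so threads working on the same subtree form a group that a NUMA-aware scheduler can keep together, stealing across nodes only when a whole subtree runs dry.

```c
#include <omp.h>
#include <stdio.h>

/* Recursive, irregular nested parallelism: each call opens a new team of
 * two, so threads of the same region stay on the same subtree. */
static long tree_sum(int depth)
{
    if (depth == 0)
        return 1;
    long left = 0, right = 0;
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        left = tree_sum(depth - 1);
        #pragma omp section
        right = tree_sum(depth - 1);
    }
    return left + right;
}

int main(void)
{
    omp_set_nested(1);
    printf("%ld leaves\n", tree_sum(4));   /* prints 16 */
    return 0;
}
```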

Collaboration


Dive into Samuel Thibault's collaborations.

Top Co-Authors

Cédric Augonnet

French Institute for Research in Computer Science and Automation
