Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Pierre-André Wacrenier is active.

Publication


Featured research published by Pierre-André Wacrenier.


European Conference on Parallel Processing | 2011

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Cédric Augonnet; Samuel Thibault; Raymond Namyst; Pierre-André Wacrenier

In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run-time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements regarding execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We eventually show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way.
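
For readers unfamiliar with the programming model, the following is a minimal sketch of how a computation is expressed with StarPU's documented C API (a codelet with a CPU implementation, registered data, asynchronous task submission), assuming a StarPU 1.2+ installation; the scheduling strategy itself is then picked at run time, e.g. via the STARPU_SCHED environment variable. The snippet is illustrative and not taken from the paper.

```c
/* Minimal StarPU sketch: one codelet with a CPU implementation, submitted as
 * an asynchronous task.  Build with something like:
 *   gcc scale.c $(pkg-config --cflags --libs starpu-1.3)
 * (the pkg-config module name depends on the installed version). */
#include <starpu.h>
#include <stdint.h>
#include <stdio.h>

/* Kernel run on a CPU worker; StarPU passes registered data as buffers. */
static void scale_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    struct starpu_vector_interface *v = buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; i++)
        x[i] *= 2.0f;
}

static struct starpu_codelet scale_cl = {
    .cpu_funcs = { scale_cpu },   /* a .cuda_funcs entry could be added for GPUs */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float x[16];
    for (int i = 0; i < 16; i++) x[i] = (float)i;

    if (starpu_init(NULL) != 0) return 1;

    /* Register the data so the runtime can manage transfers and coherency. */
    starpu_data_handle_t xh;
    starpu_vector_data_register(&xh, STARPU_MAIN_RAM, (uintptr_t)x, 16, sizeof(float));

    /* Submit one task; the selected scheduler picks an appropriate worker. */
    struct starpu_task *task = starpu_task_create();
    task->cl = &scale_cl;
    task->handles[0] = xh;
    starpu_task_submit(task);

    starpu_task_wait_for_all();
    starpu_data_unregister(xh);
    starpu_shutdown();

    printf("x[3] = %f\n", x[3]);
    return 0;
}
```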


International Parallel and Distributed Processing Symposium | 2010

Structuring the execution of OpenMP applications for multicore architectures

François Broquedis; Olivier Aumage; Brice Goglin; Samuel Thibault; Pierre-André Wacrenier; Raymond Namyst

The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers, as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other side. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads onto multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs and collects relationship information about threads and data. This information is used, together with scheduling hints and hardware counter feedback, by the scheduler to select the most appropriate thread and data distribution. ForestGOMP features a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
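
The kind of structure ForestGOMP extracts is visible directly in the source of an OpenMP program: nested parallel regions form a tree of thread teams. The sketch below is plain, standard OpenMP C (not ForestGOMP-specific code); the socket/cache mapping in the comments only illustrates how such a runtime might lay the teams out.

```c
/* Illustrative only: nesting OpenMP parallel regions gives the runtime a
 * tree of thread teams that can be mapped onto the machine's hierarchy
 * (sockets, shared caches, cores).  Compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

static void work(const char *who)
{
    printf("%s: thread %d of %d (nesting level %d)\n",
           who, omp_get_thread_num(), omp_get_num_threads(), omp_get_level());
}

int main(void)
{
    omp_set_max_active_levels(2);               /* allow one level of nesting */

    #pragma omp parallel num_threads(2)         /* outer team: e.g. one per socket */
    {
        work("outer");
        #pragma omp parallel num_threads(4)     /* inner teams: e.g. cores sharing a cache */
        work("inner");
    }
    return 0;
}
```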


International Journal of Parallel Programming | 2010

ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

François Broquedis; Nathalie Furmento; Brice Goglin; Pierre-André Wacrenier; Raymond Namyst

Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing based balancing strategies and next-touch-based data distribution policies, and point to further optimization opportunities.
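
A concrete illustration of the thread-memory affinity issue (standard OpenMP C, not the ForestGOMP API): under the usual first-touch page placement policy, initializing data with the same static loop schedule as the later compute phase keeps most accesses local to each NUMA node, which is the kind of distribution the runtime described above maintains dynamically when threads or data have to migrate.

```c
/* Illustrative only: with the usual Linux first-touch policy, touching each
 * array chunk from the thread that will later compute on it places the pages
 * on that thread's NUMA node, avoiding remote memory access penalties.
 * Compile with -fopenmp. */
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* Parallel first touch: pages end up near the threads of each chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* The compute phase reuses the same static distribution, so most
     * accesses stay local to the node that owns the pages. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] += 1.0;
        sum  += a[i];
    }

    printf("sum = %f\n", sum);
    free(a);
    return 0;
}
```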


Applications and Theory of Petri Nets | 2002

Data Decision Diagrams for Petri Net Analysis

Jean-Michel Couvreur; Emmanuelle Encrenaz; Emmanuel Paviot-Adet; Denis Poitrenaud; Pierre-André Wacrenier

This paper presents a new data structure, the Data Decision Diagram, equipped with a mechanism allowing the definition of application-specific operators. This mechanism is based on the combination of inductive linear functions, which offers great expressiveness while relieving the user of the burden of hard-coding traversals of a shared data structure. We demonstrate the pertinence of our approach through the implementation of a verification tool for various classes of Petri nets, including self-modifying and queueing nets.
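
The following is a deliberately simplified sketch of the data structure (illustrative C with hypothetical type names); it omits the hash-consing that gives real DDDs canonicity as well as the inductive homomorphisms used to encode Petri-net transitions.

```c
/* Simplified Data Decision Diagram sketch: each non-terminal node carries a
 * variable and a list of (value -> successor) arcs; <1> accepts, <0> rejects. */
#include <stdbool.h>
#include <stddef.h>

typedef struct ddd ddd_t;
typedef struct arc arc_t;

struct arc {            /* one outgoing edge, labelled by a value */
    int     value;
    ddd_t  *succ;
    arc_t  *next;
};

struct ddd {
    int     var;        /* variable tested by this node; unused for terminals */
    bool    terminal;   /* true for <0> and <1> */
    bool    accept;     /* meaningful only when terminal */
    arc_t  *arcs;       /* outgoing arcs of a non-terminal node */
};

/* Does the assignment sequence vals[0..n-1] belong to the set encoded by d? */
static bool ddd_member(const ddd_t *d, const int *vals, size_t n)
{
    if (d->terminal)
        return d->accept && n == 0;
    if (n == 0)
        return false;
    for (const arc_t *a = d->arcs; a; a = a->next)
        if (a->value == vals[0])
            return ddd_member(a->succ, vals + 1, n - 1);
    return false;
}
```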


International Workshop on OpenMP | 2008

Scheduling dynamic OpenMP applications over multicore architectures

François Broquedis; François Diakhaté; Samuel Thibault; Olivier Aumage; Raymond Namyst; Pierre-André Wacrenier

Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel region, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform [TBG+07]. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
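
The sketch below shows, in standard OpenMP C, the kind of irregular recursive nested parallelism such a policy targets; the cut-off values are arbitrary and only illustrate how the shape of the team tree depends on the input.

```c
/* Illustrative only: a recursive divide-and-conquer in which each level may
 * open a new parallel region, so the depth and width of the team tree are
 * data-dependent.  Compile with -fopenmp. */
#include <omp.h>
#include <stdio.h>

static long tree_sum(const long *a, long lo, long hi, int depth)
{
    if (hi - lo < 1024 || depth == 0) {       /* small or deep enough: sequential */
        long s = 0;
        for (long i = lo; i < hi; i++) s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2, left = 0, right = 0;
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        left  = tree_sum(a, lo,  mid, depth - 1);
        #pragma omp section
        right = tree_sum(a, mid, hi,  depth - 1);
    }
    return left + right;
}

int main(void)
{
    enum { N = 1 << 20 };
    static long a[N];
    for (long i = 0; i < N; i++) a[i] = 1;

    omp_set_max_active_levels(8);             /* let the nesting unfold */
    printf("sum = %ld\n", tree_sum(a, 0, N, 4));
    return 0;
}
```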


International Workshop on OpenMP | 2007

An Efficient OpenMP Runtime System for Hierarchical Architectures

Samuel Thibault; François Broquedis; Brice Goglin; Raymond Namyst; Pierre-André Wacrenier

Exploiting the full computational power of ever-deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. The resulting preliminary performance evaluation shows a significant speedup improvement on a typical NAS OpenMP benchmark application.
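
As a rough mental model only (hypothetical types and names, not the actual BubbleSched interface), a bubble can be thought of as a recursive grouping of threads that the scheduler distributes over the tree of hardware levels.

```c
/* Conceptual sketch: a "bubble" groups threads that should stay close to each
 * other, bubbles nest, and the scheduler recursively spreads the bubble tree
 * over the topology tree (machine, sockets, cores). */
#include <stdio.h>

#define MAX_CHILDREN 4

typedef struct bubble {
    const char    *name;
    int            nthreads;                 /* threads held directly */
    int            nchildren;
    struct bubble *children[MAX_CHILDREN];   /* nested bubbles (inner teams) */
} bubble_t;

typedef struct level {
    const char   *name;                      /* "machine", "socket0", ... */
    int           nchildren;
    struct level *children[MAX_CHILDREN];
} level_t;

/* Spread the children of a bubble over the children of a topology level,
 * keeping each sub-bubble together on one level. */
static void distribute(const bubble_t *b, const level_t *l, int depth)
{
    printf("%*sbubble %s (%d thread(s)) -> %s\n",
           2 * depth, "", b->name, b->nthreads, l->name);
    for (int i = 0; i < b->nchildren; i++) {
        const level_t *target = l->nchildren ? l->children[i % l->nchildren] : l;
        distribute(b->children[i], target, depth + 1);
    }
}

int main(void)
{
    level_t sock0   = { "socket0", 0, {0} }, sock1 = { "socket1", 0, {0} };
    level_t machine = { "machine", 2, { &sock0, &sock1 } };

    bubble_t team0   = { "team0", 4, 0, {0} }, team1 = { "team1", 4, 0, {0} };
    bubble_t program = { "program", 0, 2, { &team0, &team1 } };

    distribute(&program, &machine, 0);
    return 0;
}
```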


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

Composing Multiple StarPU Applications over Heterogeneous Machines: A Supervised Approach

Andra-Ecaterina Hugo; Abdou Guermouche; Pierre-André Wacrenier; Raymond Namyst

Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a single runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes or memory bus contention. This paper presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within scheduling contexts that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a hypervisor that automatically expands or shrinks contexts using feedback from the runtime system (e.g. resource utilization). We demonstrate the relevance of our approach using benchmarks invoking multiple high performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (-34%), most notably by reducing the average cache miss ratio (-50%).
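
The resizing mechanism can be pictured with a toy model (hypothetical names throughout; this is not StarPU's scheduling-context API): a hypervisor periodically reads per-context feedback and moves workers from an under-utilized context to a busier one.

```c
/* Toy model of the supervised resizing idea: two "contexts" own disjoint sets
 * of workers; the hypervisor compares their idle ratios and moves one worker
 * from the most idle context to the busiest one when the gap is large. */
#include <stdio.h>

#define NWORKERS 8

typedef struct {
    const char *name;
    int         nworkers;
    double      idle_ratio;   /* feedback reported by the runtime, in [0,1] */
} context_t;

static void hypervisor_step(context_t *a, context_t *b)
{
    context_t *idle = a->idle_ratio > b->idle_ratio ? a : b;
    context_t *busy = idle == a ? b : a;

    /* Resize only when the imbalance is significant and a worker can move. */
    if (idle->idle_ratio - busy->idle_ratio > 0.2 && idle->nworkers > 1) {
        idle->nworkers--;
        busy->nworkers++;
        printf("hypervisor: moved 1 worker from %s to %s (now %d/%d)\n",
               idle->name, busy->name, a->nworkers, b->nworkers);
    }
}

int main(void)
{
    context_t cholesky = { "cholesky", NWORKERS / 2, 0.05 };
    context_t qr       = { "qr",       NWORKERS / 2, 0.60 };

    for (int step = 0; step < 3; step++) {
        hypervisor_step(&cholesky, &qr);
        qr.idle_ratio -= 0.25;            /* pretend the feedback improves */
    }
    return 0;
}
```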


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Composing multiple StarPU applications over heterogeneous machines: A supervised approach

Andra-Ecaterina Hugo; Abdou Guermouche; Pierre-André Wacrenier; Raymond Namyst

Enabling HPC applications to perform efficiently when invoking multiple parallel libraries simultaneously is a great challenge. Even if a uniform runtime system is used underneath, scheduling tasks or threads coming from different libraries over the same set of hardware resources introduces many issues, such as resource oversubscription, undesirable cache flushes and memory bus contention. This paper presents an extension of StarPU, a runtime system specifically designed for heterogeneous architectures, that allows multiple parallel codes to run concurrently with minimal interference. Such parallel codes run within scheduling contexts that provide confined execution environments which can be used to partition computing resources. Scheduling contexts can be dynamically resized to optimize the allocation of computing resources among concurrently running libraries. We introduce a hypervisor that automatically expands or shrinks contexts using feedback from the runtime system (e.g. resource utilization). We demonstrate the relevance of our approach using benchmarks invoking multiple high performance linear algebra kernels simultaneously on top of heterogeneous multicore machines. We show that our mechanism can dramatically improve the overall application run time (− 34%), most notably by reducing the average cache miss ratio (− 50%).


European Conference on Parallel Processing | 2005

An efficient multi-level trace toolkit for multi-threaded applications

Vincent Danjean; Raymond Namyst; Pierre-André Wacrenier

Nowadays, observing and understanding the behavior and performance of a multi-threaded application is a nontrivial task, especially within a complex multi-threaded environment such as a multi-level thread scheduler. In this paper, we present a trace toolkit that allows programmers to precisely analyze the behavior of a multi-threaded application. Running an application through this toolkit generates several traces which are merged and analyzed offline. The resulting super-trace contains not only classical information but also detailed information about thread scheduling at multiple levels.
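
The offline step essentially amounts to merging timestamp-ordered event streams into one super-trace; the sketch below (illustrative C with a hypothetical event layout, far simpler than a real trace format) shows the core of such a merge.

```c
/* Illustrative only: merge two per-thread event traces, each already sorted
 * by timestamp, into a single timestamp-ordered "super-trace". */
#include <stdio.h>

typedef struct {
    double      t;        /* timestamp */
    int         thread;   /* which thread (or scheduler level) emitted it */
    const char *what;
} event_t;

/* Merge two sorted traces into out[]; returns the number of events. */
static int merge(const event_t *a, int na, const event_t *b, int nb, event_t *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = a[i].t <= b[j].t ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
    return k;
}

int main(void)
{
    event_t t0[] = { {0.1, 0, "create"}, {0.7, 0, "join"} };
    event_t t1[] = { {0.3, 1, "run"},    {0.5, 1, "yield"} };
    event_t all[4];

    int n = merge(t0, 2, t1, 2, all);
    for (int i = 0; i < n; i++)
        printf("%.1f thread %d: %s\n", all[i].t, all[i].thread, all[i].what);
    return 0;
}
```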


International Colloquium on Automata, Languages and Programming | 1995

Computing the Closure of Sets of Words Under Partial Commutations

Yves Métivier; Gwénaël Richomme; Pierre-André Wacrenier

The aim of this paper is the study of a procedure S given in [11, 13]. We prove that this procedure can compute the closure of the star of a closed recognizable set of words if and only if this closure is also recognizable. This necessary and sufficient condition gives a semi-algorithm for the Star Problem. As intermediate results, using S, we give new proofs of some known results.
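
As an illustrative example (not taken from the paper) of why only a semi-algorithm can be expected: the closure of the star of a closed recognizable set may itself fail to be recognizable.

```latex
% Take $A=\{a,b\}$ with the partial commutation $ab \sim ba$, and write $[L]$
% for the closure of $L$ under this commutation.
\[
  C = \{ab,\, ba\} = [\,\{ab\}\,] \quad\text{is closed and recognizable,}
\]
\[
  [\,C^{*}\,] \;=\; \{\, w \in A^{*} : |w|_a = |w|_b \,\}
  \quad\text{is not recognizable (pumping lemma).}
\]
% Hence the procedure S can only terminate with an answer when the closure of
% the star happens to be recognizable, i.e. it is a semi-algorithm.
```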

Collaboration


Dive into Pierre-André Wacrenier's collaborations.

Top Co-Authors


Vincent Danjean

École normale supérieure de Lyon
