Publication


Featured research published by Raymond Namyst.


european conference on parallel processing | 2011

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

Cédric Augonnet; Samuel Thibault; Raymond Namyst; Pierre-André Wacrenier

In the field of HPC, the current hardware trend is to design multiprocessor architectures featuring heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE) or data-parallel accelerators (e.g. GPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offloading parts of the computations. However, designing an execution model that unifies all computing units and their associated embedded memory remains one of the main challenges. We therefore designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and to easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have analyzed their efficiency on several algorithms running simultaneously over multiple cores and a GPU. In addition to substantial improvements in execution times, we have obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. We finally show that our dynamic approach competes with the highly optimized MAGMA library and overcomes the limitations of the corresponding static scheduling in a portable way.
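
For readers unfamiliar with StarPU's task model, the sketch below shows its basic shape: a codelet bundling the implementations of one kernel, data registered with the data management library, and a task handed to the scheduler. It follows the public StarPU 1.x C API (starpu.h); details such as starpu_task_insert and STARPU_MAIN_RAM vary slightly across releases, so treat this as a minimal sketch rather than a definitive example.

    /* Minimal StarPU sketch: scale a vector through a scheduled task.
     * Follows the StarPU 1.x C API; build flags typically come from
     * pkg-config (e.g. starpu-1.3). */
    #include <starpu.h>
    #include <stdint.h>

    static void scal_cpu(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *v = buffers[0];
        float *x = (float *)STARPU_VECTOR_GET_PTR(v);
        unsigned n = STARPU_VECTOR_GET_NX(v);
        float factor = *(float *)cl_arg;
        for (unsigned i = 0; i < n; i++)
            x[i] *= factor;
    }

    static struct starpu_codelet scal_cl = {
        .cpu_funcs = { scal_cpu },  /* a .cuda_funcs entry would add a GPU variant */
        .nbuffers  = 1,
        .modes     = { STARPU_RW },
    };

    int main(void)
    {
        float vec[1024];
        float factor = 2.0f;
        starpu_data_handle_t handle;

        if (starpu_init(NULL) != 0)
            return 1;
        for (int i = 0; i < 1024; i++)
            vec[i] = (float)i;

        /* Hand the buffer over to StarPU's data management library. */
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)vec, 1024, sizeof(float));

        /* The scheduler picks a processing unit and triggers any transfer. */
        starpu_task_insert(&scal_cl,
                           STARPU_RW, handle,
                           STARPU_VALUE, &factor, sizeof(factor),
                           0);

        starpu_data_unregister(handle);  /* waits for the task, fetches data back */
        starpu_shutdown();
        return 0;
    }

Adding a .cuda_funcs entry to the same codelet would let the scheduler run the task on a GPU instead, which is exactly the heterogeneity the paper targets.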


parallel, distributed and network-based processing | 2010

hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications

François Broquedis; Jérôme Clet-Ortega; Stéphanie Moreaud; Nathalie Furmento; Brice Goglin; Guillaume Mercier; Samuel Thibault; Raymond Namyst

The increasing numbers of cores, shared caches and memory nodes within machines introduce a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc can significantly improve performance by letting runtime systems place their tasks or adapt their communication strategies according to hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities, or placing MPI processes according to their communication patterns, yields interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be chosen dynamically according to the location of the communicating processes in the machine and its hardware characteristics.
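
As a concrete illustration of the usage pattern described above, the snippet below uses the stable hwloc C API to discover the machine topology and bind the calling thread to a core; error handling is abbreviated.

    /* Sketch using the stable hwloc C API: discover the topology and
     * bind the calling thread to the first core. Link with -lhwloc. */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);  /* gathers cores, caches, NUMA nodes... */

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        printf("%d cores detected\n", ncores);

        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
        if (core) {
            hwloc_cpuset_t set = hwloc_bitmap_dup(core->cpuset);
            hwloc_bitmap_singlify(set);  /* one hw thread: avoid intra-core moves */
            if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_THREAD) < 0)
                perror("hwloc_set_cpubind");
            hwloc_bitmap_free(set);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }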


international conference on parallel and distributed systems | 2010

Data-Aware Task Scheduling on Multi-accelerator Based Platforms

Cédric Augonnet; Jérôme Clet-Ortega; Samuel Thibault; Raymond Namyst

To fully tap into the potential of heterogeneous machines composed of multicore processors and multiple accelerators, simple offloading approaches in which the main trunk of the application runs on regular cores while only specific parts are offloaded on accelerators are not sufficient. The real challenge is to build systems where the application would permanently spread across the entire machine, that is, where parallel tasks would be dynamically scheduled over the full set of available processing units. To face this challenge, we previously proposed StarPU, a runtime system capable of scheduling tasks over multicore machines equipped with GPU accelerators. StarPU uses a software virtual shared memory (VSM) that provides a highlevel programming interface and automates data transfers between processing units so as to enable a dynamic scheduling of tasks. We now present how we have extended StarPU to minimize the cost of transfers between processing units in order to efficiently cope with multi-GPU hardware configurations. To this end, our runtime system implements data prefetching based on asynchronous data transfers, and uses data transfer cost prediction to influence the decisions taken by the task scheduler. We demonstrate the relevance of our approach by benchmarking two parallel numerical algorithms using our runtime system. We obtain significant speedups and high efficiency over multicore machines equipped with multiple accelerators. We also evaluate the behaviour of these applications over clusters featuring multiple GPUs per node, showing how our runtime system can combine with MPI.
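
The scheduling decision the abstract describes can be pictured as a minimal completion-time model. The sketch below is purely illustrative (all names are hypothetical; this is not StarPU source): the scheduler picks the processing unit minimizing the sum of its availability time, the predicted transfer cost for missing data, and the predicted execution time.

    /* Illustrative completion-time model (hypothetical names, not StarPU
     * source): pick the unit whose predicted termination time is lowest. */
    typedef struct {
        double ready_time;     /* when the unit finishes its queued work */
        double transfer_time;  /* predicted cost of fetching missing data */
        double exec_time;      /* predicted kernel duration on this unit */
    } unit_estimate_t;

    static int pick_unit(const unit_estimate_t *u, int n)
    {
        int best = 0;
        double best_end = u[0].ready_time + u[0].transfer_time + u[0].exec_time;
        for (int i = 1; i < n; i++) {
            double end = u[i].ready_time + u[i].transfer_time + u[i].exec_time;
            if (end < best_end) {
                best_end = end;
                best = i;
            }
        }
        return best;  /* prefetching can then start the transfer early */
    }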


parallel computing | 2001

The hyperion system: compiling multithreaded java bytecode for distributed execution

Gabriel Antoniu; Luc Bougé; Philip J. Hatcher; Mark MacBeth; Keith McGuigan; Raymond Namyst

Our work combines Java compilation to native code with a runtime library that executes Java threads in a distributed-memory environment. This allows a Java programmer to view a cluster of processors as executing a single Java virtual machine. The separate processors are simply resources for executing Java threads with true parallelism, and the runtime system provides the illusion of a shared memory on top of the private memories of the processors. The environment we present is available on top of several UNIX systems and can use a large variety of communication interfaces thanks to the high portability of its runtime system. To evaluate our approach, we compare serial C, serial Java, and multithreaded Java implementations of a branch-and-bound solution to the minimal-cost map-coloring problem. All measurements were carried out on two platforms using two different communication interfaces: SISCI/SCI and MPI BIP/Myrinet.


international parallel and distributed processing symposium | 2010

Structuring the execution of OpenMP applications for multicore architectures

François Broquedis; Olivier Aumage; Brice Goglin; Samuel Thibault; Pierre-André Wacrenier; Raymond Namyst

The now commonplace multicore chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers, as a tradeoff between the user-friendliness of shared memory on the one side, and memory access scalability and efficiency on the other. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads on multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs and collects relationship information about threads and data. This information is used, together with scheduling hints and hardware counter feedback, by the scheduler to select the most appropriate thread and data distribution. ForestGOMP features a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
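
The structured parallelism ForestGOMP extracts comes from ordinary nested OpenMP regions, as in the hedged sketch below: each parallel region opens a new team, and the nesting forms the tree that the runtime can map onto the cache and memory hierarchy. The code is standard OpenMP; the tree-based scheduling itself happens inside the runtime, and the team sizes here are only indicative.

    /* Standard OpenMP, nested regions: the outer team could map to NUMA
     * nodes and the inner ones to cores; the tree-based scheduling itself
     * happens inside the runtime. Compile with -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);                   /* allow inner teams */
        #pragma omp parallel num_threads(2)  /* e.g. one team per NUMA node */
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(4)  /* e.g. one thread per core */
            printf("team %d, thread %d\n", outer, omp_get_thread_num());
        }
        return 0;
    }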


international conference on parallel processing | 2009

Automatic calibration of performance models on heterogeneous multicore architectures

Cédric Augonnet; Samuel Thibault; Raymond Namyst

Multicore architectures featuring specialized accelerators are getting an increasing amount of attention, and this success will probably influence the design of future high-performance computing hardware. Unfortunately, programmers currently have a hard time exploiting all these heterogeneous computing units efficiently, and most existing efforts simply focus on providing tools to offload some computations onto the available accelerators. Recently, some runtime systems have been designed that exploit the idea of scheduling, as opposed to offloading, parallel tasks over the whole set of heterogeneous computing units. Scheduling tasks over heterogeneous platforms makes it necessary to use accurate prediction models in order to assign each task to its most adequate computing unit [2]. A deep knowledge of the application is usually required to define per-task performance models, based on the algorithmic complexity of the underlying numeric kernel. We present an alternative, auto-tuning performance prediction approach based on performance history tables built dynamically during the application run. This approach does not require the programmer to provide any specific information. We show that, thanks to a carefully chosen hash function, our approach quickly achieves accurate performance estimations automatically. Our approach even outperforms regular algorithmic performance models on several linear algebra numerical kernels.
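
A performance history table of the kind described can be pictured as a small hash map from a task "footprint" (the parameters that determine its duration, typically its data sizes) to a running mean of measured times. The sketch below is illustrative only, with hypothetical names, and ignores hash collisions for brevity:

    /* Illustrative history-based model (hypothetical names): hash the
     * task footprint (its data sizes) and keep a running mean of measured
     * execution times. Collisions are ignored for brevity. */
    #include <stddef.h>
    #include <stdint.h>

    #define HIST_SIZE 1024

    typedef struct { double sum; unsigned count; } hist_entry_t;
    static hist_entry_t table[HIST_SIZE];

    /* FNV-1a-style hash over the sizes that determine the duration. */
    static uint32_t footprint(const size_t *sizes, int n)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < n; i++)
            h = (h ^ (uint32_t)sizes[i]) * 16777619u;
        return h;
    }

    static void record(const size_t *sizes, int n, double measured)
    {
        hist_entry_t *e = &table[footprint(sizes, n) % HIST_SIZE];
        e->sum += measured;
        e->count++;
    }

    static double predict(const size_t *sizes, int n)
    {
        const hist_entry_t *e = &table[footprint(sizes, n) % HIST_SIZE];
        return e->count ? e->sum / e->count : -1.0;  /* -1: no history yet */
    }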


International Journal of Parallel Programming | 2010

ForestGOMP: An Efficient OpenMP Environment for NUMA Architectures

François Broquedis; Nathalie Furmento; Brice Goglin; Pierre-André Wacrenier; Raymond Namyst

Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing-based balancing strategies and next-touch-based data distribution policies. These techniques also provide insights into additional optimizations.
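
The data half of such a mixed migration strategy can be implemented on Linux with the move_pages(2) system call, as in the hedged sketch below; this is an assumption about one possible mechanism, not ForestGOMP source. The pages of a buffer are relocated to the NUMA node where the threads using it were moved.

    /* One possible data-migration mechanism on Linux (an assumption, not
     * ForestGOMP source): relocate a buffer's pages to a NUMA node with
     * move_pages(2). Needs <numaif.h>; link with -lnuma. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int migrate_to_node(void *buf, size_t len, int node)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        void **pages = malloc(npages * sizeof(void *));
        int *nodes   = malloc(npages * sizeof(int));
        int *status  = malloc(npages * sizeof(int));

        for (size_t i = 0; i < npages; i++) {
            pages[i] = (char *)buf + i * page;
            nodes[i] = node;  /* destination NUMA node for every page */
        }
        /* pid 0 = calling process; MPOL_MF_MOVE moves only our own pages. */
        long rc = move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE);
        if (rc < 0)
            perror("move_pages");
        free(pages); free(nodes); free(status);
        return (int)rc;
    }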


EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface | 2012

StarPU-MPI: task programming over clusters of machines enhanced with accelerators

Cédric Augonnet; Olivier Aumage; Nathalie Furmento; Raymond Namyst; Samuel Thibault

GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as it requires combining two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current approaches usually either limit themselves to offloading part of the computation, leaving CPUs idle, or require a static CPU/GPU work partitioning. We thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.
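
A sketch of what this extended task paradigm looks like to the programmer, following the starpu_mpi.h interface of recent StarPU 1.x releases (function names have changed across versions, so treat this as indicative rather than authoritative): each data handle is owned by an MPI rank, and starpu_mpi_task_insert() deduces the required transfers from the task's access modes.

    /* Indicative sketch of the starpu_mpi.h interface (names vary across
     * StarPU releases): handles carry an owner rank, and the runtime posts
     * the MPI transfers that the task graph requires. */
    #include <starpu_mpi.h>

    extern struct starpu_codelet scal_cl;  /* e.g. the codelet sketched earlier */

    void scale_distributed(starpu_data_handle_t *handles, int nblocks, float f)
    {
        for (int b = 0; b < nblocks; b++) {
            /* Assumes each handle was registered with
             * starpu_mpi_data_register(handles[b], tag, owner_rank). */
            starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl,
                                   STARPU_RW, handles[b],
                                   STARPU_VALUE, &f, sizeof(f),
                                   0);
        }
        starpu_mpi_wait_for_all(MPI_COMM_WORLD);
    }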


international parallel processing symposium | 1999

An Efficient and Transparent Thread Migration Scheme in the PM2 Runtime System

Gabriel Antoniu; Luc Bougé; Raymond Namyst

This paper describes a new iso-address approach to the dynamic allocation of data in a multithreaded runtime system with thread migration capability. The system guarantees that migrated threads and their associated static data are relocated at exactly the same virtual addresses on the destination nodes, so that no post-migration processing is needed to keep pointers valid. In the experiments reported, a thread can be migrated in less than 75 μs.
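
The iso-address principle can be pictured with a hypothetical sketch (not PM2 source): every node reserves the same virtual address range at startup, so a migrated thread's stack and data can be copied to identical addresses on the destination node and every pointer remains valid.

    /* Hypothetical sketch of the iso-address idea (not PM2 source): every
     * node reserves the same virtual range, so migrated data keeps its
     * addresses and pointers stay valid without post-processing. */
    #include <stdio.h>
    #include <sys/mman.h>

    #define ISO_BASE ((void *)0x500000000000ULL)  /* identical on every node */
    #define ISO_SIZE (1UL << 30)                  /* 1 GiB iso-address arena */

    void *iso_reserve(void)
    {
        /* MAP_FIXED pins the arena at ISO_BASE; a real system would first
         * check the range is free (e.g. MAP_FIXED_NOREPLACE on Linux). */
        void *p = mmap(ISO_BASE, ISO_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return NULL;
        }
        return p;  /* thread stacks and migratable data are carved from here */
    }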


international parallel and distributed processing symposium | 2001

MPICH/Madeleine: a true multi-protocol MPI for high performance networks

Olivier Aumage; Guillaume Mercier; Raymond Namyst

This paper introduces a version of MPICH that efficiently handles several different networks simultaneously. The core of the implementation relies on a device called ch-mad, which is built on Madeleine, a generic multi-protocol communication library. The performance achieved on the tested networks (Fast Ethernet, Scalable Coherent Interface and Myrinet) is very good. Indeed, this multi-protocol version of MPICH generally outperforms other free or commercial MPI implementations.

Collaboration


Dive into Raymond Namyst's collaborations.

Top Co-Authors

Luc Bougé (École normale supérieure de Lyon)

Cédric Augonnet (French Institute for Research in Computer Science and Automation)