Nathalie Furmento
LaBRI
Publication
Featured research published by Nathalie Furmento.
Parallel, Distributed and Network-Based Processing | 2010
François Broquedis; Jérôme Clet-Ortega; Stéphanie Moreaud; Nathalie Furmento; Brice Goglin; Guillaume Mercier; Samuel Thibault; Raymond Namyst
The increasing number of cores, shared caches, and memory nodes within machines introduces complex hardware topologies. High-performance computing applications now have to carefully adapt their placement and behavior to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly improve performance by letting runtime systems place their tasks or adapt their communication strategies according to hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities, or placing MPI processes according to their communication patterns, yields noticeable performance improvements thanks to hwloc. An optimized MPI communication strategy may also be chosen dynamically according to the location of the communicating processes in the machine and its hardware characteristics.
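As a minimal sketch of the kind of usage the abstract describes, the following C program queries the machine topology through the public hwloc 2.x API and binds the current thread to a core; the binding policy itself is an arbitrary choice for illustration.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;

    /* Build the hierarchical topology of the current machine. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Count the cores and NUMA nodes discovered by hwloc. */
    int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    int nnuma  = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
    printf("%d cores, %d NUMA nodes\n", ncores, nnuma);

    /* Bind the current thread to the first core, the kind of simple
       affinity decision a runtime system could make from this data. */
    hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
    if (core)
        hwloc_set_cpubind(topology, core->cpuset, HWLOC_CPUBIND_THREAD);

    hwloc_topology_destroy(topology);
    return 0;
}
```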
International Journal of Parallel Programming | 2010
François Broquedis; Nathalie Furmento; Brice Goglin; Pierre-André Wacrenier; Raymond Namyst
Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing-based balancing strategies and next-touch-based data distribution policies, and suggest further optimization opportunities.
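The runtime itself is not shown in the abstract; as a generic illustration of the kind of application structure it can exploit, here is a plain OpenMP sketch in which nested teams mirror the NUMA hierarchy and first-touch initialization places data near the threads. The team sizes are made-up placeholders.

```c
#include <omp.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    double *a = malloc(N * sizeof(double));

    omp_set_nested(1);  /* allow nested teams (deprecated since
                           OpenMP 5.0, kept here for brevity) */

    /* Outer team: one thread per NUMA node; inner teams: cores of
       a node. The nesting exposes the application structure that an
       affinity-aware runtime can map onto the hardware hierarchy. */
    #pragma omp parallel num_threads(2)
    {
        int node = omp_get_thread_num();
        #pragma omp parallel num_threads(4)
        {
            int t = omp_get_thread_num();
            long chunk = N / (2 * 4);
            long base  = ((long)node * 4 + t) * chunk;
            for (long i = base; i < base + chunk; i++)
                a[i] = 0.0;  /* first touch places pages near the team */
        }
    }

    free(a);
    return 0;
}
```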
International Parallel and Distributed Processing Symposium | 2009
Brice Goglin; Nathalie Furmento
As the number of cores per machine increases, memory architectures are being redesigned to avoid bus contention and meet higher throughput needs. The emergence of Non-Uniform Memory Access (NUMA) constraints has caused affinities between threads and buffers to become an important decision criterion for schedulers.
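As a concrete illustration of acting on such affinities, here is a hedged sketch that migrates a buffer's pages to a chosen NUMA node using the standard Linux move_pages(2) call from libnuma. This shows the generic kernel mechanism, not necessarily the paper's own scheme.

```c
#define _GNU_SOURCE
#include <numaif.h>   /* move_pages(); link with -lnuma */
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

/* Migrate every page of buf[0..len) to NUMA node `node`. */
static void migrate_buffer(void *buf, size_t len, int node)
{
    long pagesize  = sysconf(_SC_PAGESIZE);
    size_t npages  = (len + pagesize - 1) / pagesize;

    void **pages   = malloc(npages * sizeof(void *));
    int   *nodes   = malloc(npages * sizeof(int));
    int   *status  = malloc(npages * sizeof(int));

    for (size_t i = 0; i < npages; i++) {
        pages[i] = (char *)buf + i * pagesize;
        nodes[i] = node;   /* desired destination for each page */
    }

    /* A pid of 0 means "the calling process". */
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
        perror("move_pages");

    free(pages); free(nodes); free(status);
}
```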
EuroMPI'12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012
Cédric Augonnet; Olivier Aumage; Nathalie Furmento; Raymond Namyst; Samuel Thibault
GPU clusters are becoming widespread HPC platforms. Exploiting them is however challenging, as this requires two separate paradigms (MPI and CUDA or OpenCL) and careful load balancing due to node heterogeneity. Current approaches usually either offload only part of the computation, leaving CPUs idle, or require a static CPU/GPU work partitioning. We thus previously proposed StarPU, a runtime system able to dynamically schedule tasks within a single heterogeneous node. We show how we extended the task paradigm of StarPU with MPI to easily map the task graph onto MPI clusters and automatically benefit from optimized execution.
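A minimal sketch of this extended task paradigm, using the public StarPU-MPI API (data registration plus starpu_mpi_task_insert); the codelet is a trivial placeholder and error checking is omitted.

```c
#include <starpu.h>
#include <starpu_mpi.h>
#include <stdint.h>

/* A trivial CPU implementation: scale a vector in place. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *v   = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    for (unsigned i = 0; i < n; i++)
        v[i] *= 2.0f;
}

static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(int argc, char **argv)
{
    int rank;
    float x[1024] = { 0 };
    starpu_data_handle_t h;

    starpu_init(NULL);
    starpu_mpi_init(&argc, &argv, 1);   /* also initializes MPI */
    starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);

    /* Rank 0 owns the data; other ranks register a placeholder. */
    starpu_vector_data_register(&h, rank == 0 ? STARPU_MAIN_RAM : -1,
                                rank == 0 ? (uintptr_t)x : 0,
                                1024, sizeof(float));
    starpu_mpi_data_register(h, 42 /* MPI tag */, 0 /* owner rank */);

    /* StarPU decides which node executes the task from data
       ownership and posts the required MPI transfers automatically. */
    starpu_mpi_task_insert(MPI_COMM_WORLD, &scal_cl, STARPU_RW, h, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h);
    starpu_mpi_shutdown();
    starpu_shutdown();
    return 0;
}
```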
International Parallel and Distributed Processing Symposium | 2007
Olivier Aumage; Elisabeth Brunet; Nathalie Furmento; Raymond Namyst
Communication libraries have made dramatic progress over the past fifteen years, pushed by the success of cluster architectures as the preferred platform for high-performance distributed computing. However, many potential optimizations are left unexplored in the process of mapping application communication requests onto low-level network commands. The fundamental cause of this situation is that the design of communication subsystems is mostly focused on reducing latency by shortening the critical path. In this paper, we present a new communication scheduling engine that dynamically optimizes application requests according to the NIC's capabilities and activity. The optimizing code is generic and portable, and the database of optimizing strategies may be dynamically extended.
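The engine's actual interface is not given in the abstract; the sketch below is a hypothetical illustration of one optimizing strategy such an engine could host, aggregating queued small sends into a single packet to reduce the number of NIC commands. All structure and function names are invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define MTU 4096

/* One pending application send request (hypothetical structure;
 * the real engine's data types are not shown in the abstract). */
struct send_req { const void *buf; size_t len; };

struct packet { char data[MTU]; size_t used; };

/* Aggregation strategy: while the NIC is busy, pack as many queued
 * small requests as fit into one packet, trading a memcpy for fewer
 * (expensive) NIC commands. Returns the number of requests consumed. */
static size_t pack_requests(struct send_req *queue, size_t nreq,
                            struct packet *pkt)
{
    size_t i;
    pkt->used = 0;
    for (i = 0; i < nreq; i++) {
        if (pkt->used + queue[i].len > MTU)
            break;                       /* packet full: stop aggregating */
        memcpy(pkt->data + pkt->used, queue[i].buf, queue[i].len);
        pkt->used += queue[i].len;
    }
    return i;
}
```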
Parallel Computing Technologies | 2001
Françoise Baude; Denis Caromel; Nathalie Furmento; David Sagnol
In the framework of distributed object systems, this paper presents the concepts and an implementation of a mechanism for overlapping communication with computation. This mechanism makes it possible to decrease the execution time of a remote method invocation with large parameters. Its implementation and related experiments in the C++// language, running on top of Globus and Nexus, are described.
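The paper's mechanism is tied to C++// remote method invocations; as a plainer, standard illustration of overlapping communication with computation, here is a sketch using MPI nonblocking primitives (a deliberately different technique).

```c
#include <mpi.h>

#define N (1 << 20)

/* Overlap: start sending a large argument, compute on other data
 * while the transfer is in flight, then wait for completion. */
void overlapped_send(const double *arg, double *local, int dest)
{
    MPI_Request req;

    MPI_Isend(arg, N, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD, &req);

    /* Useful work that does not depend on the outgoing buffer. */
    for (int i = 0; i < N; i++)
        local[i] = local[i] * 2.0 + 1.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* safe to reuse arg afterwards */
}
```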
IEEE International Conference on High Performance Computing, Data, and Analytics | 1999
Françoise Baude; Denis Caromel; Nathalie Furmento; David Sagnol
In the framework of distributed object systems, this paper presents the concepts and an implementation of a mechanism for overlapping communication with computation. This mechanism makes it possible to decrease the execution time of a remote method invocation.
International Conference on Cluster Computing | 2009
Brice Goglin; Nathalie Furmento
Achieving high-performance message passing on top of generic Ethernet hardware is hampered by the NIC's interrupt-driven model, which usually involves interrupt coalescing. We present an in-depth study of the impact of interrupt coalescing on Open-MX performance. It shows that disabling coalescing only benefits small-message latency and is not relevant for most other metrics. Two new coalescing strategies are then presented to efficiently support both latency-friendly and coalescing-friendly workloads, by having the NIC look at Open-MX messages and streams before deciding when to raise interrupts. The implementation of these strategies in the firmware of Myri-10G NICs shows that Open-MX is now able to achieve low small-message latency, high large-message throughput, and a satisfactory message rate without having to manually tune the coalescing delay for each benchmark. Real application performance evaluation further shows that our modifications improve the NAS Parallel Benchmark IS execution time by 7–8%, thanks to our NIC firmware raising up to 20% additional interrupts at the correct time.
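The firmware itself is not reproduced here; the following is a hypothetical C sketch of the decision the abstract describes, raising an interrupt immediately when a latency-critical small message completes and deferring to the coalescing timer otherwise. The structure, field names, and threshold are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event descriptor: these fields are invented for the
 * illustration; the real firmware structures are not shown in the paper. */
struct mx_event {
    uint32_t msg_len;        /* total length of the incoming message */
    bool     completes_msg;  /* is this the last fragment of its message? */
};

#define SMALL_MSG_THRESHOLD 128  /* bytes; assumed cutoff */

/* Strategy from the abstract, paraphrased: interrupt immediately for
 * completed small messages (latency-friendly), coalesce everything
 * else (coalescing-friendly). */
static bool raise_interrupt_now(const struct mx_event *ev)
{
    if (ev->completes_msg && ev->msg_len <= SMALL_MSG_THRESHOLD)
        return true;   /* deliver latency-critical completion at once */
    return false;      /* defer to the normal coalescing timer */
}
```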
IEEE Transactions on Parallel and Distributed Systems | 2017
Emmanuel Agullo; Olivier Aumage; Mathieu Faverge; Nathalie Furmento; Florent Pruvost; Marc Sergent; Samuel Thibault
The emergence of accelerators as standard computing resources on supercomputers, and the subsequent increase in architectural complexity, revived the need for high-level parallel programming paradigms. The sequential task-based programming model has been shown to efficiently meet this challenge on a single multicore node, possibly enhanced with accelerators, which motivated its support in the OpenMP 4.0 standard. In this paper, we show that this paradigm can also be employed to achieve high performance on modern supercomputers composed of multiple such nodes, with extremely limited changes to the user code. To prove this claim, we have extended the StarPU runtime system with an advanced inter-node data management layer that supports this model by posting communications automatically. We illustrate our discussion with the task-based tile Cholesky algorithm that we implemented on top of this new runtime system layer. We show that it allows for very high productivity while achieving performance competitive with both the pure Message Passing Interface (MPI)-based ScaLAPACK Cholesky reference implementation and the DPLASMA Cholesky code, which implements another (non-sequential) task-based programming paradigm.
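The submission loop of the task-based tile Cholesky illustrates the model well; below is a condensed sketch in the style of the public starpu_mpi_task_insert API. The cl_potrf, cl_trsm, cl_syrk, and cl_gemm codelets and the tile handles A[i][j] are assumed to be registered beforehand, as in the StarPU-MPI example above.

```c
/* Sequential task flow: the loop nest reads like the sequential tile
 * Cholesky, while StarPU-MPI infers dependencies from the data access
 * modes and posts inter-node communications automatically. */
for (int k = 0; k < NT; k++) {
    starpu_mpi_task_insert(MPI_COMM_WORLD, &cl_potrf,
                           STARPU_RW, A[k][k], 0);
    for (int i = k + 1; i < NT; i++)
        starpu_mpi_task_insert(MPI_COMM_WORLD, &cl_trsm,
                               STARPU_R, A[k][k], STARPU_RW, A[i][k], 0);
    for (int i = k + 1; i < NT; i++) {
        for (int j = k + 1; j < i; j++)
            starpu_mpi_task_insert(MPI_COMM_WORLD, &cl_gemm,
                                   STARPU_R, A[i][k], STARPU_R, A[j][k],
                                   STARPU_RW, A[i][j], 0);
        starpu_mpi_task_insert(MPI_COMM_WORLD, &cl_syrk,
                               STARPU_R, A[i][k], STARPU_RW, A[i][i], 0);
    }
}
starpu_task_wait_for_all();
```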