Publication


Featured research published by Guillaume Mercier.


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments

Guillaume Mercier; Jérôme Clet-Ortega

This paper presents a method to efficiently place MPI processes on multicore machines. Since MPI implementations often feature efficient support for both shared-memory and network communication, an adequate placement policy is a crucial step toward improving application performance. As a case study, we show the results obtained for several NAS computing kernels and explain how the policy influences overall performance. In particular, we found that a policy that merely increases the intranode communication ratio is not enough and that cache utilization is also an influential factor. A more sophisticated policy (e.g., one taking into account the architecture's memory structure) is required to observe performance improvements.
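
As a rough illustration of the placement machinery involved (not the paper's policy), the following minimal sketch binds each MPI rank to a core with hwloc using a plain round-robin mapping; by the paper's own finding, such a naive policy is insufficient without cache awareness:

```c
/* Minimal sketch: bind each MPI rank to one core with hwloc.
 * Plain round-robin mapping for illustration only; the paper's
 * policy additionally accounts for the cache and memory structure. */
#include <mpi.h>
#include <hwloc.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                             rank % ncores);
    /* Bind the whole process to the chosen core's cpuset. */
    if (hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) != 0)
        fprintf(stderr, "rank %d: binding failed\n", rank);

    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}
```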


International Conference on Parallel Processing | 2006

Data Transfers between Processes in an SMP System: Performance Study and Application to MPI

Darius Buntinas; Guillaume Mercier; William Gropp

This paper focuses on the transfer of large amounts of data in SMP systems. Achieving good performance for intranode communication is critical for developing an efficient communication system, especially in the context of SMP clusters. We evaluate the performance of five transfer mechanisms: shared-memory buffers, message queues, the ptrace system call, a kernel-module-based copy, and a high-speed network. We evaluate each mechanism based on latency, bandwidth, its impact on application cache usage, and its suitability to support MPI two-sided and one-sided messages.
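
For context, here is a minimal sketch of the first mechanism, shared-memory buffers, assuming POSIX shm; the segment name and function names are hypothetical, and all synchronization plus the pipelined double buffering used in practice are omitted:

```c
/* Minimal sketch of the shared-memory-buffer mechanism: the sender
 * copies into a POSIX shared segment, the receiver copies out, i.e.
 * two copies per message. Synchronization is deliberately omitted. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/xfer_buf"   /* hypothetical segment name */
#define BUF_SIZE (1 << 20)

/* Sender side: copy 'len' bytes from 'src' into the segment. */
void send_copy(const char *src, size_t len)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, BUF_SIZE);
    char *shm = mmap(NULL, BUF_SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
    memcpy(shm, src, len);           /* first copy: src -> shared */
    munmap(shm, BUF_SIZE);
    close(fd);
}

/* Receiver side: copy 'len' bytes from the segment into 'dst'. */
void recv_copy(char *dst, size_t len)
{
    int fd = shm_open(SHM_NAME, O_RDONLY, 0600);
    char *shm = mmap(NULL, BUF_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    memcpy(dst, shm, len);           /* second copy: shared -> dst */
    munmap(shm, BUF_SIZE);
    close(fd);
}
```

The two memcpy calls are exactly the double copy that the kernel-assisted mechanisms studied in the paper try to eliminate.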


Parallel Computing | 2007

Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem

Darius Buntinas; Guillaume Mercier; William Gropp

This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
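
For illustration, a minimal sketch of the kind of lock-free single-producer/single-consumer queue on which Nemesis-style intranode channels rest; this is an analogue written with C11 atomics, not the actual Nemesis code:

```c
/* Sketch of a lock-free single-producer/single-consumer ring buffer.
 * Illustrative analogue only; Nemesis uses its own shared-memory
 * queue protocol. head and tail must both start at 0. */
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 256                      /* must be a power of two */

typedef struct {
    _Atomic unsigned head;             /* consumer position */
    _Atomic unsigned tail;             /* producer position */
    void *slot[QSIZE];
} spsc_queue;

bool spsc_enqueue(spsc_queue *q, void *msg)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QSIZE)
        return false;                  /* queue full */
    q->slot[t % QSIZE] = msg;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_dequeue(spsc_queue *q, void **msg)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return false;                  /* queue empty */
    *msg = q->slot[h % QSIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```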


International Conference on Parallel Processing | 2009

Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis

Darius Buntinas; Brice Goglin; David Goodell; Guillaume Mercier; Stéphanie Moreaud

The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of the use of a double-buffering strategy, common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first one uses the now widely available vmsplice Linux system call; the second one further improves performance thanks to a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.
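
Below is a minimal sketch of the vmsplice-based path, assuming a pipe already shared between sender and receiver (e.g., inherited across fork()); the KNEM path instead goes through the kernel module's own interface:

```c
/* Sketch of the vmsplice-based single-copy path: the sender maps its
 * buffer pages into a shared pipe, and the receiver's read() copies
 * them straight into the destination buffer, one copy in total.
 * In practice transfers are chunked to the pipe capacity, and the
 * sender must not reuse the buffer until the receiver has read it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sender: push 'len' bytes of 'buf' into the pipe without copying. */
ssize_t send_vmsplice(int pipe_wr, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return vmsplice(pipe_wr, &iov, 1, 0);
}

/* Receiver: this read() performs the single copy into 'dst'. */
ssize_t recv_read(int pipe_rd, void *dst, size_t len)
{
    return read(pipe_rd, dst, len);
}
```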


Lecture Notes in Computer Science | 2006

Implementation and shared-memory evaluation of MPICH2 over the Nemesis communication subsystem

Darius Buntinas; Guillaume Mercier; William Gropp

This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks as well as application benchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.


EuroMPI'11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011

Improving MPI applications performance on multicore clusters with rank reordering

Guillaume Mercier; Emmanuel Jeannot

Modern hardware architectures featuring multicore processors and a complex memory hierarchy raise challenges that need to be addressed by parallel application programmers. It is therefore tempting to adapt an application's communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks so as to match the application communication pattern to the hardware topology. Experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.
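
A minimal sketch of the function in question, with a hypothetical toy communication pattern; passing reorder = 1 is what permits the implementation to remap ranks:

```c
/* Minimal sketch: declare the application communication graph and
 * let the implementation reorder ranks (reorder = 1). The neighbor
 * lists here form a toy weighted ring over 4 processes; a real
 * application derives them from its communication pattern. */
#include <mpi.h>

void make_reordered_comm(MPI_Comm *newcomm)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sources[1]      = { rank };      /* each rank declares its own edges */
    int degrees[1]      = { 2 };
    int destinations[2] = { (rank + 1) % 4, (rank + 3) % 4 };
    int weights[2]      = { 10, 1 };     /* heavier edge should end up close */

    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees,
                          destinations, weights, MPI_INFO_NULL,
                          1 /* reorder */, newcomm);
}
```

After the call, each process must query its possibly new rank in *newcomm and redistribute its data accordingly.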


2016 First International Workshop on Communication Optimizations in HPC (COMHPC) | 2016

Topology- and affinity-aware hierarchical and distributed load-balancing in Charm++

Emmanuel Jeannot; Guillaume Mercier; François Tessier

The evolution of massively parallel supercomputers makes two issues in particular palpable: load imbalance and the poor management of data locality in applications. With the increase in the number of cores and the drastic decrease in the amount of memory per core, performance requirements imply taking particular care of load balancing and, as much as possible, of data locality. One means of addressing this locality issue relies on the placement of the processing entities, and load-balancing techniques are relevant for improving application performance. With large-scale platforms in mind, we developed a hierarchical and distributed algorithm whose aim is to perform topology-aware load balancing tailored for Charm++ applications. This algorithm is based on LibTopoMap for network awareness and on TREEMATCH to determine a relevant placement of the processing entities. We show that the proposed algorithm improves the overall execution time for both real applications and a synthetic benchmark. For the latter experiment, we show scalability up to one million processing entities.
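
As a toy illustration only (a centralized greedy heuristic, not the paper's hierarchical distributed algorithm built on TREEMATCH and LibTopoMap), the following sketch weighs current load against topology distance when placing a task:

```c
/* Toy sketch of topology-aware load balancing: greedily assign a task
 * to the processing element (PE) minimizing a mix of current load and
 * distance to the task's main communication partner. Centralized and
 * simplistic; the paper's algorithm is hierarchical and distributed. */
#define NPE 8

/* dist[a][b]: hop count between PEs a and b (from the topology);
 * load[p]: work already assigned to PE p; alpha: locality weight. */
int place_task(double task_load, int partner_pe,
               const int dist[NPE][NPE], double load[NPE], double alpha)
{
    int best = 0;
    double best_cost = load[0] + alpha * dist[partner_pe][0];
    for (int p = 1; p < NPE; p++) {
        double cost = load[p] + alpha * dist[partner_pe][p];
        if (cost < best_cost) { best_cost = cost; best = p; }
    }
    load[best] += task_load;
    return best;
}
```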


Proceedings of the 24th European MPI Users' Group Meeting | 2017

A hierarchical model to manage hardware topology in MPI applications

Emmanuel Jeannot; Farouk Mansouri; Guillaume Mercier

The MPI standard is a major contribution in the landscape of parallel programming. Since its inception in the mid-1990s, it has ensured portability and performance for parallel applications on a wide spectrum of machines and architectures. With the advent of multicore machines, understanding and taking into account the underlying physical topology and memory hierarchy have become of paramount importance. The MPI standard in its current state, however, and despite recent evolutions, is still unable to offer mechanisms to achieve this. In this paper, we detail several additions to the standard that give users tools to address hardware topology and data-locality issues while improving application performance.
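
For reference, the closest mechanism the current standard offers is MPI_Comm_split_type, which exposes only the node level of the hierarchy; a minimal sketch:

```c
/* Minimal sketch of the one hierarchy level MPI-3 already exposes:
 * MPI_COMM_TYPE_SHARED splits COMM_WORLD into per-node communicators.
 * The paper argues for extending this idea to deeper levels of the
 * hardware hierarchy (sockets, shared caches, NUMA nodes). */
#include <mpi.h>

void split_by_node(MPI_Comm *node_comm, int *node_rank)
{
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0 /* key: keep original rank order */,
                        MPI_INFO_NULL, node_comm);
    MPI_Comm_rank(*node_comm, node_rank);
}
```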


Cluster Computing and the Grid | 2006

Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem

Darius Buntinas; Guillaume Mercier; William Gropp


Archive | 2014

An Overview of Process Mapping Techniques and Algorithms in High-Performance Computing

Torsten Hoefler; Emmanuel Jeannot; Guillaume Mercier

Collaboration


Dive into Guillaume Mercier's collaboration.

Top Co-Authors

Darius Buntinas

Argonne National Laboratory

David Goodell

Argonne National Laboratory

Emmanuel Jeannot

French Institute for Research in Computer Science and Automation
