
Publication


Featured research published by Nicolas Maillard.


International Parallel and Distributed Processing Symposium | 2013

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

Thierry Gautier; Joao Vicente Ferreira Lima; Nicolas Maillard; Bruno Raffin

Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and accelerators, like GPUs. Programming such nodes is typically based on a combination of OpenMP and CUDA/OpenCL codes; scheduling relies on a static partitioning and cost model. We present the XKaapi runtime system for data-flow task programming on multi-CPU and multi-GPU architectures, which supports a data-flow task model and a locality-aware work stealing scheduler. XKaapi enables task multi-implementation on CPU or GPU and multi-level parallelism with different grain sizes. We show performance results on two dense linear algebra kernels, matrix product (GEMM) and Cholesky factorization (POTRF), to evaluate XKaapi on a heterogeneous architecture composed of two hexa-core CPUs and eight NVIDIA Fermi GPUs. Our conclusion is two-fold. First, fine grained parallelism and online scheduling achieve performance results as good as static strategies, and in most cases outperform them. This is due to an improved work stealing strategy that includes locality information; a very light implementation of the tasks in XKaapi; and an optimized search for ready tasks. Next, the multi-level parallelism on multiple CPUs and GPUs enabled by XKaapi led to a highly efficient Cholesky factorization. Using eight NVIDIA Fermi GPUs and four CPUs, we measure up to 2.43 TFlop/s on double precision matrix product and 1.79 TFlop/s on Cholesky factorization; and respectively 5.09 TFlop/s and 3.92 TFlop/s in single precision.
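The core of the data-flow task model described above is that the runtime derives dependencies from the data accesses each task declares, so tasks become ready as soon as their inputs are produced. A minimal host-side sketch of this idea in plain C++ (an illustration of the concept, not XKaapi's actual API; names like `Task` and `run_dataflow` are invented here):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Each task declares which "tiles" it reads and writes; the runtime derives
// dependencies from those accesses: a task must wait for the last writer of
// every tile it touches (read-after-write ordering).
struct Task {
    std::vector<int> reads, writes;   // tile ids accessed by this task
    std::function<void()> body;       // work to run once dependencies are met
};

// Execute tasks in an order that respects the derived dependencies. A real
// runtime would run ready tasks in parallel and let idle CPU/GPU workers
// steal them; this sketch uses simple sequential list scheduling.
void run_dataflow(std::vector<Task>& tasks) {
    std::map<int, int> last_writer;                 // tile id -> task index
    std::vector<std::vector<int>> deps(tasks.size());
    for (size_t i = 0; i < tasks.size(); ++i) {
        for (int t : tasks[i].reads)
            if (last_writer.count(t)) deps[i].push_back(last_writer[t]);
        for (int t : tasks[i].writes) last_writer[t] = (int)i;
    }
    std::vector<bool> done(tasks.size(), false);
    // Repeatedly run any task whose dependencies are all done
    // (assumes the access pattern forms an acyclic graph).
    for (size_t executed = 0; executed < tasks.size();) {
        for (size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int d : deps[i]) ready = ready && done[d];
            if (ready) { tasks[i].body(); done[i] = true; ++executed; }
        }
    }
}
```

In a tiled Cholesky factorization, for example, each POTRF/TRSM/GEMM kernel would be one such task reading and writing matrix tiles, and the dependency graph falls out of the declared accesses rather than being hand-coded.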


Symposium on Computer Architecture and High Performance Computing | 2012

Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs

Joao Vicente Ferreira Lima; Thierry Gautier; Nicolas Maillard; Vincent Danjean

The race for Exascale computing has naturally led current technologies to converge on multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this computing power, programmers have to solve the problem of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, existing software usually computes, before execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and cannot adapt the execution to variations of the system or of the application load. We propose an orthogonal solution: extensions to the XKaapi software stack that exploit the full performance of a multi-GPU system through asynchronous GPU tasks. XKaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task execution on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
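The overlap of data transfers with computation described above is done with concurrent GPU operations (asynchronous copies on CUDA streams); a host-side analogue of the same pipelining idea, with `std::async` standing in for the asynchronous transfer engine (`pipelined_sum` and the chunked data layout are invented for this sketch):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Software pipelining: while chunk i is being processed, the "transfer" of
// chunk i+1 is already in flight, so the worker never idles waiting for
// data -- the same principle the runtime applies with asynchronous GPU
// copies overlapping kernel execution.
int pipelined_sum(const std::vector<std::vector<int>>& chunks) {
    auto transfer = [](const std::vector<int>& c) { return c; }; // stand-in copy
    int total = 0;
    if (chunks.empty()) return total;
    // Prefetch the first chunk, then stay one transfer ahead of compute.
    std::future<std::vector<int>> next =
        std::async(std::launch::async, transfer, std::cref(chunks[0]));
    for (size_t i = 0; i < chunks.size(); ++i) {
        std::vector<int> cur = next.get();           // wait for chunk i
        if (i + 1 < chunks.size())                   // launch transfer of i+1
            next = std::async(std::launch::async, transfer,
                              std::cref(chunks[i + 1]));
        total += std::accumulate(cur.begin(), cur.end(), 0); // compute on i
    }
    return total;
}
```

On a real GPU the two halves of each iteration run on separate hardware engines (the copy engine and the compute engine), which is why the overlap can hide the PCI-Express transfer cost almost entirely.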


Archive | 2012

Improving Atmospheric Model Performance on a Multi-Core Cluster System

Carla Osthoff; Roberto P. Souto; Fabrício Vilasbôas; Pablo Javier Grunmann; Pedro L. Silva Dias; Francieli Zanon Boito; Rodrigo Virote Kassick; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Claudio Schepke; Nicolas Maillard; Jairo Panetta; Pedro Pais Lopes; Robert Walko

Numerical models have been used extensively in the last decades to understand and predict weather phenomena and the climate. In general, models are classified according to their operation domain: global (the entire Earth) and regional (country, state, etc.). Global models have a spatial resolution of about 0.2 to 1.5 degrees of latitude and therefore cannot represent the scale of regional weather phenomena very well. Their main limitation is computing power. On the other hand, regional models have higher resolution but are restricted to limited-area domains. Forecasting on a limited domain demands knowledge of future atmospheric conditions at the domain's borders. Therefore, regional models require the previous execution of global models.


13th Symposium on Computer Systems | 2012

Work Stealing on Hybrid Architectures

Vinicius Garcia Pinto; Nicolas Maillard

Parallel computing systems have been based on multicore CPUs and specialized coprocessors, like GPUs. Work stealing is a scheduling technique that has been used to distribute and redistribute the workload among resources in an efficient way. This work proposes, implements, and validates a scheduling approach based on work stealing in parallel systems with CPUs and GPUs used simultaneously. Results show that our approach, called WORMS, presents competitive performance when compared to the reference tool for multicore CPUs (Cilk). In the hybrid scenario, WORMS with multicore+GPU outperforms both WORMS and Cilk with multicore only, and also the GPU reference tool (Thrust).
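The work-stealing technique underlying both WORMS and XKaapi rests on a per-worker double-ended queue: the owner works LIFO at one end while idle workers steal FIFO from the other. A minimal sketch (the class name and coarse locking are illustrative, not the WORMS implementation, which would use a lock-free deque):

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Per-worker task deque for work stealing: the owning worker pushes and
// pops at the bottom (LIFO, good cache locality), while thieves steal from
// the top, taking the oldest -- and typically largest -- piece of work.
class WorkStealingDeque {
    std::deque<int> tasks_;   // task ids; real tasks would carry closures
    std::mutex m_;            // coarse lock for clarity; real deques are lock-free
public:
    void push(int t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(t);
    }
    std::optional<int> pop() {        // called only by the owner
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        int t = tasks_.back(); tasks_.pop_back(); return t;
    }
    std::optional<int> steal() {      // called by idle workers
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        int t = tasks_.front(); tasks_.pop_front(); return t;
    }
};
```

Stealing from the opposite end is the design choice that keeps owner and thieves mostly out of each other's way and hands thieves coarse-grained work, which matters even more when the thief is a GPU worker amortizing a transfer.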


13th Symposium on Computer Systems | 2012

Exploring Multi-level Parallelism in Atmospheric Applications

Claudio Schepke; Nicolas Maillard

The forecast precision of climatological models is limited by the computing power and time available for execution. The more and the faster the processors used in the computation, the higher the resolution of the mesh adopted to represent the Earth's atmosphere can be, and consequently the more accurate the numerical forecasts. With the introduction of multi-core processors and GPU boards, computer architectures have many parallel layers. Today, there is parallelism inside a processor, among processors, and among computers. In order to best utilize the performance of these computers, it is necessary to consider all parallel levels when distributing a concurrent application. However, no parallel programming interface abstracts all these parallel levels well. In this context, this work proposes the use of mixed programming interfaces to improve the performance of atmospheric models. The parallel execution of simulations shows that the use of GPUs and multi-core CPUs in distributed systems can considerably reduce the execution time of climatological applications.


Computational Science and Engineering | 2008

ICE: Managing Multiple Clusters Using Web Services

Rodrigo da Rosa Righi; Laércio Lima Pilla; Alexandre Carissimi; Nicolas Maillard; Philippe Olivier Alexandre Navaux

High-performance computing architectures are mainly represented by clusters and grids. Usually, clusters form a controlled environment, using specific tools that work well to obtain performance. Grids present a set of machines that work as a single image and deal with issues like resource discovery and heterogeneity that are not common in clusters. Analyzing this context, we identify a situation between clusters and grids where users have isolated access to different clusters. This scenario usually forces users to learn the details of how to interact with the tools of each cluster they have access to. Managing this scenario is the aim of ICE - Integrated Cluster Environment. ICE is based on a service-oriented architecture (SOA) that hides the complexity of dealing with multiple clusters. It offers a Web portal to achieve access transparency, and its framework provides extensibility, facilitating the management of cluster tools and their functionalities. This paper describes ICE, its testbed environment, results, and related work.


Cluster Computing and the Grid | 2006

ICE: A Service Oriented Approach to Uniform the Access and Management of Cluster Environments

Clarissa Cassales Marquezan; Rodrigo da Rosa Righi; Lucas Mello Schnorr; Alexandre Carissimi; Nicolas Maillard; P.O. Alexandre Navaux

This paper presents the Integrated Cluster Environment - ICE, which aims to provide an efficient and easy way to interact with several clusters. Generally, when users and cluster administrators have access to several clusters, they have to deal with a different set of tools in each cluster: for example, different job schedulers, resource monitoring, and administrative configuration issues. Sometimes these tools are used to perform similar tasks but through different interfaces with different parameters. The ICE environment provides an alternative that makes access to different cluster tools uniform, allowing users and administrators to interact with and manage them in a transparent manner. Besides transparency, it also provides extensibility, since tools already installed in clusters can be managed by the ICE environment simply by adopting the designed integration framework. Uniformity, transparency, and extensibility are achieved through the ICE architecture and the use of Web services as the environment middleware. This paper describes the ICE architecture, the developed prototype, ICE usage, and a comparison with related work.


International Journal of Information Technology, Communications and Convergence | 2012

Atmospheric models hybrid OpenMP/MPI implementation multicore cluster evaluation

Carla Osthoff; Francieli Zanon Boito; Rodrigo Virote Kassick; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Claudio Schepke; Jairo Panetta; Pablo Javier Grunmann; Nicolas Maillard; Pedro L. Silva Dias; Robert Walko

Atmospheric models usually demand high processing power and generate large amounts of data. As the degree of parallelism grows, I/O operations may become the major factor impacting their performance. This work shows that a hybrid MPI/OpenMP implementation can improve the performance of the Ocean-Land-Atmosphere Model (OLAM) in a multicore cluster environment. We show that the hybrid MPI/OpenMP version of OLAM decreases the number of output files, resulting in better performance for I/O operations. We have evaluated OLAM on the parallel file system PVFS and shown that storing the files on PVFS results in lower performance than using the local disks of the cluster nodes, as a consequence of file creation and network concurrency. We have also shown that further parallel optimisations should be included in the hybrid version in order to improve the parallel execution time of OLAM.


6th Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG) | 2013

Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures

Thierry Gautier; Joao V. F. Lima; Nicolas Maillard; Bruno Raffin


WISP | 2011

Improving Performance on Atmospheric Models through a Hybrid OpenMP/MPI Implementation

Carla Osthoff Ferreira de Barros; Pablo Javier Grunmann; Francieli Zanon Boito; Rodrigo Virote Kassick; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux; Claudio Schepke; Jairo Panetta; Nicolas Maillard; Pedro L. Silva Dias; Robert Walko

Collaboration


Dive into Nicolas Maillard's collaborations.

Top Co-Authors

Philippe Olivier Alexandre Navaux
Universidade Federal do Rio Grande do Sul

Claudio Schepke
Universidade Federal do Rio Grande do Sul

Francieli Zanon Boito
Universidade Federal do Rio Grande do Sul

Jairo Panetta
National Institute for Space Research

Joao V. F. Lima
Universidade Federal do Rio Grande do Sul