Sylvain Jubertie | Researchain

Publication

Featured researches published by Sylvain Jubertie.

international conference on conceptual structures | 2013

OSL: An Algorithmic Skeleton Library with Exceptions

Joeffrey Legaux; Frédéric Loulergue; Sylvain Jubertie

Exception handling is a traditional and natural mechanism to manage errors and events that disrupt the normal flow of program instructions. In most concurrent or parallel systems, exception handling is done locally or sequentially, and cannot guarantee the global coherence of the system after an exception is caught. Working with a structured parallel model is an advantage in this respect. Algorithmic skeletons, which are patterns of parallel algorithms on distributed data structures, offer such a structured model. However, very few algorithmic skeleton libraries provide a specific parallel exception mechanism, and no C++-based library does. In this paper we propose the design of an exception mechanism for the C++ Orléans Skeleton Library that ensures the global coherence of the system after exceptions are caught. We explain our design choices, measure the performance penalty of its use, and illustrate how to use this mechanism purposefully to extract results in the course of some algorithms.

international conference on computational science | 2017

A Multi-level Optimization Strategy to Improve the Performance of Stencil Computation

Gauthier Sornet; Fabrice Dupros; Sylvain Jubertie

Stencil computation represents an important numerical kernel in scientific computing. Leveraging multi-core or many-core parallelism to optimize such operations is a major challenge because of both the bandwidth demand and the low arithmetic intensity. The situation is worsened by the complexity of current architectures and the potential impact of various mechanisms (cache memory, vectorization, compilation). In this paper, we describe a multi-level optimization strategy that combines manual vectorization, space tiling and stencil composition. A major effort of this study is to compare our results with the Pochoir framework. We evaluate our methodology with a set of three different compilers (Intel, Clang and GCC) on two recent generations of Intel multi-core platforms. Our results show a good match with the theoretical performance models (i.e. roofline models). We also outperform Pochoir by a factor of 2.5 in the best case.
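To make the idea of space tiling concrete, here is a minimal sketch in Python. It is not taken from the paper: the 5-point averaging stencil, the function names `sweep_naive` and `sweep_tiled`, and the tile sizes are all assumptions chosen for illustration. The tiled sweep visits the same cells as the plain sweep, only grouped into small rectangular blocks so that, in a compiled implementation, each block could stay in cache while it is processed.

```python
def sweep_naive(grid):
    """One sweep of a 5-point averaging stencil; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return out


def sweep_tiled(grid, tile_rows=2, tile_cols=2):
    """Same sweep, but the interior is traversed tile by tile (space tiling)."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    # Outer loops enumerate the top-left corner of each tile of interior cells.
    for ti in range(1, rows - 1, tile_rows):
        for tj in range(1, cols - 1, tile_cols):
            # Inner loops stay inside one tile, clipped at the interior boundary.
            for i in range(ti, min(ti + tile_rows, rows - 1)):
                for j in range(tj, min(tj + tile_cols, cols - 1)):
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out
```

Because every output cell is computed from the unmodified input grid, the traversal order does not change the result; only the memory access pattern differs. Python itself will not show the cache benefit, so this sketch only illustrates the loop structure.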

international conference on algorithms and architectures for parallel processing | 2012

Experiments in parallel matrix multiplication on multi-core systems

Joeffrey Legaux; Sylvain Jubertie; Frédéric Loulergue

Matrix multiplication is an example of an application that is easy both to specify and to implement in a simple way. There exist numerous sophisticated algorithms and very efficient complex implementations. In this study we are instead interested in the design and programming overhead relative to the performance benefits. Starting from the naive sequential implementation, the implementation is first optimised by improving data accesses, then by using the vector units of modern processors, and we finally propose a parallel version for multi-core architectures. The proposed optimisations are tested on several architectures, and the trade-off between software complexity and efficiency is evaluated using Halstead metrics.
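The first optimisation step mentioned in the abstract, improving data accesses, can be illustrated with a loop interchange. The sketch below is an assumption about what such a step might look like, not the authors' code; the names `matmul_ijk` and `matmul_ikj` are hypothetical. In the naive order the innermost loop walks down a column of `b`, while in the interchanged order it walks along rows of both `b` and `c`, which is the contiguous direction for row-major storage.

```python
def matmul_ijk(a, b):
    """Naive triple loop: the innermost loop reads b column-wise (strided access)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_ikj(a, b):
    """Interchanged loops: the innermost loop reads b and writes c row-wise."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for k in range(m):
            a_ik = a[i][k]
            row_b = b[k]
            for j in range(p):
                row_c[j] += a_ik * row_b[j]
    return c
```

Both functions accumulate each entry of `c` over `k` in the same order, so they return identical results. In a compiled language the second form is also the one a compiler can usually vectorise more easily, since the inner loop touches consecutive elements.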

network and parallel computing | 2007

Performance prediction for mappings of distributed applications on PC clusters

Sylvain Jubertie; Emmanuel Melin

Distributed applications running on clusters may be composed of several components with very different performance requirements. The FlowVR middleware allows the developer to deploy such applications and to define communication and synchronization schemes between components without modifying the code. While it eases the creation of mappings, FlowVR does not come with a performance model. Consequently, the optimization of mappings is left to the developer's skills. This task becomes difficult as the number of components and cluster nodes grows, and even more complex if the cluster is composed of heterogeneous nodes and networks. In this paper we propose an approach to predict the performance of FlowVR distributed applications given a mapping and a cluster. We also give some advice to the developer on how to create efficient mappings and avoid configurations that may lead to unexpected performance. Since the FlowVR model is very close to the underlying models of many distributed codes, our approach can be useful for all designers of such applications.

international conference on high performance computing and simulation | 2014

Development effort and performance trade-off in high-level parallel programming

Joeffrey Legaux; Frédéric Loulergue; Sylvain Jubertie

Research on high-level parallel programming approaches systematically evaluates the performance of applications written using these approaches and informally argues that high-level parallel programming languages or libraries increase the productivity of programmers. In this paper we present a methodology for evaluating the trade-off between programming effort and performance of applications developed using different programming models. We apply this methodology to several implementations of a function solving the all nearest smaller values problem. The high-level implementation is based on a new version of the BSP homomorphism algorithmic skeleton.
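For readers unfamiliar with the all nearest smaller values problem, the following is a sequential reference sketch. It uses the classic stack-based linear-time algorithm, which is an assumption for illustration only and is not the BSP homomorphism implementation the paper describes; the function names are hypothetical. For every element it reports the closest strictly smaller value on the left and on the right, or `None` when there is none.

```python
def nearest_smaller_left(values):
    """For each element, the nearest strictly smaller value to its left (or None)."""
    result = []
    stack = []  # candidates, strictly increasing from bottom to top
    for v in values:
        # Values >= v can never be the answer for v or for any later element.
        while stack and stack[-1] >= v:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(v)
    return result


def all_nearest_smaller_values(values):
    """Pairs (left, right) of nearest strictly smaller values for each element."""
    left = nearest_smaller_left(values)
    # The right-hand answers are the left-hand answers of the reversed sequence.
    right = nearest_smaller_left(values[::-1])[::-1]
    return list(zip(left, right))
```

Each element is pushed and popped at most once, so the sequential cost is linear in the input length. A parallel version has to combine partial results across block boundaries, which is where a structured skeleton becomes useful.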

european conference on parallel processing | 2008

Mapping Heterogeneous Distributed Applications on Clusters

Sylvain Jubertie; Emmanuel Melin; Jérémie Vautard; Arnaud Lallouet

The performance of distributed applications largely depends on the mapping of their components onto the underlying architecture. On one side, component-based approaches provide an abstraction suitable for development, but on the other side, actual hardware becomes more complex and heterogeneous every day. Despite this increasing gap, mapping components to processors and networks is commonly done manually and is mainly a matter of expertise. Worse, the amount of effort required for this task rarely leaves room to consider optimal hardware use or sensitivity analysis of data scaling. In this paper, we rely on a formal and experimentally sound model of performance and propose a constraint-programming-based framework to find consistent and efficient mappings of an application onto an architecture. Experiments show that an optimal mapping for a medium-sized application can be found in a few seconds.

acm sigplan symposium on principles and practice of parallel programming | 2018

Vectorization of a spectral finite-element numerical kernel

Sylvain Jubertie; Fabrice Dupros; Florent De Martin

In this paper, we present an optimized implementation of the finite-element method numerical kernel for SIMD vectorization. A typical application is the modelling of seismic wave propagation. In this case, the computations at the element level are generally based on nested loops where the memory accesses are non-contiguous. Moreover, the back and forth between the element level and the global level (e.g., the assembly phase) is a serious obstacle to automatic vectorization by compilers and to efficient reuse of data at the cache memory levels. This is particularly true when the problem under study relies on an unstructured mesh. The application proxies used for our experiments were extracted from the EFISPEC code, which implements the spectral finite-element method to solve the elastodynamic equations. We underline that the intra-node performance may be further improved. Additionally, we show that standard compilers such as GNU GCC, Clang and Intel ICC are unable to perform automatic vectorization even when the nested loops are reorganized or when SIMD pragmas are added. Because of the irregular memory access pattern, we introduce a dedicated strategy to squeeze the maximum performance out of the SIMD units. Experiments are carried out on Intel Broadwell and Skylake platforms, which offer AVX2 and AVX-512 SIMD units respectively. We believe that our vectorization approach may be generic enough to be adapted to other codes.

international conference on high performance computing and simulation | 2013

Managing arbitrary distributions of arrays in Orléans Skeleton Library

Joeffrey Legaux; Frédéric Loulergue; Sylvain Jubertie

Structured parallel models such as algorithmic skeletons offer a global view of the parallel program, in contrast with the fragmented view of the SPMD style. This makes programs easier to write and to read for users, and offers additional opportunities for optimisation by the libraries, compilers and/or run-time systems. Algorithmic skeletons are, or can be seen as, patterns or higher-order functions implemented in parallel, often manipulating distributed data structures. Orléans Skeleton Library (OSL) is a library of parallel algorithmic skeletons, written in C++ on top of MPI, which uses meta-programming techniques for optimisation. Such libraries often have no or limited support for arbitrary distributions of the data structures. In this paper we detail the new OSL skeletons used to manage arbitrary distributions of distributed arrays. We present a parallel regular sampling sort as an example of an application that requires such skeletons.
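The reason a regular sampling sort needs arbitrary distributions can be seen in a sequential simulation of its steps. The sketch below is an assumption for illustration and does not use the OSL API: the function name `regular_sampling_sort`, the parameter `p` (number of simulated processes) and the sampling rule are hypothetical choices. Each "process" is simply a Python list.

```python
import bisect
from heapq import merge


def regular_sampling_sort(data, p=4):
    """Sequentially simulate a regular sampling sort over p blocks."""
    n = len(data)
    if n == 0 or p <= 1:
        return sorted(data)
    # 1. Evenly distribute the data into blocks and sort each block locally.
    size = -(-n // p)  # ceiling division
    blocks = [sorted(data[i:i + size]) for i in range(0, n, size)]
    p = len(blocks)  # fewer blocks than requested if the data is short
    # 2. Take up to p regularly spaced samples from every block.
    samples = []
    for block in blocks:
        step = max(1, len(block) // p)
        samples.extend(block[::step][:p])
    samples.sort()
    # 3. Choose p - 1 pivots at regular positions among the sorted samples.
    pivots = [samples[(k * len(samples)) // p] for k in range(1, p)]
    # 4. Cut every block at the pivots; piece k is destined for process k.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        cuts = [0] + [bisect.bisect_right(block, pv) for pv in pivots] + [len(block)]
        for k in range(p):
            buckets[k].append(block[cuts[k]:cuts[k + 1]])
    # 5. Each process merges the sorted pieces it received.
    result = []
    for pieces in buckets:
        result.extend(merge(*pieces))
    return result
```

After step 4 the buckets generally hold different numbers of elements, since their sizes depend on where the pivots fall. The data starts in an even block distribution and ends in an uneven one, which is the kind of redistribution the abstract refers to.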

international conference on conceptual structures | 2014