Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Joel Falcou is active.

Publication


Featured research published by Joel Falcou.


Parallel Computing | 2006

QUAFF: efficient C++ design for parallel skeletons

Joel Falcou; Jocelyn Sérot; Thierry Chateau; Jean-Thierry Lapresté

We present QUAFF, a new skeleton-based parallel programming library. Its main originality is to rely on C++ template meta-programming techniques to achieve high efficiency. In particular, by performing most skeleton instantiation and optimization at compile time, QUAFF keeps the overhead traditionally associated with object-oriented implementations of skeleton-based parallel programming libraries very small. This does not come at the expense of expressivity, as we demonstrate with several applications, including a full-fledged, realistic real-time vision application.
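The compile-time composition the abstract describes can be sketched as follows. This is a minimal illustration of the technique with invented names, not QUAFF's actual API:

```cpp
#include <utility>

// Sketch: a pipeline "skeleton" whose stage composition is resolved at
// compile time via templates, so run() compiles down to the inlined stage
// calls -- the idea QUAFF uses to avoid the virtual-dispatch overhead of
// object-oriented skeleton libraries. Names here are hypothetical.
template <typename F>
struct Seq {               // wraps one sequential stage
  F f;
  template <typename In>
  auto run(In&& in) const { return f(std::forward<In>(in)); }
};

template <typename S1, typename S2>
struct Pipe {              // composes two skeletons: S2 after S1
  S1 s1;
  S2 s2;
  template <typename In>
  auto run(In&& in) const { return s2.run(s1.run(std::forward<In>(in))); }
};

template <typename F>
Seq<F> seq(F f) { return {f}; }

template <typename S1, typename S2>
Pipe<S1, S2> pipe(S1 a, S2 b) { return {a, b}; }
```

Because `Pipe::run` is a template whose stages the compiler can inline, composing skeletons this way adds no per-call dispatch cost, which is the point the abstract makes.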


International Conference on Parallel Processing | 2011

Spherical harmonic transform with GPUs

Ioan Ovidiu Hupca; Joel Falcou; Laura Grigori; Radek Stompor

We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran90 routine included in a publicly available parallel package, s2hat. We focus our attention on two major sequential steps involved in the transform's computation, retaining the efficient parallel framework of the original code. We detail optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We present performance comparisons of a single CPU plus GPU unit with the s2hat code running on either one or four processors. In particular, we find that the latest generation of GPUs, such as the NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to s2hat executed on one core, and by as much as 5.5 times with respect to s2hat on 4 cores, with the overall performance being limited by the fast Fourier transforms. The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.


International Conference on Computational Science | 2013

A parallel solver for incompressible fluid flows

Yushan Wang; Marc Baboulin; Jack J. Dongarra; Joel Falcou; Yann Fraigneau; Olivier P. Le Maître

The Navier-Stokes equations describe a large class of fluid flows but are difficult to solve analytically because of their nonlinearity. We present in this paper a parallel solver for the 3-D Navier-Stokes equations of incompressible unsteady flows with constant coefficients, discretized by the finite difference method. We apply the prediction-projection method which transforms the Navier-Stokes equations into three Helmholtz equations and one Poisson equation. For each Helmholtz system, we apply the Alternating Direction Implicit (ADI) method resulting in three tridiagonal systems. The Poisson equation is solved using partial diagonalization which transforms the Laplacian operator into a tridiagonal one. We describe an implementation based on MPI where the computations are performed on each subdomain and information is exchanged on the interfaces, and where the tridiagonal system solutions are accelerated using vectorization techniques. We present performance results on a current multicore system.


International Journal of Parallel Programming | 2013

Parallel Smith-Waterman Comparison on Multicore and Manycore Computing Platforms with BSP++

Khaled Hamidouche; Fernando Machado Mendonca; Joel Falcou; Alba Cristina Magalhaes Alves de Melo; Daniel Etiemble

Biological Sequence Comparison is an important operation in Bioinformatics that is often used to relate organisms. Smith and Waterman proposed an exact algorithm (SW) that compares two sequences in quadratic time and space. Due to high computing power and memory requirements, SW is usually executed on High Performance Computing (HPC) platforms such as multicore clusters and CellBEs. Since HPC architectures exhibit very different hardware characteristics, porting an application to them is an error-prone, time-consuming task. BSP++ is an implementation of BSP that aims to facilitate parallel programming, reducing the effort to port code. In this paper, we propose and evaluate a parallel BSP++ strategy to execute SW on multiple multicore and manycore platforms. Given the same base code, we generated MPI, OpenMP, MPI/OpenMP, CellBE and MPI/CellBE versions, which were executed on heterogeneous platforms with up to 6,144 cores. The results obtained with real DNA sequences show that the performance of our versions is comparable to the hand-tuned strategies in the literature, evidencing the appropriateness and flexibility of our approach.
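For reference, the quadratic-time Smith-Waterman recurrence that the BSP++ versions parallelize looks like this in sequential form (the scoring parameters are illustrative defaults, not the paper's):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch: sequential Smith-Waterman local alignment score with a linear
// gap penalty. H[i][j] is the best score of an alignment ending at
// s[i-1], t[j-1]; the 0 in the max makes the alignment local.
int smith_waterman(const std::string& s, const std::string& t,
                   int match = 2, int mismatch = -1, int gap = -1) {
  std::vector<std::vector<int>> H(s.size() + 1,
                                  std::vector<int>(t.size() + 1, 0));
  int best = 0;
  for (std::size_t i = 1; i <= s.size(); ++i)
    for (std::size_t j = 1; j <= t.size(); ++j) {
      int diag = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
      H[i][j] = std::max({0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap});
      best = std::max(best, H[i][j]);
    }
  return best;  // score of the best local alignment
}
```

The wavefront dependence of H (each cell needs its left, upper, and diagonal neighbors) is what makes SW a natural fit for the superstep-structured parallelism BSP++ provides.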


International Conference on Parallel Architectures and Compilation Techniques | 2009

Algorithmic Skeletons within an Embedded Domain Specific Language for the CELL Processor

Tarik Saidani; Joel Falcou; Claude Tadonki; Lionel Lacassagne; Daniel Etiemble

Efficiently using the hardware capabilities of the Cell processor, a heterogeneous chip multiprocessor that uses several levels of parallelism to deliver high performance, while being able to reuse legacy code, is a real challenge for application developers. We propose to use Generative Programming, and more precisely template meta-programming, to design a Domain Specific Embedded Language using algorithmic skeletons to generate applications based on a high-level mapping description. The method is easy to use by developers and delivers performance close to that of optimized hand-written code, as shown on various benchmarks ranging from simple BLAS kernels to image processing applications.


International Conference on Computational Science | 2016

High-performance Tensor Contractions for GPUs

Ahmad Abdelfattah; Marc Baboulin; Veselin Dobrev; Jack J. Dongarra; Christopher Earl; Joel Falcou; Azzam Haidar; Ian Karlin; Tzanio V. Kolev; Ian Masliah; Stanimire Tomov

We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions, plus application-specific knowledge, we demonstrate close to peak performance results. In particular, to accelerate large scale tensor-formulated high-order finite element method (FEM) simulations, which is the main focus and motivation for this work, we represent contractions as tensor index reordering plus matrix-matrix multiplications (GEMMs). This is a key factor to achieve algorithmically many-fold acceleration (vs. not using it) due to possible reuse of data loaded in fast memory. In addition to using this context knowledge, we design tensor data-structures, tensor algebra interfaces, and new tensor contraction algorithms and implementations to achieve 90+% of a theoretically derived peak on GPUs. On a K40c GPU for contractions resulting in GEMMs on square matrices of size 8 for example, we are 2.8 times faster than CUBLAS, and 8.5 times faster than MKL on 16 cores of Intel Xeon E5-2670 (Sandy Bridge) 2.60GHz CPUs. Finally, we apply autotuning and code generation techniques to simplify tuning and provide an architecture-aware, user-friendly interface.
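The idea of casting a contraction as index reordering plus GEMMs, batched over many small independent problems, can be illustrated in plain C++ (a didactic sketch, not the paper's GPU kernels):

```cpp
#include <cstddef>
#include <vector>

// Sketch: a contraction C[i][j] = sum_k A[i][k] * B[k][j] expressed as a
// small square GEMM, then batched over many independent problems -- the
// pattern the framework above maps onto GPU thread blocks. Row-major
// storage; illustrative CPU code only.
using Mat = std::vector<double>;  // row-major n x n matrix

Mat gemm(const Mat& A, const Mat& B, std::size_t n) {
  Mat C(n * n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)     // k-middle order reuses A[i][k]
      for (std::size_t j = 0; j < n; ++j)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
  return C;
}

// Batched interface: one call performs many small independent GEMMs.
std::vector<Mat> batched_gemm(const std::vector<Mat>& As,
                              const std::vector<Mat>& Bs, std::size_t n) {
  std::vector<Mat> Cs(As.size());
  for (std::size_t b = 0; b < As.size(); ++b) Cs[b] = gemm(As[b], Bs[b], n);
  return Cs;
}
```

Batching matters because an 8x8 GEMM alone cannot saturate a GPU; launching thousands of them in one call amortizes launch overhead and keeps every multiprocessor busy.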


European Conference on Parallel Processing | 2016

High-Performance Matrix-Matrix Multiplications of Very Small Matrices

Ian Masliah; Ahmad Abdelfattah; Azzam Haidar; Stanimire Tomov; Marc Baboulin; Joel Falcou; Jack J. Dongarra

The use of the general dense matrix-matrix multiplication GEMM is fundamental for obtaining high performance in many scientific computing applications. GEMMs for small matrices of sizes less than 32, however, are not sufficiently optimized in existing libraries. In this paper we consider the case of many small GEMMs on either CPU or GPU architectures. This is a case that often occurs in applications like big data analytics, machine learning, high-order FEM, and others. The GEMMs are grouped together in a single batched routine. We present algorithms and optimization techniques specialized for these cases to obtain performance that is within 90% of the optimal. We show that these results outperform currently available state-of-the-art implementations and vendor-tuned math libraries.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2014

Boost.SIMD: generic programming for portable SIMDization

Pierre Esterie; Joel Falcou; Mathias Gaunard; Jean-Thierry Lapresté

SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level and provide significant accelerations, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. In this poster, we present Boost.SIMD, a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model.
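The programming model can be conveyed with a toy pack type; this is illustrative only, as Boost.SIMD's real pack maps onto hardware SIMD registers rather than a plain array:

```cpp
#include <array>
#include <cstddef>

// Toy sketch of the pack-based programming model (not Boost.SIMD's actual
// API): a fixed-width "pack" with overloaded operators lets a kernel be
// written once, generically, and re-instantiated for whatever SIMD width
// the target hardware supports.
template <typename T, std::size_t N>
struct pack {
  std::array<T, N> v{};
  friend pack operator+(pack a, pack b) {
    pack r;
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
  }
  friend pack operator*(pack a, pack b) {
    pack r;
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
  }
};

// Generic axpy-style kernel: the same source targets a 4-wide SSE-like or
// an 8-wide AVX-like pack just by changing N at instantiation.
template <typename P>
P axpy(P a, P x, P y) { return a * x + y; }
```

In the real library the element-wise loops are replaced by intrinsics selected at compile time, so the abstraction costs nothing at run time.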


Proceedings of the Fourth International Workshop on High-level Parallel Programming and Applications | 2010

Hybrid bulk synchronous parallelism library for clustered smp architectures

Khaled Hamidouche; Joel Falcou; Daniel Etiemble

This paper presents the design and implementation of BSP++, a C++ parallel programming library based on the Bulk Synchronous Parallelism model to perform high-performance computing on both shared-memory (SMP) and distributed-memory architectures using OpenMP and MPI. We show how C++ support for genericity provides a functional and intuitive user interface that still delivers a large fraction of the performance of hand-written code. We show how the library structure and programming model allow simple hybrid programming by composing BSP super-steps and letting BSP++ handle the middleware interface. The performance and scalability of this approach are then assessed by various benchmarks of classic HPC application kernels and distributed algorithms on various hybrid machines, including a subset of the GRID5000 grid.
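The BSP model underlying the library alternates a local computation phase with a global exchange and synchronization barrier. A toy sequential sketch of one superstep, with hypothetical names rather than BSP++'s API:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Toy sketch of a BSP superstep (hypothetical names, not BSP++'s API):
// every "processor" p runs the same local computation on its own data;
// a barrier plus exchange then combines the results before the next
// superstep begins.
std::vector<int> superstep(const std::vector<int>& local_data,
                           const std::function<int(int)>& compute) {
  std::vector<int> out(local_data.size());
  for (std::size_t p = 0; p < local_data.size(); ++p)  // local phase
    out[p] = compute(local_data[p]);
  return out;  // in a real library, communication would follow here
}

int exchange_reduce(const std::vector<int>& partials) {
  // Global exchange + barrier, here collapsed to a simple reduction.
  return std::accumulate(partials.begin(), partials.end(), 0);
}
```

Because the same superstep structure can be mapped to threads (OpenMP) or processes (MPI), nesting one inside the other is what gives the hybrid composition the abstract describes.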


Journal of Parallel and Distributed Computing | 2014

The numerical template toolbox

Pierre Esterie; Joel Falcou; Mathias Gaunard; Jean-Thierry Lapresté; Lionel Lacassagne

The design and implementation of high-level tools for parallel programming is a major challenge as the complexity of modern architectures increases. Domain Specific Languages (DSLs) have been proposed as a solution to facilitate this design, but few of those DSLs actually take full advantage of said parallel architectures. In this paper, we propose a library-based solution by designing a C++ DSEL using generative programming: NT2. By adapting generative programming idioms so that architecture specificities become mere parameters of the code generation process, we demonstrate that our library can deliver high performance while featuring a high-level API and being easy to extend to new architectures. We propose a design strategy for DSELs that includes architecture information in the process, and an implementation of a C++ DSEL for scientific computing using this strategy. The implementation demonstrates extensibility on new architectures, and benchmarks show that hierarchical architecture features are properly used.

Collaboration


Dive into Joel Falcou's collaboration.

Top Co-Authors

Ian Masliah

University of Paris-Sud

Jocelyn Sérot

Blaise Pascal University

Azzam Haidar

University of Tennessee