
Publication


Featured research published by Patrick Carribault.


ACM Symposium on Parallel Algorithms and Architectures | 2008

Scheduling strategies for optimistic parallel execution of irregular programs

Milind Kulkarni; Patrick Carribault; Keshav Pingali; Ganesh Ramanarayanan; Bruce Walter; Kavita Bala; L. Paul Chew

Recent application studies have shown that many irregular applications have a generalized data parallelism that manifests itself as iterative computations over worklists of different kinds. In general, there are complex dependencies between iterations. These dependencies cannot be elucidated statically because they depend on the inputs to the program; thus, optimistic parallel execution is the only tractable approach to parallelizing these applications. We have built a system called Galois that supports this style of parallel execution. Its main features are (i) set iterators for expressing worklist-based data parallelism, and (ii) a runtime system that performs optimistic parallelization of these iterators, detecting conflicts and rolling back computations as needed. Our work builds on the Galois system, and it addresses the problem of scheduling iterations of set iterators on multiple cores. The policy used by the base Galois system is to assign an iteration to a core whenever it needs work to do, but we show in this paper that this policy is not optimal for many applications. We also argue that OpenMP-style DO-ALL loop scheduling directives such as chunked and guided self-scheduling are too simplistic for irregular programs. These difficulties led us to develop a general scheduling framework for irregular problems; OpenMP-style scheduling strategies are special cases of this general approach. We also provide hooks into our framework, allowing the programmer to leverage application knowledge to further tune a schedule for a particular application. To evaluate this framework, we implemented it as an extension of the Galois system. We then tested the system using five real-world, irregular, data-parallel applications. Our results show that (i) the optimal scheduling policy can be different for different applications and often leverages application-specific knowledge and (ii) implementing these schedules in the Galois system is relatively straightforward.
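The OpenMP-style DO-ALL policies the abstract argues are too simplistic can be sketched by their chunk-size arithmetic. This is a minimal Python illustration (the function names are hypothetical, not from the Galois system): "chunked" hands out fixed-size blocks, while "guided" self-scheduling shrinks each grab as work runs out.

```python
def chunked_schedule(n_iters, chunk):
    """OpenMP-style static chunking: fixed-size blocks handed out in order."""
    return [(lo, min(lo + chunk, n_iters)) for lo in range(0, n_iters, chunk)]

def guided_schedule(n_iters, n_threads, min_chunk=1):
    """OpenMP-style guided self-scheduling: each grab takes a fraction of the
    remaining iterations, so chunk sizes shrink toward min_chunk."""
    blocks, lo = [], 0
    while lo < n_iters:
        size = max((n_iters - lo) // n_threads, min_chunk)
        blocks.append((lo, lo + size))
        lo += size
    return blocks
```

Both policies fix the assignment of iterations up front, which is exactly why they fit regular loops but not worklists whose iterations conflict in input-dependent ways.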


International Conference on Parallel Architectures and Compilation Techniques | 2005

Deep jam: conversion of coarse-grain parallelism to instruction-level and vector parallelism for irregular applications

Patrick Carribault; Albert Cohen; William Jalby

A number of compute-intensive applications suffer from performance loss due to the lack of instruction-level parallelism in sequences of dependent instructions. This is particularly true on wide-issue architectures with large register banks, when the memory hierarchy (locality and bandwidth) is not the dominant bottleneck. We consider two real applications, from computational biology and from cryptanalysis, characterized by long sequences of dependent instructions, irregular control flow and intricate scalar and array dependence patterns. Although these applications exhibit excellent memory locality and branch-prediction behavior, state-of-the-art loop transformations and back-end optimizations are unable to exploit much instruction-level parallelism. We show that good speedups can be achieved through deep jam, a new transformation of the program control- and data-flow. Deep jam combines scalar and array renaming with a generalized form of recursive unroll-and-jam; it brings together independent instructions across irregular control structures, removing memory-based dependences. This optimization contributes to the extraction of fine-grain parallelism in irregular applications. We propose a feedback-directed deep jam algorithm that selects a jamming strategy as a function of the architecture and application characteristics.


Symposium on Code Generation and Optimization | 2007

Loop Optimization using Hierarchical Compilation and Kernel Decomposition

Denis Barthou; Sébastien Donadio; Patrick Carribault; Alexandre Duchateau; William Jalby

The increasing complexity of hardware features in recent processors makes high-performance code generation very challenging. In particular, several optimization targets have to be pursued simultaneously (minimizing L1/L2/L3/TLB misses and maximizing instruction-level parallelism). Very often, these optimization goals impose different and contradictory constraints on the transformations to be applied. We propose a new hierarchical compilation approach for the generation of high-performance code that relies on the use of state-of-the-art compilers. This approach is not application-dependent and does not require any assembly hand-coding. It relies on the decomposition of the original loop nest into simpler kernels, typically 1D to 2D loops, that are much simpler to optimize. We successfully applied this approach to optimize dense matrix multiply primitives (not only for the square case but also for the more general rectangular cases) and convolution. The optimized codes outperform ATLAS on Itanium 2 and Pentium 4 architectures and, in most cases, match hand-tuned vendor libraries (e.g., MKL).
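The decomposition idea can be illustrated with a minimal sketch (hypothetical names, not the paper's kernels): the inner loop of a matrix multiply is delegated to a 1D "axpy" kernel, the kind of simple loop a compiler can optimize well in isolation.

```python
def axpy(alpha, x, y):
    """1D kernel computing y += alpha * x -- simple enough for a compiler
    to vectorize and schedule near-optimally on its own."""
    for i in range(len(x)):
        y[i] += alpha * x[i]

def matmul(A, B):
    """Dense matrix multiply expressed entirely as calls to the 1D kernel."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            axpy(A[i][p], B[p], C[i])  # whole inner loop is the kernel
    return C
```

Optimizing the small kernel once and composing it is the essence of the hierarchical approach; the paper's contribution is doing this systematically with production compilers rather than hand-written assembly.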


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

PARCOACH: Combining static and dynamic validation of MPI collective communications

Emmanuelle Saillard; Patrick Carribault; Denis Barthou

Most scientific applications today are parallelized using MPI communications. Collective MPI communications have to be executed in the same order, and the same number of times, by all processes in their communicator; otherwise they do not conform to the standard, and a deadlock or other undefined behaviour can occur. As soon as the control flow involving these collective operations becomes more complex, in particular when it includes conditionals on process ranks, ensuring the correctness of such code is error-prone. We propose in this paper a static analysis to detect when such a situation occurs, combined with a code transformation that prevents deadlocking. We focus on blocking MPI collective operations in single-program-multiple-data applications, assuming MPI calls are not nested in multithreaded regions. We show on several benchmarks the small impact of our techniques on performance and the ease of their integration into the development process.
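The property being checked can be modeled without MPI at all. This hypothetical sketch (PARCOACH itself works statically on the compiler's control-flow graph) replays a program per rank and compares the sequences of collectives each rank would issue; a rank-dependent branch around a collective is exactly the bug class the paper targets.

```python
def trace_collectives(program, n_procs):
    """Run program(rank) for each rank, recording the collectives it issues."""
    return [tuple(program(rank)) for rank in range(n_procs)]

def check_collectives(traces):
    """MPI requires every rank to issue the same collectives in the same
    order and the same number of times."""
    return all(t == traces[0] for t in traces)

def buggy(rank):
    calls = ["MPI_Bcast"]
    if rank % 2 == 0:              # rank-dependent branch: only even ranks
        calls.append("MPI_Barrier")  # reach the barrier -> deadlock
    return calls
```

A dynamic check like this only sees one input; the paper's static analysis flags every execution path on which the collective sequences can diverge, before the program runs.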


Proceedings of the 20th European MPI Users' Group Meeting | 2013

Combining static and dynamic validation of MPI collective communications

Emmanuelle Saillard; Patrick Carribault; Denis Barthou

Collective MPI communications have to be executed in the same order, and the same number of times, by all processes in their communicator; otherwise a deadlock occurs. As soon as the control flow involving these collective operations becomes more complex, in particular when it includes conditionals on process ranks, ensuring the correctness of such code is error-prone. We propose in this paper a static analysis to detect when such a situation occurs, combined with a code transformation that prevents deadlocking. We show on several benchmarks the small impact of our techniques on performance and the ease of their integration into the development process.


Concurrency and Computation: Practice and Experience | 2015

Fine-grain data management directory for OpenMP 4.0 and OpenACC

Julien Jaeger; Patrick Carribault; Marc Pérache

Today's trend toward using accelerators in heterogeneous systems forces a paradigm shift in programming models. The use of low-level APIs for accelerator programming is tedious and not intuitive for casual programmers. To tackle this problem, recent approaches have focused on high-level directive-based models, with a standardization effort made with OpenACC and the accelerator directives in the OpenMP 4.0 release. The pragmas for data management automatically handle data exchange between the host and the device. To keep the runtime simple and efficient, severe restrictions hinder the use of these pragmas. To address this issue, we propose the design of a directory, along with a reduced runtime application binary interface, to correctly handle data management in these standards. A few improvements to our directory allow a more flexible use of data-management pragmas, with negligible overhead. Our design also fits multi-accelerator systems.
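The essential job of such a directory is to answer "does this host address already have a device copy, and at what offset?". This is a minimal sketch under assumed semantics (the class and method names are hypothetical, not the paper's ABI):

```python
class DataDirectory:
    """Tracks which host regions currently have a device-side copy."""

    def __init__(self):
        self._map = {}  # host base address -> (size, device handle)

    def enter_data(self, host_addr, size, device_handle):
        # Corresponds to mapping a region onto the device (e.g. 'enter data').
        self._map[host_addr] = (size, device_handle)

    def lookup(self, addr):
        # Find the mapped region containing addr; this is what lets pragmas
        # on sub-arrays or interior pointers reuse an existing device buffer.
        for base, (size, dev) in self._map.items():
            if base <= addr < base + size:
                return dev, addr - base
        return None

    def exit_data(self, host_addr):
        self._map.pop(host_addr, None)
```

Supporting lookup of interior addresses, rather than only exact base pointers, is one of the relaxations that makes the data-management pragmas more flexible without a heavyweight runtime.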


International Workshop on OpenMP | 2014

Static Validation of Barriers and Worksharing Constructs in OpenMP Applications

Emmanuelle Saillard; Patrick Carribault; Denis Barthou

The OpenMP specification requires that all threads in a team execute the same sequence of worksharing and barrier regions. An improper use of such directives may lead to deadlocks. In this paper we propose a static analysis to ensure this property is verified. The well-defined semantics of OpenMP programs make compiler analysis more effective. We propose a new compile-time method to identify, in OpenMP codes, potential improper uses of barriers and worksharing constructs, along with the execution paths responsible for these issues. We implemented our method in a GCC compiler plugin and show the small impact of our analysis on performance for the NAS-OMP benchmarks and a test case from a production industrial code.
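The path-based nature of the check can be sketched on a toy control-flow graph (a hypothetical representation, not the GCC plugin's): enumerate paths, count the barriers on each, and flag the function when two paths disagree.

```python
def paths_barrier_counts(cfg, node="entry", count=0):
    """Enumerate all paths through a toy acyclic CFG and return the number of
    barrier nodes on each path. cfg: node -> (is_barrier, successors)."""
    is_barrier, succs = cfg[node]
    count += int(is_barrier)
    if not succs:
        return [count]
    counts = []
    for s in succs:
        counts.extend(paths_barrier_counts(cfg, s, count))
    return counts

def barriers_consistent(cfg):
    # All threads follow some path; the program is safe only if every path
    # executes the same number of barriers (in this simplified model).
    return len(set(paths_barrier_counts(cfg))) == 1
```

The real analysis must also handle loops, match worksharing regions by identity and order, and report the offending paths, but the divergence test above is the heart of it.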


EuroMPI'12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

Improving MPI communication overlap with collaborative polling

Sylvain Didelot; Patrick Carribault; Marc Pérache; William Jalby

As parallel applications grow in complexity, their computational needs are continually growing. Recent trends in high-performance computing (HPC) have shown that improvements in single-core performance will not be sufficient to face the challenges of an Exascale machine: we expect an enormous growth in the number of cores as well as a multiplication of the data volume exchanged across compute nodes. To scale applications up to Exascale, the communication layer has to minimize the time spent waiting for network messages. This paper presents a message-progression strategy based on collaborative polling, which enables efficient, auto-adaptive overlap of communication phases with computation. This approach is novel in that it increases the application's overlap potential without introducing the overhead of a threaded message progression.
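The overlap idea can be modeled with a toy progress engine (hypothetical names; the real mechanism lives inside the MPI runtime): between units of its own computation, a task donates a poll to the network engine, so pending messages complete during compute instead of piling up until the next wait.

```python
def progress_engine(pending):
    """Toy network progression: each poll completes at most one pending message."""
    return pending.pop(0) if pending else None

def compute_with_polling(work_items, pending):
    """Interleave application computation with collaborative polls."""
    results, completed = [], []
    for item in work_items:
        results.append(item * item)       # the application's own computation
        msg = progress_engine(pending)    # donate idle cycles to progression
        if msg is not None:
            completed.append(msg)
    while pending:                        # drain whatever is left at the wait
        completed.append(pending.pop(0))
    return results, completed
```

A dedicated progression thread would achieve the same overlap but steals a core and adds synchronization cost; collaborative polling reuses cycles the task would otherwise spend blocked.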


Proceedings of the 22nd European MPI Users' Group Meeting | 2015

An MPI Halo-Cell Implementation for Zero-Copy Abstraction

Jean-Baptiste Besnard; Allen D. Malony; Sameer Shende; Marc Pérache; Patrick Carribault; Julien Jaeger

In the race to Exascale, the advent of many-core processors will bring a shift in parallel computing architectures toward systems of much higher concurrency, but with relatively less memory per thread. This shift raises concerns about the adaptability of current-generation HPC software to this new world. In this paper, we study domain splitting over an increasing number of memory areas as an example problem where a negative performance impact on computation could arise. We identify the specific parameters that drive scalability for this problem, and then model the halo-cell ratio on common mesh topologies to study the memory and communication implications. Such analysis argues for the use of shared-memory parallelism, such as OpenMP, to address the performance problems that could occur. In contrast, we propose an original solution based entirely on MPI programming semantics, while providing the performance advantages of hybrid parallel programming. Our solution transparently replaces halo-cell transfers with pointer exchanges when MPI tasks are running on the same node, effectively removing memory copies. The results we present demonstrate gains in terms of memory and computation time on Xeon Phi (compared to OpenMP-only and MPI-only) using a representative domain-decomposition benchmark.
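The zero-copy idea reduces to one decision, sketched here with object references standing in for pointers (a hypothetical model, not the paper's MPI-level implementation): off-node the halo must be copied over the network, on-node the "transfer" degenerates to handing over a reference to the neighbor's cells.

```python
def exchange_halo(neighbor_halo, same_node):
    """Fetch a neighbor's halo cells for a domain-decomposed stencil."""
    if same_node:
        return neighbor_halo      # pointer exchange: both ranks share the
                                  # cells, no memory copy and no extra storage
    return list(neighbor_halo)    # network transfer modeled as a real copy
```

Because the interface is the same in both cases, the application keeps plain MPI semantics while on-node neighbors silently stop paying the memory and bandwidth cost of duplicated halo cells.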


IEEE International Conference on High Performance Computing, Data and Analytics | 2004

Branch strategies to optimize decision trees for wide-issue architectures

Patrick Carribault; Christophe Lemuet; Jean-Thomas Acquaviva; Albert Cohen; William Jalby

Branch predictors are a critical design issue for today's instruction-hungry processors. We study two important domains where the optimization of decision trees, implemented through switch-case or nested if-then-else constructs, makes precise modeling of these hardware mechanisms decisive for performance: compute-intensive libraries with versioning and cloning, and high-performance interpreters. Contrary to common belief, the complexity of recent microarchitectures does not necessarily hamper the design of accurate cost models, in the special case of decision trees. We build a simple model that illustrates why decision-tree performance is predictable. Based on this model, we compare the most significant code generation strategies on the Itanium 2 processor. We show that no strategy dominates in all cases, and although they used to be penalized by traditional superscalar processors, indirect branches regain a lot of interest in the context of predicated execution and delayed branches. We validate our study with an improvement of 15% to 40% over the Intel ICC compiler for a DAXPY code focused on short vectors.
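The two code-generation strategies being compared can be shown side by side on a toy classifier (hypothetical example, not the paper's benchmark): a tree of conditional branches, each of which the predictor must guess, versus a single indirect branch through a jump table.

```python
def classify_nested(x):
    # Nested if-then-else: a tree of conditional branches; the hardware
    # predicts each branch separately along the taken path.
    if x < 2:
        return "a" if x < 1 else "b"
    else:
        return "c" if x < 3 else "d"

TABLE = ["a", "b", "c", "d"]

def classify_indirect(x):
    # Switch-style dispatch: one indirect branch through a jump table
    # (assumes x >= 0; values above the last bucket are clamped).
    return TABLE[min(int(x), 3)]
```

Which form wins depends on how predictable the branch outcomes are and on the cost of a mispredicted indirect jump, which is why the paper finds that no single strategy dominates.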

Collaboration


Dive into Patrick Carribault's collaborations.

Top Co-Authors


Saurabh Sharma

North Carolina State University
