Franck Cappello | Researchain

Publication

Featured research published by Franck Cappello.

conference on high performance computing (supercomputing) | 2002

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

George Bosilca; Aurelien Bouteiller; Franck Cappello; Samir Djilali; Gilles Fedak; Cecile Germain; Thomas Hérault; Pierre Lemarinier; Oleg Lodygensky; Frédéric Magniette; Vincent Néri; Anton Selikhov

Global Computing platforms, large scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures or disconnections are frequent events. This Volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node Volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.

Future Generation Computer Systems | 2005

Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid

Franck Cappello; Samir Djilali; Gilles Fedak; Thomas Hérault; Frédéric Magniette; Vincent Néri; Oleg Lodygensky

Global Computing systems belong to the class of large-scale distributed systems. Their properties (high computational, storage and communication performance potential, and high resilience) make them attractive in academia and industry as computing infrastructures that complement more classical infrastructures such as clusters or supercomputers. However, generalizing the use of these systems in a multi-user and multi-parallel programming context involves finding solutions and providing mechanisms for many issues, such as programming bag of tasks and message passing parallel applications; securing the application, the system itself and the computing nodes; and deploying the systems to harness resources managed in different ways. In this paper, we present our research, often influenced by user demands, towards a Computational peer-to-peer system called XtremWeb. We describe (a) the architecture of the system and its motivations, (b) the parallel programming paradigms available in XtremWeb and how they are implemented, (c) the deployment issues and the mechanisms used to simultaneously harness uncoordinated sets of resources and resources managed by batch schedulers, and (d) the security issue and how we address, inside XtremWeb, the protection of the computing resources. We present two multi-parametric applications to be used in production: Aires, belonging to the high energy physics (HEP) Auger project, and a protein conformation predictor using a molecular dynamics simulator. To evaluate the performance and volatility tolerance, we present experimental results for bag of tasks applications and message passing applications. We show that the system can tolerate massive failure and we discuss the performance of the node protection mechanism. Based on the XtremWeb project developments and evolutions, we discuss the convergence between Global Computing systems and Grid.

conference on high performance computing (supercomputing) | 2003

MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

Aurelien Bouteiller; Franck Cappello; Thomas Hérault; Géraud Krawezik; Pierre Lemarinier; Frédéric Magniette

Execution of MPI applications on clusters and Grid deployments suffering from node and network failures motivates the use of fault tolerant MPI implementations. We present MPICH-V2 (the second protocol of the MPICH-V project), an automatic fault tolerant MPI implementation using an innovative protocol that removes the most limiting factor of the pessimistic message logging approach: reliable logging of in-transit messages. MPICH-V2 relies on uncoordinated checkpointing, sender-based message logging and remote reliable logging of message logical clocks. This paper presents the architecture of MPICH-V2, its theoretical foundation and the performance of the implementation. We compare MPICH-V2 to MPICH-V1 and MPICH-P4, evaluating a) its point-to-point performance, b) the performance for the NAS benchmarks, and c) the application performance when many faults occur during the execution. Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while dramatically reducing the number of reliable nodes compared to MPICH-V1.
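The combination named in this abstract (uncoordinated checkpointing, sender-based message logging, and reliable remote logging of reception events) can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not the MPICH-V2 protocol: the names `EventLogger`, `Process`, `recover` and `demo` are hypothetical, and the logged event is reduced to a (sender, sequence number) pair rather than the logical clocks the paper refers to.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Hypothetical stand-in for a reliable remote store of reception events."""
    events: dict = field(default_factory=dict)  # receiver id -> [(sender id, seq)]

    def record(self, receiver, sender, seq):
        self.events.setdefault(receiver, []).append((sender, seq))

    def history(self, receiver):
        return list(self.events.get(receiver, []))


@dataclass
class Process:
    pid: int
    logger: EventLogger
    send_seq: int = 0
    sent_log: dict = field(default_factory=dict)  # (dest id, seq) -> payload, kept by the sender
    state: list = field(default_factory=list)     # volatile state: payloads delivered so far
    checkpoint: tuple = (0, ())                   # (deliveries covered, saved state)

    def send(self, dest, payload):
        self.send_seq += 1
        # Sender-based logging: the payload stays in the sender's own memory.
        self.sent_log[(dest.pid, self.send_seq)] = payload
        dest.receive(self.pid, self.send_seq, payload)

    def receive(self, sender_pid, seq, payload):
        # Pessimistic: the reception event is logged before the message is delivered.
        self.logger.record(self.pid, sender_pid, seq)
        self.state.append(payload)

    def take_checkpoint(self):
        # Uncoordinated: each process checkpoints on its own, without synchronising.
        self.checkpoint = (len(self.logger.history(self.pid)), tuple(self.state))

    def recover(self, peers):
        # Restart from the last checkpoint, then replay later receptions in logged order,
        # fetching each payload from the log of the process that sent it.
        delivered, saved = self.checkpoint
        self.state = list(saved)
        for sender_pid, seq in self.logger.history(self.pid)[delivered:]:
            self.state.append(peers[sender_pid].sent_log[(self.pid, seq)])


def demo():
    logger = EventLogger()
    a, b, c = Process(0, logger), Process(1, logger), Process(2, logger)
    peers = {p.pid: p for p in (a, b, c)}
    a.send(c, "a1")
    c.take_checkpoint()
    b.send(c, "b1")
    a.send(c, "a2")
    before = list(c.state)
    c.state = []          # simulate a crash that loses volatile state
    c.recover(peers)
    return before, c.state
```

In this sketch only the small reception events go to the reliable store, while message payloads stay with their senders, which is the intuition behind needing fewer reliable nodes.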

Concurrency and Computation: Practice and Experience | 2006

Performance comparison of MPI and OpenMP on shared memory multiprocessors

Géraud Krawezik; Franck Cappello

When using a shared memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles.

conference on high performance computing (supercomputing) | 2004

RPC-V: Toward Fault-Tolerant RPC for Internet Connected Desktop Grids with Volatile Nodes

Samir Djilali; Thomas Hérault; Oleg Lodygensky; Tangui Morlier; Gilles Fedak; Franck Cappello

RPC is one of the programming models envisioned for the Grid. In Internet connected Large Scale Grids such as Desktop Grids, node and network failures are not rare events. This paper provides several contributions, examining the feasibility and limits of fault-tolerant RPC on these platforms. First, we characterize these Grids from their fundamental features and demonstrate that their application scope should be safely restricted to stateless services. Second, we present a new fault-tolerant RPC protocol associating an original combination of three-tier architecture, passive replication and message logging. We describe RPC-V, an implementation of the proposed protocol within the XtremWeb Desktop Grid middleware. Third, we evaluate the performance of RPC-V and the impact of faults on the execution time, using a real life application on a Desktop Grid testbed assembling nodes in France and the USA. We demonstrate that RPC-V allows the applications to continue their execution while key system components fail.

ieee international conference on high performance computing data and analytics | 2004

Coordinated checkpoint versus message log for fault tolerant MPI

Pierre Lemarinier; Aurelien Bouteiller; Géraud Krawezik; Franck Cappello

Large clusters, high availability clusters and grid deployments often suffer from network, node or operating system faults and thus require the use of fault tolerant programming models. MPI is one of the most widely adopted programming models for high performance computing. There are several approaches for fault tolerance in an MPI environment. The automatic and transparent ones are based on either coordinated or uncoordinated checkpoint associated with a message log strategy. There are many protocols and optimisations for these approaches and several implementations have been made. However, few comparisons between them exist. Coordinated checkpoint has the advantage of a very low overhead as long as the execution stays fault free. In contrast, uncoordinated checkpoint must be complemented by a message log protocol, which adds a significant penalty for all message transfers even for fault free executions. The drawbacks of coordinated checkpoint are the synchronisation cost before the checkpoint, the synchronised checkpoint cost and the restart cost after a fault. Message log does not suffer from these problems, as it processes checkpoint and restart independently. These differences suggest that the best approach depends on the fault frequency. This paper investigates this question using a fair experimental protocol: we implement and test two protocols (coordinated checkpoint and pessimistic message log) on the same system and we compare them on a cluster according to the frequency of faults that are generated artificially. The main conclusion is that uncoordinated checkpoint is relevant for a large scale cluster from one fault every hour for applications with large datasets.

international conference on parallel architectures and languages europe | 1992

PTAH: Introduction to a New Parallel Architecture for Highly Numeric Processing

Franck Cappello; Jean-Luc Béchennec; Jean-Louis Giavitto

This paper proposes a new architectural design for high performance parallel computers: the one-cycle machine. In such a computer, memory access, network access, instruction sequencing and data computation all take the same duration: one clock cycle. We first consider the communication network efficiency as the main critical resource. We show that matching the network performance to the processing element power matters more than the CPU power in itself with respect to the global processing effectiveness. Two guidelines are derived from our analysis and lead to the design of PTAH. Two simple examples are used to illustrate the interest of PTAH for the execution of numeric applications. Finally, some hardware features are proposed for a PTAH implementation able to reach TeraFLOPS performance.

european pvm mpi users group meeting on recent advances in parallel virtual machine and message passing interface | 2002

MPICH-CM: A Communication Library Design for a P2P MPI Implementation

Anton Selikhov; George Bosilca; Cecile Germain; Gilles Fedak; Franck Cappello

The paper presents MPICH-CM, a new architecture of communications in message-passing systems, developed for MPICH-V, an MPI implementation for P2P systems. MPICH-CM implies communications between nodes through special Channel Memories, introducing fully decoupled communication media. Some new properties of communications based on MPICH-CM are described in comparison with other communication architectures, with emphasis on grid-like and volunteer computing systems. The first implementation of MPICH-CM is realized as a special MPICH device connected with Channel Memory servers. To estimate the overhead of MPICH-CM, its performance is presented for basic point-to-point and collective operations in comparison with the MPICH p4 implementation.
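The idea of nodes communicating only through an intermediate Channel Memory, rather than directly with each other, can be pictured with a minimal in-process sketch. The class and method names here (`ChannelMemory`, `put`, `get`) are hypothetical and chosen for illustration; the actual MPICH-CM device and its server protocol are not described in this abstract.

```python
from collections import defaultdict, deque


class ChannelMemory:
    """Hypothetical in-memory stand-in for a Channel Memory server."""

    def __init__(self):
        # One FIFO per (source rank, destination rank) pair.
        self._queues = defaultdict(deque)

    def put(self, src, dst, payload):
        # The sender deposits the message and returns; the receiver need not be reachable.
        self._queues[(src, dst)].append(payload)

    def get(self, src, dst):
        # The receiver fetches the oldest pending message from src, or None if there is none.
        queue = self._queues[(src, dst)]
        return queue.popleft() if queue else None


cm = ChannelMemory()
cm.put(0, 1, "first")    # rank 0 sends before rank 1 asks for anything
cm.put(0, 1, "second")
received = [cm.get(0, 1), cm.get(0, 1), cm.get(0, 1)]
```

Because sender and receiver each interact only with the intermediary, neither depends on the other being connected at the same moment, which is one way to read "fully decoupled" in the abstract.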

grid computing | 2008

OpenWP: Combining annotation language and workflow environments for porting existing applications on grids

Matthieu Cargnelli; Guillaume Alléon; Franck Cappello

Many industrial companies are looking for programming environments to port their existing applications onto the grid. Workflow environments are an appealing solution for them as they match the architectural features of grids (hierarchy, heterogeneity, dynamism). However, current workflow environments require significant effort from programmers to adapt existing applications. In this paper, we propose the OpenWP programming and runtime environment, in order to ease the adaptation and execution of existing applications onto grids. OpenWP essentially allows the programmer 1) to express the parallelism and distribution in existing codes using directives and 2) to execute the applications on grids using existing workflow engines. This paper presents the OpenWP environment in detail and evaluates its performance with a non-trivial industrial mesher application used by an aerospace company.

Archive | 2001