Publications

Featured research published by Franck Cappello.


ieee international conference on high performance computing data and analytics | 2011

The International Exascale Software Project roadmap

Jack J. Dongarra; Pete Beckman; Terry Moore; Patrick Aerts; Giovanni Aloisio; Jean Claude Andre; David Barkai; Jean Yves Berthou; Taisuke Boku; Bertrand Braunschweig; Franck Cappello; Barbara M. Chapman; Xuebin Chi; Alok N. Choudhary; Sudip S. Dosanjh; Thom H. Dunning; Sandro Fiore; Al Geist; Bill Gropp; Robert J. Harrison; Mark Hereld; Michael A. Heroux; Adolfy Hoisie; Koh Hotta; Zhong Jin; Yutaka Ishikawa; Fred Johnson; Sanjay Kale; R.D. Kenway; David E. Keyes

Over the last 20 years, the open-source community has provided more and more software on which the world's high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.


cluster computing and the grid | 2001

XtremWeb: a generic global computing system

Gilles Fedak; Cécile Germain; Vincent Néri; Franck Cappello

Global computing achieves high throughput computing by harvesting a very large number of unused computing resources connected to the Internet. This parallel computing model targets a parallel architecture defined by a very high number of nodes, poor communication performance and continuously varying resources. The unprecedented scale of the global computing architecture paradigm requires us to revisit many basic issues related to parallel architecture programming models, performance models, and classes of applications or algorithms suitable for this architecture. XtremWeb is an experimental global computing platform dedicated to providing a tool for such studies. The paper presents the design of XtremWeb. Two essential features of this design are multi-application support and high performance. Accepting multiple applications allows institutions or enterprises to set up their own global computing applications or experiments. High performance is ensured by scalability, fault tolerance, efficient scheduling and a large base of volunteer PCs. We also present an implementation of the first global application running on XtremWeb.
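
The pull model sketched above — idle hosts asking a coordinator for work, with the coordinator re-issuing tasks lost to volatile hosts — can be illustrated with a toy worker loop. This is not XtremWeb code: fetch_task, run_task and send_result are hypothetical placeholders standing in for the platform's actual protocol, stubbed here so the sketch compiles.

/* Toy volunteer-computing worker loop. fetch_task(), run_task() and
   send_result() are hypothetical placeholders, not XtremWeb APIs. */
#include <stdio.h>

typedef struct { int id; double input; } task_t;

static int next_task = 0;

/* Stub coordinator: hands out three tasks, then reports no work. */
static int fetch_task(task_t *t) {
    if (next_task >= 3) return 0;
    t->id = next_task++;
    t->input = 1.5 * t->id;
    return 1;
}

/* Compute using otherwise-idle cycles. */
static double run_task(const task_t *t) { return t->input * t->input; }

/* Upload the result; a real coordinator re-schedules the task on
   another host if this worker disappears (fault tolerance). */
static void send_result(int id, double r) { printf("task %d -> %.2f\n", id, r); }

int main(void) {
    task_t t;
    while (fetch_task(&t))        /* a real worker would back off and poll */
        send_result(t.id, run_task(&t));
    return 0;
}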


international parallel and distributed processing symposium | 2009

Cost-benefit analysis of Cloud Computing versus desktop grids

Derrick Kondo; Bahman Javadi; Paul Malecot; Franck Cappello; David P. Anderson

Cloud Computing has taken commercial computing by storm. However, adoption of cloud computing platforms and services by the scientific community is in its infancy as the performance and monetary cost-benefits for scientific applications are not perfectly clear. This is especially true for desktop grid (aka volunteer computing) applications. We compare and contrast the performance and monetary cost-benefits of clouds for desktop grid applications, ranging in computational size and storage. We address the following questions: (i) What are the performance tradeoffs in using one platform over the other? (ii) What are the specific resource requirements and monetary costs of creating and deploying applications on each platform? (iii) In light of those monetary and performance cost-benefits, how do these platforms compare? (iv) Can cloud computing platforms be used in combination with desktop grids to improve cost-effectiveness even further? We examine those questions using performance measurements and monetary expenses of real desktop grids and the Amazon Elastic Compute Cloud.
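
A simple way to make question (iii) concrete is a break-even computation between an hourly-billed cloud and a grid whose hosts are free but slower and volatile. All numbers below are hypothetical placeholders, not measurements from the paper.

/* Illustrative cloud vs. desktop-grid cost comparison.
   Every number here is a made-up placeholder, not the paper's data. */
#include <stdio.h>

int main(void) {
    double cloud_price = 0.10;        /* $ per instance-hour (assumed)       */
    double cloud_rate  = 1.0;         /* work units per instance-hour        */
    double grid_cost   = 50.0;        /* $ per month: grid server, admin     */
    double grid_rate   = 0.2;         /* work units per volunteer host-hour  */
    int    grid_hosts  = 400;
    double hours       = 30.0 * 24.0; /* one month */

    double grid_units = grid_hosts * grid_rate * hours;
    double cloud_cost = grid_units / cloud_rate * cloud_price;

    printf("desktop grid: %.0f units for $%.0f/month\n", grid_units, grid_cost);
    printf("same work on the cloud: $%.2f/month\n", cloud_cost);
    return 0;
}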


ieee international conference on high performance computing data and analytics | 2009

Toward Exascale Resilience

Franck Cappello; Al Geist; Bill Gropp; Laxmikant V. Kalé; Bill Kramer; Marc Snir

Over the past few years resilience has become a major issue for high-performance computing (HPC) systems, particularly in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application-level checkpoint/restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, possibly radically disruptive, to run applications until their normal termination, despite the essentially unstable nature of exascale systems. Yet, the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations and research issues considered determinant by several complementary experts of HPC in applications, programming models, distributed systems and system management.
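
The checkpoint/restart argument can be made quantitative with the classic first-order model (Young's approximation) — a standard textbook formula, shown here for illustration and not taken from this paper. The optimal checkpoint interval is tau = sqrt(2*C*M) for checkpoint cost C and system MTBF M, and machine efficiency collapses as C approaches M.

/* Young's first-order checkpointing model (illustrative, not from the
   paper). In each period of length tau + C, only tau is useful work, and
   a failure (MTBF M) wastes a restart R plus, on average, half a period. */
#include <math.h>
#include <stdio.h>

static double efficiency(double tau, double C, double M, double R) {
    double waste_ckpt = C / (tau + C);
    double waste_fail = (R + (tau + C) / 2.0) / M;
    return (1.0 - waste_ckpt) * (1.0 - waste_fail);
}

int main(void) {
    double C = 1800.0;                    /* 30-min checkpoint (assumed) */
    double R = 300.0;                     /* 5-min restart (assumed)     */
    double mtbfs[] = { 86400.0, 3600.0 }; /* one day vs. one hour        */
    for (int i = 0; i < 2; i++) {
        double M   = mtbfs[i];
        double tau = sqrt(2.0 * C * M);   /* Young's optimal interval    */
        printf("MTBF %6.0f s: tau_opt = %6.0f s, efficiency = %.2f\n",
               M, tau, efficiency(tau, C, M, R));
    }
    return 0;                             /* efficiency drops ~0.80 -> ~0.11 */
}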


conference on high performance computing (supercomputing) | 2000

MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks

Franck Cappello; Daniel Etiemble

The hybrid memory model of clusters of multiprocessors raises two issues: programming model and performance. Many parallel programs have been written by using the MPI standard. To evaluate the pertinence of hybrid models for existing MPI codes, we compare a unified model (MPI) and a hybrid one (OpenMP fine-grain parallelization after profiling) for the NAS 2.3 benchmarks on two IBM SP systems. The superiority of one model depends on (1) the level of shared-memory parallelization, (2) the communication patterns and (3) the memory access patterns. The relative speeds of the main architecture components (CPU, memory, and network) are of tremendous importance for selecting one model. With the hybrid model used here, our results show that a unified MPI approach is better for most of the benchmarks. The hybrid approach becomes better only when fast processors make the communication performance significant and the level of parallelization is sufficient.
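
For reference, the two styles compared here differ only in how the node-local work is parallelized. A minimal generic sketch of the hybrid style (not the NAS benchmark code) looks like this: MPI between processes, an OpenMP parallel loop inside each process.

/* Minimal hybrid MPI+OpenMP sketch: MPI across nodes, OpenMP threads
   within each process. Generic illustration, not the NAS benchmarks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double a[N];
    double local = 0.0, global = 0.0;

    /* Fine-grain OpenMP parallelization of the local loop. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)(i + rank);
        local += a[i];
    }

    /* MPI handles the inter-process communication. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %e\n", global);

    MPI_Finalize();
    return 0;
}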


ieee international conference on high performance computing data and analytics | 2011

FTI: high performance fault tolerance interface for hybrid systems

Leonardo Bautista-Gomez; Seiji Tsuboi; Dimitri Komatitsch; Franck Cappello; Naoya Maruyama; Satoshi Matsuoka

Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique that integrates a highly reliable, topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1-petaflops runs (1152 GPUs) while checkpointing at high frequency.
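
FTI exposes a small C API for this scheme. The outline below follows the published interface (FTI_Init, FTI_Protect, FTI_Snapshot, FTI_Finalize); exact signatures and configuration keys can differ across FTI versions, so treat it as a sketch rather than a drop-in example.

/* Outline of protecting an iterative solver with FTI. Based on FTI's
   published C API; check your FTI version for exact signatures and the
   config-file format. */
#include <mpi.h>
#include <fti.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    FTI_Init("config.fti", MPI_COMM_WORLD);  /* reads checkpoint levels;
        after this, applications communicate over FTI_COMM_WORLD */

    int iter = 0;
    double state[1024] = {0};

    /* Register the data FTI must save and restore. */
    FTI_Protect(0, &iter, 1, FTI_INTG);
    FTI_Protect(1, state, 1024, FTI_DBLE);

    for (; iter < 10000; iter++) {
        /* FTI_Snapshot checkpoints at the intervals set in config.fti
           (local disk, partner copy, RS-encoded, or PFS level) and
           restores the protected buffers automatically after a restart. */
        FTI_Snapshot();
        state[iter % 1024] += 1.0;  /* stand-in for a real solver step */
    }

    FTI_Finalize();
    MPI_Finalize();
    return 0;
}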


ieee international conference on high performance computing data and analytics | 2009

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

Franck Cappello

The emergence of petascale systems and the promise of future exascale systems have reinvigorated the community's interest in how to manage failures in such systems and ensure that large applications, lasting several hours or tens of hours, are completed successfully. Most of the existing results for several key mechanisms associated with fault tolerance in high-performance computing (HPC) platforms follow the rollback-recovery approach. Over the last decade, these mechanisms have received a lot of attention from the community with different levels of success. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolutions of large-scale systems. There is room and even a need for new approaches. Opportunities may come from different origins: diskless checkpointing, algorithm-based fault tolerance, proactive operation, speculative execution, software transactional memory, forward recovery, etc. The contributions of this paper are as follows: (1) we summarize and analyze the existing results concerning the failures in large-scale computers and point out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems; (2) we sketch most of the known opportunities and analyze their associated limitations; (3) we extract and express the challenges that the HPC community will have to face for addressing the stringent issue of failures in HPC systems.
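
Of the opportunities listed, algorithm-based fault tolerance is the easiest to show in a few lines: the algorithm carries a checksum alongside its data, so a single lost value at a known position can be rebuilt without rolling back. The toy sketch below is illustrative only and not from the paper; real ABFT encodes checksum rows/columns inside matrix algorithms.

/* Toy algorithm-based fault tolerance (ABFT): a sum checksum lets us
   reconstruct one lost element at a known position. Illustrative only. */
#include <stdio.h>

int main(void) {
    double v[5] = { 2.0, 4.0, 8.0, 16.0, 32.0 };
    double checksum = 0.0;
    for (int i = 0; i < 5; i++) checksum += v[i];  /* encode */

    int lost = 2;          /* element 2 lost: an erasure at a known index */
    v[lost] = 0.0;

    double partial = 0.0;  /* recover from the surviving elements */
    for (int i = 0; i < 5; i++) if (i != lost) partial += v[i];
    v[lost] = checksum - partial;

    printf("recovered v[%d] = %.1f\n", lost, v[lost]);  /* prints 8.0 */
    return 0;
}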


ieee international conference on high performance computing data and analytics | 2014

Addressing failures in exascale computing

Marc Snir; Robert W. Wisniewski; Jacob A. Abraham; Sarita V. Adve; Saurabh Bagchi; Pavan Balaji; Jim Belak; Pradip Bose; Franck Cappello; Bill Carlson; Andrew A. Chien; Paul W. Coteus; Nathan DeBardeleben; Pedro C. Diniz; Christian Engelmann; Mattan Erez; Saverio Fazzari; Al Geist; Rinku Gupta; Fred Johnson; Sriram Krishnamoorthy; Sven Leyffer; Dean A. Liberty; Subhasish Mitra; Todd S. Munson; Rob Schreiber; Jon Stearley; Eric Van Hensbergen

We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.


ieee international conference on high performance computing data and analytics | 2006

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

Aurelien Bouteiller; Thomas Herault; Géraud Krawezik; Pierre Lemarinier; Franck Cappello

High-performance computing platforms such as Clusters, Grids and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applications. We present an extensive related-work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches from coordinated checkpoint to uncoordinated checkpoint associated with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them with the NAS benchmarks, using an original fault-tolerance test. Finally, we outline the lessons learned from this in-depth comparison of fault-tolerant protocols for MPI applications.
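
As a rough illustration of the coordinated end of that spectrum: if every rank finishes its outstanding communication and then synchronizes, the set of per-rank dumps forms a consistent global state. MPICH-V implements its protocols transparently inside the MPI library; the application-level sketch below is drastically simplified and only meant to convey the idea.

/* Drastically simplified coordinated checkpoint: assuming all sends
   have completed matching receives before the barrier, no message is
   in flight, so the per-rank dumps form a consistent global snapshot.
   MPICH-V does this transparently inside the library. */
#include <mpi.h>
#include <stdio.h>

static void checkpoint(int rank, const double *state, int n) {
    char name[64];
    snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);
    FILE *f = fopen(name, "wb");
    if (!f) return;              /* a real protocol would retry or abort */
    fwrite(state, sizeof(double), n, f);
    fclose(f);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double state[256] = {0};
    state[0] = rank;

    for (int step = 0; step < 100; step++) {
        /* ... exchange messages, compute ... */
        if (step % 25 == 0) {
            MPI_Barrier(MPI_COMM_WORLD);   /* coordinate: quiesce traffic */
            checkpoint(rank, state, 256);  /* consistent global snapshot  */
        }
    }
    MPI_Finalize();
    return 0;
}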


Future Generation Computer Systems | 2007

Characterizing resource availability in enterprise desktop grids

Derrick Kondo; Gilles Fedak; Franck Cappello; Andrew A. Chien; Henri Casanova

Desktop grids, which use the idle cycles of many desktop PCs, are among the largest distributed systems in the world. Despite the popularity and success of many desktop grid projects, the heterogeneity and volatility of hosts within desktop grids have been poorly understood. Yet, resource characterization is essential for accurate simulation and modelling of such platforms. In this paper, we present application-level traces of four real desktop grids that can be used for simulation and modelling purposes. In addition, we describe aggregate and per-host statistics that reflect the heterogeneity and volatility of desktop grid resources. Finally, we apply our characterization to develop a performance model for desktop grid applications for various task granularities, and then use a cluster equivalence metric to quantify the utility of the desktop grid relative to that of a dedicated cluster for task-parallel applications.
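
The cluster-equivalence idea can be made concrete: compute how many dedicated cluster nodes would deliver the same task throughput as the volatile grid, for each task granularity. The toy model below (made-up host counts, speeds and volatility) is a simplified reading of the metric, not the paper's formulation.

/* Simplified cluster-equivalence computation. Made-up parameters:
   1000 volunteer hosts, each ~30% the speed of a dedicated node;
   longer tasks are more likely to be interrupted before finishing. */
#include <stdio.h>

int main(void) {
    double task_minutes[] = { 5.0, 15.0, 60.0 };
    for (int i = 0; i < 3; i++) {
        double t = task_minutes[i];
        double node_rate  = 60.0 / t;           /* tasks/hour, dedicated node */
        double completion = 1.0 - t / 120.0;    /* toy volatility model       */
        double grid_rate  = 1000 * 0.3 * node_rate * completion;
        printf("%5.1f-min tasks: 1000-host grid ~ %.0f dedicated nodes\n",
               t, grid_rate / node_rate);       /* equivalence shrinks as
                                                   task granularity grows */
    }
    return 0;
}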

Collaboration

Dive into Franck Cappello's collaborations.

Top Co-Authors

Sheng Di

Argonne National Laboratory


Leonardo Bautista-Gomez

Barcelona Supercomputing Center

Amina Guermouche

École normale supérieure de Lyon

Yves Robert

French Institute for Research in Computer Science and Automation
