Publications

Featured research published by Edgar Gabriel.


Lecture Notes in Computer Science | 2004

Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation

Edgar Gabriel; Graham E. Fagg; George Bosilca; Thara Angskun; Jack J. Dongarra; Jeffrey M. Squyres; Vishal Sahay; Prabhanjan Kambadur; Andrew Lumsdaine; Ralph H. Castain; David Daniel; Richard L. Graham; Timothy S. Woodall

A large number of MPI implementations are currently available, each of which emphasizes different aspects of high-performance computing or is intended to solve a specific research problem. The result is a myriad of incompatible MPI implementations, all of which require separate installation, and the combination of which presents significant logistical challenges for end users. Building upon prior research, and influenced by experience gained from the code bases of the LAM/MPI, LA-MPI, and FT-MPI projects, Open MPI is an all-new, production-quality MPI-2 implementation that is fundamentally centered around component concepts. Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI. Its component architecture provides a stable platform for third-party research and enables the run-time composition of independent software add-ons. This paper presents a high-level overview of the goals, design, and implementation of Open MPI.
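The run-time composition of components described above can be sketched as a small registry pattern: interchangeable components share an interface and are selected by name at run time. This is an illustrative sketch only; the class names (`Component`, `Registry`, `TcpComponent`) are hypothetical and are not Open MPI's actual component (MCA) API.

```python
# Illustrative sketch of run-time component composition.
# Names are hypothetical; this is not Open MPI's MCA API.

class Component:
    """Base class for interchangeable communication components."""
    name = "base"

    def send(self, data):
        raise NotImplementedError

class TcpComponent(Component):
    """A hypothetical TCP transport component."""
    name = "tcp"

    def send(self, data):
        return f"tcp:{data}"

class SharedMemComponent(Component):
    """A hypothetical shared-memory transport component."""
    name = "sm"

    def send(self, data):
        return f"sm:{data}"

class Registry:
    """Selects a component by name at run time, so independent
    add-ons can be composed without rebuilding the library."""
    def __init__(self):
        self._components = {}

    def register(self, comp):
        self._components[comp.name] = comp

    def select(self, name):
        return self._components[name]

registry = Registry()
registry.register(TcpComponent())
registry.register(SharedMemComponent())
print(registry.select("sm").send("hello"))  # prints "sm:hello"
```

The point of the pattern is that a third-party component only has to implement the shared interface to be usable, which mirrors the "stable platform for third-party research" claim in the abstract.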


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 1998

Distributed Computing in a Heterogeneous Computing Environment

Edgar Gabriel; Michael M. Resch; Thomas Beisel; Rainer Keller

Distributed computing is a means to overcome the limitations of single computing systems. In this paper we describe how clusters of heterogeneous supercomputers can be used to run a single application or a set of applications. We concentrate on the communication problem in such a configuration and present a software library called PACX-MPI that was developed to allow a single system image from the point of view of an MPI programmer. We describe the concepts that have been implemented for heterogeneous clusters of this type and give a description of real applications using this library.


International Parallel and Distributed Processing Symposium | 2005

Performance analysis of MPI collective operations

Jelena Pješivac-Grbović; Thara Angskun; George Bosilca; Graham E. Fagg; Edgar Gabriel; Jack J. Dongarra

Previous studies of application usage show that the performance of collective communications is critical for high-performance computing, yet it is often overlooked in comparison with point-to-point performance. In this paper, we analyze and attempt to improve intra-cluster collective communication in the context of the widely deployed MPI programming paradigm by extending accepted models of point-to-point communication, such as Hockney, LogP/LogGP, and PLogP. The predictions from the models were compared to experimentally gathered data, and our findings were used to optimize the implementation of collective operations in the FT-MPI library.
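The Hockney model named above expresses point-to-point transfer time as latency plus message size times inverse bandwidth, T(m) = α + βm; a binomial-tree broadcast then costs roughly ⌈log₂ P⌉ such steps. A small illustrative calculation (the parameter values below are made up, not from the paper):

```python
import math

def hockney_time(msg_bytes, alpha, beta):
    """Hockney point-to-point model: T(m) = alpha + beta * m,
    where alpha is latency (s) and beta is inverse bandwidth (s/byte)."""
    return alpha + beta * msg_bytes

def binomial_bcast_time(msg_bytes, nprocs, alpha, beta):
    """A binomial-tree broadcast needs ceil(log2(P)) point-to-point steps."""
    steps = math.ceil(math.log2(nprocs))
    return steps * hockney_time(msg_bytes, alpha, beta)

# Hypothetical cluster parameters: 5 us latency, 1 GB/s bandwidth.
alpha, beta = 5e-6, 1e-9
t = binomial_bcast_time(1 << 20, 16, alpha, beta)  # 1 MiB to 16 processes
print(f"predicted broadcast time: {t * 1e3:.3f} ms")
```

Comparing such model predictions against measured times for each candidate algorithm is what lets an MPI library pick the fastest collective implementation per message size and process count, as the abstract describes for FT-MPI.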


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Fault tolerant high performance computing by a coding approach

Zizhong Chen; Graham E. Fagg; Edgar Gabriel; Julien Langou; Thara Angskun; George Bosilca; Jack J. Dongarra

As the number of processors in today's high performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and, therefore, whenever a node fails, have to abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of a floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over a Galois field has been explored theoretically before in diskless checkpointing, few actual implementations exist, probably because of concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
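The floating-point coding idea can be sketched in its simplest form, a single parity process for diskless checkpointing: the parity holds the element-wise sum of all processes' data, so one lost vector can be rebuilt from the parity minus the survivors. This is a toy sketch of the single-failure case only; the paper's Reed-Solomon scheme generalizes to multiple simultaneous failures and must additionally control the round-off error the abstract mentions.

```python
# Toy sketch of diskless checkpointing with a floating-point sum checksum.
# One parity vector allows recovery from a single process failure.

data = {
    0: [1.0, 2.0, 3.0],
    1: [0.5, 0.5, 0.5],
    2: [4.0, -1.0, 2.5],
}

# The parity process stores the element-wise sum of all processes' data.
parity = [sum(col) for col in zip(*data.values())]

# Suppose process 1 fails: reconstruct its vector as parity minus survivors.
failed = 1
survivors = [vec for rank, vec in data.items() if rank != failed]
recovered = [p - sum(col) for p, col in zip(parity, zip(*survivors))]
print(recovered)  # equals process 1's data, up to floating-point round-off
```

Because the checksum is computed in floating-point rather than over a Galois field, recovery is just vector arithmetic, but subtraction of nearly equal sums can lose precision, which is exactly the round-off issue the paper analyzes.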


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 1997

An Extension to MPI for Distributed Computing on MPPs

Thomas Beisel; Edgar Gabriel; Michael M. Resch

We present a tool that allows users to run an MPI application on several MPPs without having to change the application code. PACX (Parallel Computer eXtension) provides the user with a distributed MPI environment offering most of the important functionality of standard MPI. It is therefore well suited for use in metacomputing.


IEEE International Conference on High Performance Computing, Data and Analytics | 2005

Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing

Graham E. Fagg; Edgar Gabriel; Zizhong Chen; Thara Angskun; George Bosilca; Jelena Pješivac-Grbović; Jack J. Dongarra

With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the message passing interface (MPI), the de facto standard for communication in scientific applications, which gives applications the ability to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using the semantics presented above, as well as benchmark results with various applications. An example of a fault-tolerant parallel equation solver, performance results, and the time required to recover from a process failure are also detailed.


Journal of Grid Computing | 2003

Towards Efficient Execution of MPI Applications on the Grid: Porting and Optimization Issues

Rainer Keller; Edgar Gabriel; Bettina Krammer; Matthias S. Müller; Michael M. Resch

The message passing interface (MPI) is a standard used by many parallel scientific applications. It offers the advantage of a smoother migration path for porting applications from high performance computing systems to the Grid. In this paper, Grid-enabled tools and libraries for developing MPI applications are presented. The first is MARMOT, a tool that checks the adherence of an application to the MPI standard. The second is PACX-MPI, an implementation of the MPI standard optimized for Grid environments. Besides efficient development of the program, optimal execution is of paramount importance for most scientific applications. We therefore discuss not only performance at the level of the MPI library, but also several application-specific optimizations, such as latency hiding, prefetching, caching, and topology-aware algorithms, e.g., for a sparse parallel equation solver and an RNA folding code.


European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface | 2009

VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes

Troy P. LeBlanc; Rakhi Anand; Edgar Gabriel; Jaspal Subhlok

The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI, which is designed to enable seamless forward application progress in the presence of frequent node failures as well as dynamically changing network speeds and node execution speeds. Process replication is employed to provide robustness in such volatile environments. The central challenge in the VolpexMPI design is to efficiently and automatically manage a dynamically varying number of process replicas in different states of execution progress. The key fault-tolerance technique employed is fully distributed sender-based logging. The paper presents the design and a prototype implementation of VolpexMPI. Preliminary results validate that the overhead of providing robustness is modest for applications with a favorable ratio of communication to computation and a low degree of communication.
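Sender-based message logging, the fault-tolerance technique named above, can be sketched as each sender keeping its own local log of outgoing messages so that a restarted or lagging replica can be brought up to date by replay. This is a schematic sketch under simplified assumptions (one sender, in-order delivery), not VolpexMPI's actual implementation.

```python
# Schematic sketch of fully distributed sender-based logging:
# each sender logs its own outgoing messages, so no central
# log server or receiver-side checkpoint is needed for replay.

class LoggingSender:
    def __init__(self):
        self.log = []   # list of (dest, seq, payload)
        self.seq = 0

    def send(self, dest, payload, deliver):
        """Log the message locally, then deliver it."""
        self.log.append((dest, self.seq, payload))
        deliver(dest, self.seq, payload)
        self.seq += 1

    def replay(self, dest, from_seq, deliver):
        """Re-deliver logged messages to a restarted replica of `dest`,
        starting from the sequence number it last processed."""
        for d, s, p in self.log:
            if d == dest and s >= from_seq:
                deliver(d, s, p)

inbox = []
sender = LoggingSender()
sender.send("p1", "a", lambda d, s, p: inbox.append(p))
sender.send("p1", "b", lambda d, s, p: inbox.append(p))

# p1 fails and restarts having processed nothing: replay from seq 0.
restarted_inbox = []
sender.replay("p1", 0, lambda d, s, p: restarted_inbox.append(p))
print(restarted_inbox)  # prints ['a', 'b']
```

Because the log lives at the sender, a replica that has fallen behind can be fed exactly the messages it missed without coordinating all processes, which is what makes the scheme attractive when replicas run at different execution speeds.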


Lecture Notes in Computer Science | 2003

Performance Prediction in a Grid Environment

Rosa M. Badia; Francesc Escalé; Edgar Gabriel; Judit Gimenez; Rainer Keller; Jesús Labarta; Matthias S. Müller

Knowing the performance of an application in a Grid environment is an important issue in application development and for scheduling decisions. In this paper we describe the analysis and optimisation of a computation- and communication-intensive application from the field of bioinformatics, which was demonstrated at the HPC Challenge of Supercomputing 2002 in Baltimore. This application has been adapted to run on a heterogeneous computational Grid by means of PACX-MPI. The analysis and optimisation are based on trace-driven tools, mainly Dimemas and Vampir. All these methodologies and tools are being extended within the framework of the DAMIEN IST project.


Journal of Parallel and Distributed Computing | 2008

Fault tolerant algorithms for heat transfer problems

Hatem Ltaief; Edgar Gabriel; Marc Garbey

With the emergence of new massively parallel systems in the high performance computing area allowing scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. In order for a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures may be fairly straightforward for elliptic and linear hyperbolic problems. However, reversibility in time for parabolic problems appears to be the most challenging part, because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to numerically reconstruct the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As the fault-tolerant communication library, we use the fault-tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and keep running after failures, adding only a very small penalty to the overall execution time.
