Publication


Featured research published by Al Geist.


Computers in Physics | 1995

PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing

Al Geist; Adam Beguelin; Jack J. Dongarra; Weicheng Jiang; Robert Manchek; Vaidy S. Sunderam

Part 1 Introduction: heterogeneous network computing; trends in distributed computing; PVM overview; other packages. Part 2 The PVM system. Part 3 Using PVM: how to obtain the PVM software; setup to use PVM; setup summary; starting PVM; common startup problems; running PVM programs; PVM console details; host file options. Part 4 Basic programming techniques: common parallel programming paradigms; workload allocation; porting existing applications to PVM. Part 5 PVM user interface: process control; information; dynamic configuration; signalling; setting and getting options; message passing; dynamic process groups. Part 6 Program examples: fork-join; dot product; failure; matrix multiply; one-dimensional heat equation. Part 7 How PVM works: components; messages; PVM daemon; libpvm library; protocols; message routing; task environment; console program; resource limitations; multiprocessor systems. Part 8 Advanced topics: XPVM; porting PVM to new architectures. Part 9 Troubleshooting: getting PVM installed; getting PVM running; compiling applications; running applications; debugging and tracing; debugging the system. Appendices: history of PVM versions; PVM 3 routines.
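The fork-join and dot-product examples of Part 6 follow a spawn/send/receive pattern. As a minimal sketch of that paradigm (using Python's standard threading and queue modules rather than the PVM C API; all function names here are illustrative, not PVM's):

```python
import queue
import threading

def worker(rank, chunk, outbox):
    # Each spawned task computes a partial dot product and "sends" the
    # result back over a queue, standing in for PVM-style message passing.
    outbox.put(sum(a * b for a, b in chunk))

def parallel_dot(x, y, ntasks=4):
    outbox = queue.Queue()
    # Deal out the vectors round-robin, one chunk per task.
    chunks = [list(zip(x[i::ntasks], y[i::ntasks])) for i in range(ntasks)]
    tasks = [threading.Thread(target=worker, args=(r, chunks[r], outbox))
             for r in range(ntasks)]
    for t in tasks:
        t.start()            # "fork": spawn one task per chunk
    for t in tasks:
        t.join()             # "join": wait for all tasks to finish
    return sum(outbox.get() for _ in range(ntasks))
```

In PVM itself the spawn and the message exchange would cross machine boundaries on a heterogeneous network; the fork-join structure is the same.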


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

The International Exascale Software Project roadmap

Jack J. Dongarra; Pete Beckman; Terry Moore; Patrick Aerts; Giovanni Aloisio; Jean Claude Andre; David Barkai; Jean Yves Berthou; Taisuke Boku; Bertrand Braunschweig; Franck Cappello; Barbara M. Chapman; Xuebin Chi; Alok N. Choudhary; Sudip S. Dosanjh; Thom H. Dunning; Sandro Fiore; Al Geist; Bill Gropp; Robert J. Harrison; Mark Hereld; Michael A. Heroux; Adolfy Hoisie; Koh Hotta; Zhong Jin; Yutaka Ishikawa; Fred Johnson; Sanjay Kale; R.D. Kenway; David E. Keyes

Over the last 20 years, the open-source community has provided more and more software on which the world's high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Toward Exascale Resilience

Franck Cappello; Al Geist; Bill Gropp; Laxmikant V. Kalé; Bill Kramer; Marc Snir

Over the past few years resilience has became a major issue for high-performance computing (HPC) systems, in particular in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several millions of central processing unit (CPU) cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint/ restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, which are possibly radically disruptive, to run applications until their normal termination, despite the essentially unstable nature of exascale systems. Yet, the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations and research issues considered as determinant of several complimentary experts of HPC in applications, programming models, distributed systems and system management.
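The abstract's central claim, that checkpoint time will exceed the system's mean time to failure, can be made concrete with Young's classic first-order formula for the optimal checkpoint interval (a standard result, not taken from this paper; the numbers below are illustrative assumptions):

```python
import math

def young_interval(checkpoint_secs, mtbf_secs):
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_secs * mtbf_secs)

def overhead_fraction(checkpoint_secs, interval_secs):
    """Fraction of wall-clock time spent writing checkpoints."""
    return checkpoint_secs / (interval_secs + checkpoint_secs)

# Petascale-era assumption: 30 min to write a checkpoint, 1 day MTBF.
tau_peta = young_interval(1800, 86400)   # ~17636 s, i.e. checkpoint every ~5 h

# Exascale projection: same 30 min checkpoint, but MTBF shrinks to 1 hour.
tau_exa = young_interval(1800, 3600)     # 3600 s: a third of all time is
frac = overhead_fraction(1800, tau_exa)  # spent checkpointing, and it only
                                         # gets worse as MTBF keeps falling
```

As the checkpoint cost approaches the MTBF itself, the overhead fraction approaches one and the machine makes no forward progress, which is exactly why the authors call for radically different approaches.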


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Addressing failures in exascale computing

Marc Snir; Robert W. Wisniewski; Jacob A. Abraham; Sarita V. Adve; Saurabh Bagchi; Pavan Balaji; Jim Belak; Pradip Bose; Franck Cappello; Bill Carlson; Andrew A. Chien; Paul W. Coteus; Nathan DeBardeleben; Pedro C. Diniz; Christian Engelmann; Mattan Erez; Saverio Fazzari; Al Geist; Rinku Gupta; Fred Johnson; Sriram Krishnamoorthy; Sven Leyffer; Dean A. Liberty; Subhasish Mitra; Todd S. Munson; Rob Schreiber; Jon Stearley; Eric Van Hensbergen

We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.


European Conference on Parallel Processing | 1996

MPI-2: Extending the Message-Passing Interface

Al Geist; William Gropp; Steven Huss-Lederman; Andrew Lumsdaine; Ewing L. Lusk; William Saphir; Tony Skjellum; Marc Snir

This paper describes current activities of the MPI-2 Forum. The MPI-2 Forum is a group of parallel computer vendors, library writers, and application specialists working together to define a set of extensions to MPI (the Message Passing Interface). MPI was defined by the same process and now has many implementations, both vendor-proprietary and publicly available, for a wide variety of parallel computing environments. In this paper we present the salient aspects of the evolving MPI-2 document as it now stands. We discuss proposed extensions and enhancements to MPI in the areas of dynamic process management, one-sided operations, collective operations, new language bindings, real-time computing, external interfaces, and miscellaneous topics.


Distributed and Parallel Databases | 2002

RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets

Nagiza F. Samatova; George Ostrouchov; Al Geist; Anatoli V. Melechko

This paper presents a hierarchical clustering method named RACHET (Recursive Agglomeration of Clustering Hierarchies by Encircling Tactic) for analyzing multi-dimensional distributed data. A typical clustering algorithm requires bringing all the data into a centralized warehouse. This results in O(nd) transmission cost, where n is the number of data points and d is the number of dimensions. For large datasets, this is prohibitively expensive. In contrast, RACHET runs with at most O(n) time, space, and communication costs to build a global hierarchy of comparable clustering quality by merging locally generated clustering hierarchies. RACHET employs the encircling tactic, in which the merges at each stage are chosen so as to minimize the volume of a covering hypersphere. For each cluster centroid, RACHET maintains descriptive statistics of constant complexity to enable these choices. RACHET's framework is applicable to a wide class of centroid-based hierarchical clustering algorithms, such as centroid, medoid, and Ward.
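The encircling tactic can be sketched as follows. This is an illustrative reconstruction, not the paper's exact statistics: each cluster is summarized by constant-size descriptive statistics (count, centroid, covering radius), and a merge step picks the pair whose covering hypersphere is smallest:

```python
import math

def merge_clusters(c1, c2):
    """Merge two cluster summaries (n, centroid, radius) into one whose
    hypersphere covers both children, without revisiting the raw points."""
    n1, mu1, r1 = c1
    n2, mu2, r2 = c2
    n = n1 + n2
    # Weighted mean of the child centroids.
    mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]

    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # Radius large enough to enclose each child's entire hypersphere.
    r = max(dist(mu, mu1) + r1, dist(mu, mu2) + r2)
    return (n, mu, r)

def best_merge(clusters):
    """Greedy agglomeration step: choose the pair of cluster summaries
    whose merged covering hypersphere has the smallest radius."""
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return min(pairs,
               key=lambda p: merge_clusters(clusters[p[0]], clusters[p[1]])[2])
```

Because only (n, centroid, radius) travels between sites, the communication cost stays linear in the number of points rather than O(nd) for the raw data.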


Conference on High Performance Computing (Supercomputing) | 1997

Scalable Networked Information Processing Environment (SNIPE)

Graham E. Fagg; Keith Moore; Jack J. Dongarra; Al Geist

SNIPE is a metacomputing system that aims to provide a reliable, secure, fault-tolerant environment for long-term distributed computing applications and data stores across the global Internet. The system combines global naming and replication of both processing and data to support large-scale information processing applications, leading to better availability and reliability than is currently possible with typical cluster computing and/or distributed computing environments. To facilitate this, the system supports distributed data collection, distributed computation, distributed control and resource management, distributed output, and process migration. The underlying system supports multiple communication paths, media, and routing methods to aid performance and robustness across both local and global networks.


International Conference on Parallel Processing | 2009

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

Rinku Gupta; Peter H. Beckman; Byung-Hoon Park; Ewing L. Lusk; Paul Hargrove; Al Geist; Dhabaleswar K. Panda; Andrew Lumsdaine; Jack J. Dongarra

Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked insularly and independently and information about faults is rarely shared. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with each other and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive low-overhead capability of CIFTS that lets applications run with minimal performance degradation.
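The Fault Tolerance Backplane is, at its core, a publish/subscribe channel for fault events shared across the software stack. A minimal sketch of that idea, with a hypothetical class and event names that are not CIFTS's actual interface specification:

```python
from collections import defaultdict

class FaultToleranceBackplane:
    """Toy FTB-style backplane: components publish fault events, and any
    subscribed layer (library, middleware, application) is notified, so
    faults are shared rather than handled insularly by each component."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        # A component registers interest in one class of fault event.
        self.subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        # A fault detected anywhere is pushed to every interested layer.
        for callback in self.subscribers[event_type]:
            callback(payload)

# E.g. an MPI library learns of a node failure reported by the file system
# and can route future communication around it.
ftb = FaultToleranceBackplane()
failed = []
ftb.subscribe("node.failure", lambda event: failed.append(event["node"]))
ftb.publish("node.failure", {"node": "n042", "source": "pvfs"})
# failed is now ["n042"]
```

The real backplane adds a stack-wide event namespace and transport so that producers and consumers on different nodes need not know about each other, which is what lets MPICH2, MVAPICH, Open MPI, and PVFS plug in independently.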


Dependable Systems and Networks | 2009

System log pre-processing to improve failure prediction

Ziming Zheng; Zhiling Lan; Byung-Hoon Park; Al Geist

Log preprocessing, a process applied to the raw log before a predictive method is used, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rates, they fail to preserve important failure patterns that are crucial for failure analysis. To address this problem, we present a log preprocessing method consisting of three integrated steps: (1) event categorization, to uniformly classify system events and identify fatal events; (2) event filtering, to remove temporally and spatially redundant records while preserving the failure patterns necessary for failure analysis; and (3) causality-related filtering, to combine correlated events through Apriori association rule mining. We demonstrate the effectiveness of our preprocessing method using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method preserves more failure patterns for failure analysis, thereby improving failure prediction by up to 174%.
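The temporal-redundancy part of step (2) can be sketched as a windowed de-duplication pass. This is an illustrative reconstruction under an assumed record format of (timestamp, location, event_id), not the paper's exact algorithm:

```python
def temporal_filter(events, window):
    """Drop records that repeat the same (location, event_id) within
    `window` seconds, keeping the first occurrence of each burst.
    `events` is assumed sorted by timestamp."""
    last_seen = {}
    kept = []
    for ts, location, event_id in events:
        key = (location, event_id)
        if key not in last_seen or ts - last_seen[key] > window:
            kept.append((ts, location, event_id))
        # Updating the timestamp even for dropped records collapses a
        # long burst into one record instead of one per window.
        last_seen[key] = ts
    return kept
```

The catch the paper addresses is that filtering too aggressively, e.g. across locations or across causally related event types, destroys exactly the patterns a failure predictor needs, hence the separate categorization and causality-aware steps.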


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

Major Computer Science Challenges At Exascale

Al Geist; Robert F. Lucas

Exascale systems will provide an unprecedented opportunity for science, one that will make it possible to use computation not only as a critical tool along with theory and experiment in understanding the behavior of the fundamental components of nature, but also for critical advances for the nation’s energy needs and security. To create exascale systems and software that will enable the US Department of Energy (DOE) to meet the science goals critical to the nation’s energy, ecological sustainability, and global security, we must focus on major architecture, software, algorithm, and data challenges, and build on newly emerging programming environments. Only with this new infrastructure will applications be able to scale up to the required levels of parallelism and integrate technologies into complex coupled systems for real-world multidisciplinary modeling and simulation. Achieving this goal will likely involve a shift from current static approaches for application development and execution to a combination of new software tools, algorithms, and dynamically adaptive methods.

Collaboration

Top co-authors of Al Geist:

Nagiza F. Samatova (North Carolina State University)
Adam Beguelin (Carnegie Mellon University)
Byung-Hoon Park (Oak Ridge National Laboratory)
Stephen L. Scott (Oak Ridge National Laboratory)
George Ostrouchov (Oak Ridge National Laboratory)
Christian Engelmann (Oak Ridge National Laboratory)