Publication


Featured research published by Patrick G. Bridges.


IEEE International Conference on High Performance Computing, Data and Analytics | 2011

Evaluating the viability of process replication reliability for exascale systems

Kurt Brian Ferreira; Jon Stearley; James H. Laros; Ron A. Oldfield; Kevin Pedretti; Ronald B. Brightwell; Rolf Riesen; Patrick G. Bridges; Dorian C. Arnold

As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution. Replicated computing techniques, particularly state machine replication, long used in distributed and mission-critical systems, have been suggested as an alternative to checkpoint-restart. In this paper, we evaluate the viability of using state machine replication as the primary fault tolerance mechanism for upcoming exascale systems. We use a combination of modeling, empirical analysis, and simulation to study the costs and benefits of this approach in comparison to checkpoint/restart on a wide range of system parameters. These results, which cover different failure distributions, hardware mean times to failure, and I/O bandwidths, show that state machine replication is a potentially useful technique for meeting the fault tolerance demands of HPC applications on future exascale platforms.
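
A rough sense of the trade-off can be had from the classic first-order checkpoint model. The C sketch below compares checkpoint/restart efficiency against dual redundancy using Young's optimal checkpoint interval; the checkpoint time, per-node MTBF, and the MTBF boost assumed for replication are illustrative assumptions, not the paper's models or measurements.

    /* Back-of-the-envelope comparison in the spirit of the study above,
     * using Young's first-order optimal checkpoint interval. All parameter
     * values are illustrative assumptions, not figures from the paper. */
    #include <math.h>
    #include <stdio.h>

    /* Fraction of machine time lost to checkpoint writes plus expected rework
     * when a checkpoint takes delta seconds and the system MTBF is M seconds.
     * The first-order model is only rough once this approaches 1. */
    static double ckpt_waste(double delta, double M)
    {
        double tau = sqrt(2.0 * delta * M);       /* Young's optimal interval */
        return delta / tau + tau / (2.0 * M);
    }

    int main(void)
    {
        double delta = 600.0;                     /* 10-minute checkpoint write */
        double node_mtbf = 5.0 * 365.0 * 86400.0; /* 5-year per-node MTBF */
        for (int nodes = 10000; nodes <= 160000; nodes *= 2) {
            double M = node_mtbf / nodes;         /* system MTBF shrinks with size */
            double cr = fmax(0.0, 1.0 - ckpt_waste(delta, M));
            /* Dual redundancy: half the nodes do redundant work, but failures
             * that interrupt the job become rare; the 100x MTBF boost below is
             * an illustrative assumption. */
            double rep = 0.5 * (1.0 - ckpt_waste(delta, 100.0 * M));
            printf("%6d nodes: checkpoint/restart efficiency %.2f, replication %.2f\n",
                   nodes, cr, rep);
        }
        return 0;
    }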


IEEE International Conference on High Performance Computing, Data and Analytics | 2008

Characterizing application sensitivity to OS interference using kernel-level noise injection

Kurt B. Ferreira; Patrick G. Bridges; Ron Brightwell

Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how this impact changes with application scale. For example, our results show that 2.5% net processor noise at 10,000 nodes can have no impact or can result in over a factor of 20 slowdown for the same application, depending solely on how the noise is generated. We also discuss how characteristics of the applications we studied, for example computation/communication ratios and collective communication sizes, relate to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings on the design of new operating systems, middleware, and other system services for high-end parallel systems.
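
The frequency/duration distinction is easy to see in a toy injector. The C sketch below delivers the same 2.5% net noise either as a few long interruptions or as many short ones; it runs at user level, unlike the paper's kernel-based injector in Catamount, and only illustrates the pattern being varied.

    /* User-level illustration of the frequency/duration trade-off: the same
     * net noise (duty cycle) delivered as few long or many short events. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Steal roughly `net` of the CPU (e.g. 0.025 for 2.5%) as noise events of
     * frequency `hz`, for `seconds` of wall-clock time. */
    static void inject_noise(double hz, double net, double seconds)
    {
        double period = 1.0 / hz;
        double burst  = net * period;              /* per-event noise duration */
        double end    = now() + seconds;
        while (now() < end) {
            double t0 = now();
            while (now() - t0 < burst)             /* the "noise": burn cycles */
                ;
            struct timespec idle = { 0, (long)((period - burst) * 1e9) };
            nanosleep(&idle, NULL);                /* stay quiet for the rest */
        }
    }

    int main(void)
    {
        printf("2.5%% net noise at 10 Hz (2.5 ms bursts)\n");
        inject_noise(10.0, 0.025, 2.0);
        printf("2.5%% net noise at 1000 Hz (25 us bursts)\n");
        inject_noise(1000.0, 0.025, 2.0);
        return 0;
    }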


International Parallel and Distributed Processing Symposium | 2010

Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing

John R. Lange; Kevin Pedretti; Trammell Hudson; Peter A. Dinda; Zheng Cui; Lei Xia; Patrick G. Bridges; Andy Gocke; Steven Jaconette; Michael J. Levenhagen; Ron Brightwell

Palacios is a new open-source VMM under development at Northwestern University and the University of New Mexico that enables applications executing in a virtualized environment to achieve scalable high performance on large machines. Palacios functions as a modularized extension to Kitten, a high performance operating system being developed at Sandia National Laboratories to support large-scale supercomputing applications. Together, Palacios and Kitten provide a thin layer over the hardware to support full-featured virtualized environments alongside Kitten's lightweight native environment. Palacios supports existing, unmodified applications and operating systems by using the hardware virtualization technologies in recent AMD and Intel processors. Additionally, Palacios leverages Kitten's simple memory management scheme to enable low-overhead pass-through of native devices to a virtualized environment. We describe the design, implementation, and integration of Palacios and Kitten. Our benchmarks show that Palacios provides near-native (within 5%), scalable performance for virtualized environments running important parallel applications. This new architecture provides an incremental path for applications to use supercomputers running specialized lightweight host operating systems, without significant performance compromise.
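
As a concrete aside, the hardware virtualization support mentioned above is advertised through CPUID. The standalone C check below (not Palacios code) shows how a VMM can detect Intel VT-x and AMD SVM before relying on them; the leaf and bit positions are the documented CPUID encodings.

    /* Detect the hardware virtualization extensions a VMM like Palacios
     * depends on. Requires an x86 machine and GCC/Clang's <cpuid.h>. */
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;

        /* Intel VT-x: CPUID leaf 1, ECX bit 5 (VMX). */
        int vmx = __get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 5));

        /* AMD SVM: CPUID leaf 0x80000001, ECX bit 2 (SVM). */
        int svm = __get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 2));

        printf("Intel VT-x (VMX): %s\n", vmx ? "available" : "not reported");
        printf("AMD-V (SVM):      %s\n", svm ? "available" : "not reported");
        return 0;
    }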


IEEE/ACM Transactions on Networking | 2007

A configurable and extensible transport protocol

Patrick G. Bridges; Gary Wong; Matti A. Hiltunen; Richard D. Schlichting; Matthew J. Barrick

The ability to configure transport protocols from collections of smaller software modules allows the characteristics of the protocol to be customized for a specific application or network technology. This paper describes a configurable transport protocol system called CTP in which microprotocols implementing individual attributes of transport can be combined into a composite protocol that realizes the desired overall functionality. In addition to describing the overall architecture of CTP and its microprotocols, this paper also presents experiments on both local-area and wide-area platforms that illustrate the flexibility of CTP and how its ability to match application needs more closely can result in better application performance. The prototype implementation of CTP has been built using the C version of the Cactus microprotocol composition framework running on Linux.
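
The composition idea can be sketched in a few lines of C: microprotocols are handlers bound to named events, and the composite protocol raises events that every bound handler sees. This illustrates the general event-driven composition style only; it is not the actual Cactus or CTP API, and the event names and handlers below are made up.

    /* Toy event-based composition: "microprotocols" register handlers for
     * named events; the composite protocol dispatches each event to all of
     * its bound handlers. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_BINDINGS 16

    typedef void (*handler_fn)(const char *segment);

    struct binding { const char *event; handler_fn fn; };

    static struct binding bindings[MAX_BINDINGS];
    static int nbindings;

    static void bind(const char *event, handler_fn fn)
    {
        bindings[nbindings].event = event;
        bindings[nbindings].fn = fn;
        nbindings++;
    }

    static void raise_event(const char *event, const char *segment)
    {
        for (int i = 0; i < nbindings; i++)
            if (strcmp(bindings[i].event, event) == 0)
                bindings[i].fn(segment);
    }

    /* Two toy "microprotocols", each adding one transport attribute. */
    static void checksum_verify(const char *seg) { printf("Checksum: verify %s\n", seg); }
    static void reliable_ack(const char *seg)    { printf("ReliableDelivery: ack %s\n", seg); }

    int main(void)
    {
        /* Configuring the composite protocol = choosing which microprotocols
         * bind to which events. */
        bind("segment_arrived", checksum_verify);
        bind("segment_arrived", reliable_ack);
        raise_event("segment_arrived", "seq=42");
        return 0;
    }

Configuring a different transport variant then amounts to changing which handlers are bound, which is the kind of per-application customization the abstract describes.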


ACM Special Interest Group on Data Communication | 1996

Analysis of techniques to improve protocol processing latency

David Mosberger; Larry L. Peterson; Patrick G. Bridges; Sean W. O'Malley

This paper describes several techniques designed to improve protocol latency, and reports on their effectiveness when measured on a modern RISC machine employing the DEC Alpha processor. We found that the memory system, which has long been known to dominate network throughput, is also a key factor in protocol latency. As a result, improving instruction cache effectiveness can greatly reduce protocol processing overheads. An important metric in this context is memory cycles per instruction (mCPI), the average number of cycles that an instruction stalls waiting for a memory access to complete. The techniques presented in this paper reduce the mCPI by a factor of 1.35 to 5.8. In analyzing the effectiveness of the techniques, we also present a detailed study of the protocol processing behavior of two protocol stacks, TCP/IP and RPC, on a modern RISC processor.
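
The role of mCPI can be made concrete with a little arithmetic: protocol-processing latency is roughly instructions x (core CPI + mCPI) / clock rate. The C sketch below applies the paper's reported 1.35x and 5.8x mCPI reductions to an illustrative instruction count, core CPI, and clock rate; those baseline numbers are assumptions, not measurements from the paper.

    /* Illustrative split of protocol latency into core cycles and memory
     * stalls, and the effect of shrinking the memory-stall component. */
    #include <stdio.h>

    int main(void)
    {
        double insns    = 10000.0;   /* instructions on the protocol path (assumed) */
        double cpi_core = 1.2;       /* cycles per instruction, ignoring stalls     */
        double mcpi     = 2.0;       /* memory-stall cycles per instruction         */
        double freq_mhz = 300.0;     /* clock frequency (assumed)                   */

        double base_us = insns * (cpi_core + mcpi) / freq_mhz;  /* microseconds */
        printf("baseline latency: %.1f us\n", base_us);

        /* Effect of reducing mCPI by the paper's reported factors. */
        double factors[] = { 1.35, 5.8 };
        for (int i = 0; i < 2; i++) {
            double us = insns * (cpi_core + mcpi / factors[i]) / freq_mhz;
            printf("mCPI / %.2f -> %.1f us (%.0f%% of baseline)\n",
                   factors[i], us, 100.0 * us / base_us);
        }
        return 0;
    }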


International Parallel and Distributed Processing Symposium | 2006

Infiniband scalability in Open MPI

Galen M. Shipman; Timothy S. Woodall; Richard L. Graham; Arthur B. Maccabe; Patrick G. Bridges

InfiniBand is becoming an important interconnect technology in high performance computing. Efforts in large-scale InfiniBand deployments are raising scalability questions in the HPC community. Open MPI, a new open-source implementation of the MPI standard targeted for production computing, provides several mechanisms to enhance InfiniBand scalability. Initial comparisons with MVAPICH, the most widely used InfiniBand MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small-message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.
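
One general reason per-host memory matters at scale is that per-peer receive buffering grows linearly with job size. The C sketch below contrasts that growth with a fixed shared receive pool; it illustrates the scaling concern only in broad strokes, is not a description of Open MPI's actual buffer management, and its buffer counts and sizes are assumptions.

    /* Memory for pre-posted receive buffers: dedicated per-peer buffers grow
     * with the number of peers, a single shared pool does not. */
    #include <stdio.h>

    int main(void)
    {
        int    bufs_per_peer = 8;            /* pre-posted receives per connection (assumed) */
        int    shared_bufs   = 2048;         /* size of one shared receive pool (assumed)    */
        double buf_kib       = 16.0;         /* receive buffer size in KiB (assumed)         */

        for (int peers = 64; peers <= 4096; peers *= 4) {
            double per_peer_mib = peers * bufs_per_peer * buf_kib / 1024.0;
            double shared_mib   = shared_bufs * buf_kib / 1024.0;
            printf("%5d peers: per-peer buffers %8.1f MiB, shared pool %6.1f MiB\n",
                   peers, per_peer_mib, shared_mib);
        }
        return 0;
    }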


Virtual Execution Environments | 2011

Minimal-overhead virtualization of a large scale supercomputer

John R. Lange; Kevin Pedretti; Peter A. Dinda; Patrick G. Bridges; Chang Bae; Philip Soltero; Alexander Merritt

Virtualization has the potential to dramatically increase the usability and reliability of high performance computing (HPC) systems. However, this potential will remain unrealized unless overheads can be minimized. This is particularly challenging on large scale machines that run carefully crafted HPC OSes supporting tightly-coupled, parallel applications. In this paper, we show how careful use of hardware and VMM features enables the virtualization of a large-scale HPC system, specifically a Cray XT4 machine, with <= 5% overhead on key HPC applications, microbenchmarks, and guests at scales of up to 4096 nodes. We describe three techniques essential for achieving such low overhead: passthrough I/O, workload-sensitive selection of paging mechanisms, and carefully controlled preemption. These techniques are forms of symbiotic virtualization, an approach on which we elaborate.
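
The "workload-sensitive selection of paging mechanisms" reflects a real trade-off: nested paging avoids VM exits on guest page-table updates but lengthens TLB-miss handling, while shadow paging does the reverse. The C sketch below encodes that trade-off as a toy heuristic; the counters and threshold are invented for illustration and are not the policy Palacios implements.

    /* Toy policy: pick a paging mechanism from coarse per-interval counters. */
    #include <stdio.h>

    enum paging_mode { SHADOW_PAGING, NESTED_PAGING };

    static enum paging_mode choose_paging(unsigned long pt_updates,
                                          unsigned long tlb_misses)
    {
        /* Many guest page-table updates favor nested paging (no exit per
         * update); a TLB-miss-dominated, mostly static working set favors
         * shadow paging (shorter walks once the shadow tables are warm).
         * The 4x weight is an arbitrary illustrative threshold. */
        return (pt_updates * 4 > tlb_misses) ? NESTED_PAGING : SHADOW_PAGING;
    }

    int main(void)
    {
        printf("fork/exec-heavy guest:  %s\n",
               choose_paging(500000, 100000) == NESTED_PAGING ? "nested" : "shadow");
        printf("static HPC working set: %s\n",
               choose_paging(2000, 900000) == NESTED_PAGING ? "nested" : "shadow");
        return 0;
    }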


IEEE Computer | 1999

Joust: a platform for liquid software

John J. Hartman; Peter A. Bigot; Patrick G. Bridges; Brady Montz; Rob Piltz; Oliver Spatscheck; Todd A. Proebsting; Larry L. Peterson; Andy C. Bavier

The authors describe a Java-based platform for liquid software, called Joust, that is specifically designed to support low-level, communication-oriented systems and to avoid the limitations of general-purpose OSs. The authors contrast the platform requirements for communication-oriented liquid software with those of computation-oriented software, identify the limitations of current platforms, and outline the benefits of Joust. They also offer an overview of Scout (the underlying OS upon which Joust is built), its runtime system, and its just-in-time (JIT) compiler.


EuroMPI '11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011

libhashckpt: hash-based incremental checkpointing using GPU's

Kurt Brian Ferreira; Rolf Riesen; Ron Brightwell; Patrick G. Bridges; Dorian C. Arnold

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
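
The core of hash-based incremental checkpointing is simple to sketch: hash each block of application state and save only blocks whose hash changed since the last checkpoint. The CPU-only C sketch below uses an FNV-1a hash over 4 KiB blocks; libhashckpt itself combines page protection with hashing on GPUs, so this is only the underlying idea, not the library's interface.

    /* Minimal incremental checkpoint: save only blocks whose hash changed. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK 4096

    static uint64_t fnv1a(const unsigned char *p, size_t n)
    {
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < n; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Write only the blocks of `data` whose hash differs from `old_hash`;
     * returns the number of blocks saved. */
    static size_t checkpoint(const unsigned char *data, size_t nblocks,
                             uint64_t *old_hash, FILE *out)
    {
        size_t saved = 0;
        for (size_t b = 0; b < nblocks; b++) {
            uint64_t h = fnv1a(data + b * BLOCK, BLOCK);
            if (h != old_hash[b]) {
                fwrite(&b, sizeof b, 1, out);            /* block index    */
                fwrite(data + b * BLOCK, BLOCK, 1, out); /* block contents */
                old_hash[b] = h;
                saved++;
            }
        }
        return saved;
    }

    int main(void)
    {
        size_t nblocks = 256;
        unsigned char *state = calloc(nblocks, BLOCK);
        uint64_t *hashes = calloc(nblocks, sizeof *hashes);
        FILE *out = fopen("ckpt.dat", "wb");
        if (!state || !hashes || !out)
            return 1;                                    /* sketch-level error handling */

        printf("first checkpoint:  %zu blocks\n", checkpoint(state, nblocks, hashes, out));
        memset(state + 10 * BLOCK, 0xab, 3 * BLOCK);     /* "application" dirties 3 blocks */
        printf("second checkpoint: %zu blocks\n", checkpoint(state, nblocks, hashes, out));

        fclose(out);
        free(state);
        free(hashes);
        return 0;
    }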


International Conference on Parallel Processing | 2011

Cooperative Application/OS DRAM fault recovery

Patrick G. Bridges; Mark Hoemmen; Kurt B. Ferreira; Michael A. Heroux; Philip Soltero; Ron Brightwell

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
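
One plausible user-level half of such a cross-layer scheme is sketched below, under the assumption that the OS delivers consumed DRAM errors as SIGBUS with code BUS_MCEERR_AR (as Linux does): the handler records which address was lost and unwinds to a recovery point where a fault-tolerant solver could reconstruct the affected block instead of aborting. This illustrates the idea only; it is not the framework or the Trilinos integration described in the paper.

    /* Catch an uncorrected memory error and hand control to a recovery path
     * instead of dying. Linux-specific; recovery itself is left as a stub. */
    #define _GNU_SOURCE
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static sigjmp_buf recovery_point;
    static void *volatile poisoned_addr;     /* which data the error destroyed */

    static void mce_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
    #ifdef BUS_MCEERR_AR
        if (info->si_code == BUS_MCEERR_AR) {
            poisoned_addr = info->si_addr;
            siglongjmp(recovery_point, 1);   /* unwind to the recovery path */
        }
    #endif
        _exit(1);                            /* any other SIGBUS stays fatal */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = mce_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(recovery_point, 1) == 0) {
            /* The iterative solver would run here; a consumed DRAM error
             * anywhere in its working set lands in mce_handler above. */
            printf("solver running with memory-error recovery armed\n");
        } else {
            /* Recovery path: a fault-tolerant solver would mark the block
             * containing poisoned_addr invalid, reconstruct it from redundant
             * information, and resume rather than restart. */
            printf("uncorrected error at %p; reconstructing affected block\n",
                   poisoned_addr);
        }
        return 0;
    }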

Collaboration


Dive into Patrick G. Bridges's collaborations.

Top Co-Authors

Kurt Brian Ferreira, Sandia National Laboratories
Scott Levy, University of New Mexico
Kevin Pedretti, Sandia National Laboratories
Ron Brightwell, Sandia National Laboratories