
Publications


Featured research published by James S. Plank.


Software: Practice and Experience | 1997

A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems

James S. Plank

It is well-known that Reed-Solomon codes may be used to provide error correction for multiple failures in RAID-like systems. The coding technique itself, however, is not as well-known. To the coding theorist, this technique is a straightforward extension to a basic coding paradigm and needs no special mention. However, to the systems programmer with no training in coding theory, the technique may be a mystery. Currently, there are no references that describe how to perform this coding that do not assume that the reader is already well-versed in algebra and coding theory. This paper is intended for the systems programmer. It presents a complete specification of the coding algorithm plus details on how it may be implemented. This specification assumes no prior knowledge of algebra or coding theory. The goal of this paper is for a systems programmer to be able to implement Reed-Solomon coding for reliability in RAID-like systems without needing to consult any external references.
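To make the idea concrete, here is a minimal sketch of Reed-Solomon coding for a RAID-like system in the spirit of this tutorial: three data devices, two checksum devices, arithmetic in GF(2^8). The field size, device counts, coding rows, and variable names are illustrative choices, not taken from the paper itself.

```python
# --- GF(2^8) arithmetic via log/antilog tables (primitive poly 0x11d) ---
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    if a == 0:
        return 0
    return EXP[(LOG[a] - LOG[b]) % 255]

# --- Encoding: each checksum word is a linear combination of the data
# words; addition in GF(2^w) is XOR.
# c0 = d0 ^ d1 ^ d2          (coding row [1, 1, 1])
# c1 = 1*d0 ^ 2*d1 ^ 3*d2    (coding row [1, 2, 3])
def encode(d):
    c0 = d[0] ^ d[1] ^ d[2]
    c1 = gf_mul(1, d[0]) ^ gf_mul(2, d[1]) ^ gf_mul(3, d[2])
    return [c0, c1]

data = [0x0b, 0xad, 0x37]
c0, c1 = encode(data)

# --- Recovery: data devices 0 and 1 fail.  Move the surviving data
# term to the left side and solve the 2x2 system in GF(2^8):
#   s0 = c0 ^ d2        = 1*d0 ^ 1*d1
#   s1 = c1 ^ 3*d2      = 1*d0 ^ 2*d1
s0 = c0 ^ data[2]
s1 = c1 ^ gf_mul(3, data[2])
det = gf_mul(1, 2) ^ gf_mul(1, 1)            # determinant of [[1,1],[1,2]]
d0 = gf_div(gf_mul(s0, 2) ^ gf_mul(s1, 1), det)
d1 = gf_div(s1 ^ s0, det)
assert [d0, d1] == data[:2]
```

The same pattern scales to n data and m checksum devices: any m simultaneous erasures are repaired by inverting the corresponding m-by-m submatrix of the coding matrix, provided the matrix is chosen so that all such submatrices are invertible.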


IEEE International Conference on High Performance Computing, Data, and Analytics | 2001

Analyzing Market-Based Resource Allocation Strategies for the Computational Grid

Richard Wolski; James S. Plank; John Brevik; Todd Bryan

In this paper, the authors investigate G-commerce—computational economies for controlling resource allocation in computational Grid settings. They define hypothetical resource consumers (representing users and Grid-aware applications) and resource producers (representing resource owners who “sell” their resources to the Grid). The authors then measure the efficiency of resource allocation under two different market conditions—commodities markets and auctions—and compare both market strategies in terms of price stability, market equilibrium, consumer efficiency, and producer efficiency. The results indicate that commodities markets are a better choice for controlling Grid resources than previously defined auction strategies.


IEEE Transactions on Parallel and Distributed Systems | 1998

Diskless checkpointing

James S. Plank; Kai Li; Michael A. Puening

Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
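The basic scheme the paper presents can be sketched in a few lines: each application processor keeps its checkpoint in local memory, and one extra checkpoint processor stores only the bitwise XOR (parity) of all of them, so a single failed processor's state can be rebuilt without any stable storage. The processor count and checkpoint contents below are invented for illustration.

```python
from functools import reduce

def xor_bytes(a, b):
    # Bitwise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

# In-memory checkpoints of three application processors (equal length).
checkpoints = [b"state-A1", b"state-B2", b"state-C3"]

# The checkpoint processor computes and holds only the parity.
parity = reduce(xor_bytes, checkpoints)

# Processor 1 fails: rebuild its checkpoint from the survivors plus the
# parity, with no disk involved.
survivors = [ckpt for i, ckpt in enumerate(checkpoints) if i != 1]
recovered = reduce(xor_bytes, survivors, parity)
assert recovered == checkpoints[1]
```

The variants the paper evaluates trade memory for resilience along this same axis, e.g. replacing simple parity with stronger erasure codes to survive multiple simultaneous failures.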


Network Computing and Applications | 2006

Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications

James S. Plank; Lihao Xu

In the past few years, all manner of storage applications, ranging from disk array systems to distributed and wide-area systems, have started to grapple with the reality of tolerating multiple simultaneous failures of storage nodes. Unlike the single failure case, which is optimally handled with RAID level-5 parity, the multiple failure case is more difficult because optimal general purpose strategies are not yet known. Erasure coding is the field of research that deals with these strategies, and this field has blossomed in recent years. Despite this research, the decades-old Reed-Solomon erasure code remains the only space-optimal (MDS) code for all but the smallest storage systems. The best performing implementations of Reed-Solomon coding employ a variant called Cauchy Reed-Solomon coding, developed in the mid 1990s. In this paper, we present an improvement to Cauchy Reed-Solomon coding that is based on optimizing the Cauchy distribution matrix. We detail an algorithm for generating good matrices and then evaluate the performance of encoding using all implementations of Reed-Solomon codes, plus the best MDS codes from the literature. The improvements over the original Cauchy Reed-Solomon codes are as much as 83% in realistic scenarios, and average roughly 10% over all cases that we tested.
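The starting point for Cauchy Reed-Solomon coding is the Cauchy distribution matrix itself, whose structure guarantees the MDS property. The sketch below builds one over GF(2^4); the element sets X and Y and the field size are illustrative assumptions, not the optimized matrices of the paper (which further tunes matrix choice to minimize the XOR count of the bit-matrix encoding).

```python
def gf16_mul(a, b, poly=0b10011):
    # Carry-less multiplication in GF(2^4), reduced by x^4 + x + 1.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= poly
        b >>= 1
    return r

def gf16_inv(a):
    # Brute-force multiplicative inverse (the field has only 16 elements).
    return next(x for x in range(1, 16) if gf16_mul(a, x) == 1)

# Cauchy matrix: M[i][j] = 1 / (x_i + y_j), where X and Y are disjoint
# sets of distinct field elements.  Addition in GF(2^w) is XOR, so
# x_i ^ y_j is never zero and every entry is well defined.
X = [1, 2]          # one row per checksum device (m = 2)
Y = [3, 4, 5]       # one column per data device (n = 3)
M = [[gf16_inv(x ^ y) for y in Y] for x in X]

# The property that makes the resulting code MDS: every square
# submatrix of a Cauchy matrix is invertible.  Check all 2x2 ones here.
for j in range(3):
    for k in range(j + 1, 3):
        det = gf16_mul(M[0][j], M[1][k]) ^ gf16_mul(M[0][k], M[1][j])
        assert det != 0
```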


IEEE Transactions on Parallel and Distributed Systems | 1994

Low-latency, concurrent checkpointing for parallel programs

Kai Li; Jeffrey F. Naughton; James S. Plank

This paper presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
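The copy-on-write idea can be illustrated with a sequential sketch: at checkpoint time every page is marked, the target keeps running, and a write to a still-unsaved page first copies its old contents so the checkpoint reflects the state at checkpoint time. Page names and contents below are invented, and real implementations use hardware write protection rather than an explicit function call.

```python
memory = {0: b"page-zero", 1: b"page-one.", 2: b"page-two."}

# Begin a checkpoint: "write-protect" every page (here, just a set).
unsaved = set(memory)
snapshot = {}

def write_page(page, data):
    # Stand-in for the write-fault handler: copy the pre-checkpoint
    # contents before the target program overwrites the page.
    if page in unsaved:
        snapshot[page] = memory[page]
        unsaved.discard(page)
    memory[page] = data

# The target program mutates page 1 while the checkpoint is in flight...
write_page(1, b"dirtied!!")

# ...and the checkpointer drains the remaining pages in the background.
for page in list(unsaved):
    snapshot[page] = memory[page]
    unsaved.discard(page)

# The checkpoint holds the state as of checkpoint time, not the write.
assert snapshot[1] == b"page-one."
assert memory[1] == b"dirtied!!"
```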


ACM Special Interest Group on Data Communication | 2002

An end-to-end approach to globally scalable network storage

Micah Beck; Terry Moore; James S. Plank

This paper discusses the application of end-to-end design principles, which are characteristic of the architecture of the Internet, to network storage. While putting storage into the network fabric may seem to contradict end-to-end arguments, we try to show not only that there is no contradiction, but also that adherence to such an approach is the key to achieving true scalability of shared network storage. After discussing end-to-end arguments with respect to several properties of network storage, we describe the Internet Backplane Protocol and the exNode, which are tools that have been designed to create a network storage substrate that adheres to these principles. The name for this approach is Logistical Networking, and we believe its use is fundamental to the future of truly scalable communication.


IEEE International Symposium on Fault-Tolerant Computing | 1998

Experimental assessment of workstation failures and their impact on checkpointing systems

James S. Plank; Wael R. Elwasif

In the past twenty years, there has been a wealth of theoretical research on minimizing the expected running time of a program in the presence of failures by employing checkpointing and rollback recovery. In the same time period, there has been little experimental research to corroborate these results. We study three separate projects that monitor failure in workstation networks. Our goals are twofold. The first is to see how these results correlate with the theoretical results, and the second is to assess their impact on strategies for checkpointing long-running computations on workstations and networks of workstations. A significant result of our work is that although the base assumptions of the theoretical research do not hold, many of the results are still applicable.
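Much of the theoretical work on checkpoint placement that this paper tests against the failure data builds on first-order approximations such as Young's 1974 formula for the optimal checkpoint interval, t_opt ≈ sqrt(2 · C · MTBF). This classic result is background, not a result of the paper, and the numbers below are invented for illustration.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    # Young's first-order approximation: checkpoint cost C and mean
    # time between failures, both in the same time unit.
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

C = 60.0             # seconds to take one checkpoint (assumed)
MTBF = 6 * 3600.0    # one failure every six hours on average (assumed)
t = young_interval(C, MTBF)
print(f"checkpoint roughly every {t / 60:.1f} minutes")
```

Approximations like this assume, among other things, exponentially distributed and independent failures; a key point of the paper is that such base assumptions do not hold for real workstation networks, yet many of the derived results remain applicable.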


Dependable Systems and Networks | 2004

A practical analysis of low-density parity-check erasure codes for wide-area storage applications

James S. Plank; Michael G. Thomason

As peer-to-peer and widely distributed storage systems proliferate, the need to perform efficient erasure coding, instead of replication, is crucial to performance and efficiency. Low-density parity-check (LDPC) codes have arisen as alternatives to standard erasure codes, such as Reed-Solomon codes, trading off vastly improved decoding performance for inefficiencies in the amount of data that must be acquired to perform decoding. The scores of papers written on LDPC codes typically analyze their collective and asymptotic behavior. Unfortunately, their practical application requires the generation and analysis of individual codes for finite systems. This paper attempts to illuminate the practical considerations of LDPC codes for peer-to-peer and distributed storage systems. The three main types of LDPC codes are detailed, and a huge variety of codes are generated, then analyzed using simulation. This analysis focuses on the performance of individual codes for finite systems, and addresses several important heretofore unanswered questions about employing LDPC codes in real-world systems.
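The "vastly improved decoding performance" of LDPC codes comes from iterative ("peeling") decoding: any parity check with exactly one erased symbol recovers it by XOR, and recoveries cascade. The tiny code below is an invented toy graph, not one of the codes the paper generates, but it shows the mechanism.

```python
# Symbols 0-3 are data, 4-6 are parity.  Each check lists symbol
# indices whose XOR must be zero (the parity symbol is in its check).
checks = [(0, 1, 4), (1, 2, 5), (2, 3, 6)]
values = [5, 9, 12, 7, 5 ^ 9, 9 ^ 12, 12 ^ 7]   # a consistent codeword

erased = {1, 2}                       # simulate losing two symbols
known = {i: v for i, v in enumerate(values) if i not in erased}

progress = True
while progress:
    progress = False
    for chk in checks:
        missing = [i for i in chk if i not in known]
        if len(missing) == 1:         # peel: solve for the lone unknown
            i = missing[0]
            known[i] = 0
            for j in chk:
                if j != i:
                    known[i] ^= known[j]
            progress = True

assert all(known[i] == values[i] for i in range(7))
```

The inefficiency the abstract mentions shows up when peeling stalls: unlike an MDS code, an LDPC decoder may need somewhat more than k symbols before every check can be peeled, and quantifying that overhead for small, finite codes is exactly what the paper's simulations measure.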


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1990

Real-time, concurrent checkpoint for parallel programs

Kai Li; Jeffrey F. Naughton; James S. Plank

We have developed and implemented a checkpointing and restart algorithm for parallel programs running on commercial uniprocessors and shared-memory multiprocessors. The algorithm runs concurrently with the target program, interrupts the target program for small, fixed amounts of time and is transparent to the checkpointed program and its compiler. The algorithm achieves its efficiency through a novel use of address translation hardware that allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.


Conference on High Performance Computing (Supercomputing) | 1997

CLIP: A Checkpointing Tool for Message Passing Parallel Programs

Yuqun Chen; James S. Plank; Kai Li

Checkpointing is a useful technique for rollback recovery. We present CLIP, a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. Creating an actual tool for checkpointing a complex machine like the Paragon is not easy, because many issues arise that require careful design decisions to be made. We detail what these decisions are, and how they were made in CLIP. We present performance data when checkpointing several long-running parallel applications. These results show that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer with good performance.

Collaboration


Top co-authors of James S. Plank:

Micah Beck (University of Tennessee)
Terry Moore (University of Tennessee)
Catherine D. Schuman (Oak Ridge National Laboratory)
Kai Li (University of Tennessee)
Richard Wolski (University of California)
Adam Disney (University of Tennessee)