
Publication


Featured research published by Garth A. Gibson.


ACM Computing Surveys | 1994

RAID: high-performance, reliable secondary storage

Peter M. Chen; Edward K. Lee; Garth A. Gibson; Randy H. Katz; David A. Patterson

Disk arrays were proposed in the 1980s as a way to use parallelism between multiple disks to improve aggregate I/O performance. Today they appear in the product lines of most major computer manufacturers. This article gives a comprehensive overview of disk arrays and provides a framework in which to organize current and future work. First, the article introduces disk technology and reviews the driving forces that have popularized disk arrays: performance and reliability. It discusses the two architectural techniques used in disk arrays: striping across multiple disks to improve performance and redundancy to improve reliability. Next, the article describes seven disk array architectures, called RAID (Redundant Arrays of Inexpensive Disks) levels 0–6, and compares their performance, cost, and reliability. It goes on to discuss advanced research and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency. Last, the article describes six disk array prototypes or products and discusses future opportunities for research, with an annotated bibliography of disk array-related literature.
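The striping and parity-redundancy combination summarized above can be made concrete with a short sketch. The code below is purely illustrative (it is not from the article, and it ignores real-world concerns such as parity rotation across stripes and the small-write problem): it builds a single RAID-5-style stripe and reconstructs a lost block by XOR.

```python
# Minimal illustration (not from the article) of RAID-5-style striping:
# data blocks are striped across disks, one parity block per stripe is the
# XOR of the data blocks, so any single lost block can be reconstructed.
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_stripe(data_blocks):
    """Return the data blocks plus their parity block for one stripe."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, lost_index):
    """Recover the block at lost_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Example: a 3+1 stripe; losing any one block is recoverable.
stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
assert reconstruct(stripe, 1) == b"BBBB"
```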


dependable systems and networks | 2006

A large-scale study of failures in high-performance computing systems

Bianca Schroeder; Garth A. Gibson

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
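As a hedged illustration of the distributional findings (not the paper's actual analysis or data), the sketch below fits a Weibull to synthetic inter-failure times and a lognormal to synthetic repair times using scipy; a fitted Weibull shape parameter below 1 corresponds to the decreasing hazard rate reported above.

```python
# Hypothetical sketch: fit a Weibull to inter-failure times and a lognormal
# to repair times, mirroring the study's findings. All data here is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
interarrival_hours = rng.weibull(0.7, size=2000) * 400.0      # synthetic
repair_hours = rng.lognormal(mean=1.0, sigma=1.2, size=2000)  # synthetic

c, loc, scale = stats.weibull_min.fit(interarrival_hours, floc=0)
print(f"Weibull shape = {c:.2f} (shape < 1 implies a decreasing hazard rate)")

s, loc, scale = stats.lognorm.fit(repair_hours, floc=0)
print(f"lognormal sigma = {s:.2f}, median repair = {scale:.1f} hours")
```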


architectural support for programming languages and operating systems | 1998

A cost-effective, high-bandwidth storage architecture

Garth A. Gibson; David F. Nagle; Khalil Amiri; Jeff Butler; Fay W. Chang; Howard Gobioff; Charles Hardin; Erik Riedel; David Rochberg; Jim Zelenka

This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations of NASD drives, array management for our architecture, and three filesystems built on our prototype. NASD provides scalable storage bandwidth without the cost of servers used primarily for transferring data from peripheral networks (e.g. SCSI) to client networks (e.g. ethernet). Increasing dataset sizes, new attachment technologies, the convergence of peripheral and interprocessor switched networks, and the increased availability of on-drive transistors motivate and enable this new architecture. NASD is based on four main principles: direct transfer to clients, secure interfaces via cryptographic support, asynchronous non-critical-path oversight, and variably-sized data objects. Measurements of our prototype system show that these services can be cost-effectively integrated into a next generation disk drive ASIC. End-to-end measurements of our prototype drive and filesystems suggest that NASD can support conventional distributed filesystems without performance degradation. More importantly, we show scalable bandwidth for NASD-specialized filesystems. Using a parallel data mining application, NASD drives deliver a linear scaling of 6.2 MB/s per client-drive pair, tested with up to eight pairs in our lab.
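The principle of secure interfaces via cryptographic support is the kind of mechanism that can be sketched with a capability token checked at the drive. The snippet below is a hypothetical illustration using an HMAC over an object ID and a rights string; it is not NASD's actual capability format or protocol.

```python
# Hypothetical sketch of capability-style access control at the drive:
# a file manager issues a capability for an object, and the drive verifies
# it with a key shared between manager and drive. Not NASD's real protocol.
import hmac, hashlib

DRIVE_KEY = b"secret shared between file manager and drive"

def issue_capability(object_id: int, rights: str) -> tuple[bytes, bytes]:
    """File manager: bind (object_id, rights) to a MAC the drive can verify."""
    msg = f"{object_id}:{rights}".encode()
    return msg, hmac.new(DRIVE_KEY, msg, hashlib.sha256).digest()

def drive_check(msg: bytes, tag: bytes, object_id: int, op: str) -> bool:
    """Drive: accept the request only if the MAC is valid and rights allow op."""
    if not hmac.compare_digest(tag, hmac.new(DRIVE_KEY, msg, hashlib.sha256).digest()):
        return False
    obj, rights = msg.decode().split(":")
    return int(obj) == object_id and op in rights

cap, tag = issue_capability(42, "r")      # client obtains a read capability
print(drive_check(cap, tag, 42, "r"))     # True: direct client-to-drive read
print(drive_check(cap, tag, 42, "w"))     # False: write not authorized
```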


acm special interest group on data communication | 2009

Safe and effective fine-grained TCP retransmissions for datacenter communication

Vijay Vasudevan; Amar Phanishayee; Hiral Shah; Elie Krevat; David G. Andersen; Gregory R. Ganger; Garth A. Gibson; Brian Mueller

This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets---the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a barrier synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), throughput is reduced by up to 90%. For latency-sensitive applications, TCP timeouts in the datacenter impose delays of hundreds of milliseconds in networks with round-trip-times in microseconds. Our practical solution uses high-resolution timers to enable microsecond-granularity TCP timeouts. We demonstrate that this technique is effective in avoiding TCP incast collapse in simulation and in real-world experiments. We show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.
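The core mechanism, microsecond-granularity retransmission timeouts without a coarse minimum bound, can be sketched as follows. This is an illustrative calculation in the style of the standard RTO formula, not the paper's kernel modifications; the 200 ms legacy floor shown for comparison is a typical default used here as an assumption.

```python
# Illustrative sketch (not the paper's kernel changes) of the core idea:
# compute the retransmission timeout from microsecond-granularity RTT
# estimates and do NOT clamp it to a coarse minimum such as 200 ms.
def rto_microseconds(srtt_us: float, rttvar_us: float,
                     backoff: int = 0, min_rto_us: float = 0.0) -> float:
    """RFC 6298-style RTO computed at microsecond granularity.

    With min_rto_us = 0 the timeout tracks the actual datacenter RTT
    (tens to hundreds of microseconds) instead of a hundreds-of-milliseconds
    floor, which is what avoids incast throughput collapse.
    """
    rto = srtt_us + max(4 * rttvar_us, 1.0)   # assume a 1 us clock granularity
    return max(rto, min_rto_us) * (2 ** backoff)

# Example: 100 us smoothed RTT with 20 us variance.
print(rto_microseconds(100, 20))                      # ~180 us timeout
print(rto_microseconds(100, 20, min_rto_us=200_000))  # 200 ms with legacy floor
```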


Communications of The ACM | 2000

Network attached storage architecture

Garth A. Gibson; Rodney Van Meter

It is often the SAN, with its Fibre Channel network hardware, that has the greater effect on a user’s purchasing decisions. This article is about how emerging technology may blur the network-centric distinction between NAS and SAN. For example, the decreasing specialization of SAN protocols promises SAN-like devices on Ethernet network hardware. Alternatively, the increasing specialization of NAS systems may embed much of the file system into storage devices. For users, it is increasingly worthwhile to investigate networked storage core and emerging technologies. Today, bits stored online on magnetic disks are so inexpensive that users are finding new, previously unaffordable, uses for storage. At Dataquest’s Storage2000 conference last June in Orlando, Fla., IBM reported that online disk storage is now significantly cheaper than paper or film, the dominant traditional information storage media. Not surprisingly, users are adding storage capacity at about 100% per year. Moreover, the rapid growth of e-commerce, with its huge global customer base and easy-to-use, online transactions, has introduced new market requirements, including bursty, unpredictable spurts in capacity, that demand vendors minimize the time from a user’s order to installation of new storage. In our increasingly Internet-dependent business and computing environment, network storage is the computer.


Journal of Physics: Conference Series | 2007

Understanding Failures in Petascale Computers

Bianca Schroeder; Garth A. Gibson

With petascale computers only a year or two away, there is a pressing need to anticipate and compensate for a probable increase in failure and application interruption rates. Researchers, designers and integrators have available to them far too little detailed information on the failures and interruptions that even smaller terascale computers experience. The information that is available suggests that application interruptions will become far more common in the coming decade, and the largest applications may surrender large fractions of the computer’s resources to taking checkpoints and restarting from a checkpoint after an interruption. This paper reviews sources of failure information for compute clusters and storage systems, projects failure rates and the corresponding decrease in application effectiveness, and discusses coping strategies such as application-level checkpoint compression and system-level process-pairs fault tolerance for supercomputing. The need for a public repository of detailed failure and interruption records is particularly pressing, as projections from one architectural family of machines to another are widely disputed. To this end, this paper introduces the Computer Failure Data Repository and issues a call for failure history data to publish in it.

One of the hardest problems in future high-performance computing (HPC) installations will be avoiding, coping with, and recovering from failures. The coming PetaFLOPS clusters will require the simultaneous use and control of hundreds of thousands or even millions of processing, storage, and networking elements. With this large number of elements involved, element failure will be frequent, making it increasingly difficult for applications to make forward progress. The success of petascale computing will depend on the ability to provide reliability and availability at scale. While researchers and practitioners have spent decades investigating approaches for avoiding, coping with, and recovering from failures, progress in this area has been hindered by the lack of publicly available failure data from real large-scale systems. We have collected and analyzed a number of large data sets on failures in high-performance computing (HPC) systems. These data sets cover node outages in HPC clusters, as well as failures in storage systems. Using these data sets and large-scale trends and assumptions commonly applied to future computing systems design, we project onto the potential machines of the next decade our expectations for failure rates, mean time to application interruption, and the consequent application utilization of the full machine, based on checkpoint/restart fault tolerance and the balanced system design method of matching storage bandwidth and memory size to aggregate computing power [14]. Not surprisingly, if the growth in aggregate computing power continues to outstrip the growth in per-chip computing power, more and more of the computer’s resources may be spent on conventional fault recovery methods. We envision applications being denied as much as half of the system’s resources within five years, for example. We then discuss alternative actions that may compensate for this unacceptable loss.
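The kind of utilization projection described above is often reasoned about with Young's approximation for the optimal checkpoint interval. The sketch below uses that approximation with made-up MTBF and checkpoint-cost numbers; it illustrates the trend, not the paper's model or its data.

```python
# Back-of-the-envelope sketch of a utilization projection under periodic
# checkpoint/restart, using Young's approximation for the checkpoint interval.
# All numbers are hypothetical; this is not the paper's model or data.
import math

def utilization(mtbf_hours: float, checkpoint_hours: float,
                restart_hours: float) -> float:
    """Approximate fraction of machine time spent on useful work."""
    # Young's approximation for the optimal checkpoint interval.
    interval = math.sqrt(2.0 * checkpoint_hours * mtbf_hours)
    checkpoint_overhead = checkpoint_hours / interval
    # Expected rework plus restart per failure, amortized over the MTBF.
    lost_work = (interval / 2.0 + restart_hours) / mtbf_hours
    return max(0.0, 1.0 - checkpoint_overhead - lost_work)

# As MTBF shrinks from 24 h to 2 h with a 0.5 h checkpoint, useful
# utilization drops sharply, which is the trend the paper projects.
for mtbf in (24.0, 8.0, 2.0):
    print(f"MTBF {mtbf:>4.1f} h -> utilization {utilization(mtbf, 0.5, 0.25):.0%}")
```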


measurement and modeling of computer systems | 1997

File server scaling with network-attached secure disks

Garth A. Gibson; David F. Nagle; Khalil Amiri; Fay W. Chang; Eugene Feinberg; Howard Gobioff; Chen Lee; Berend Ozceri; Erik Riedel; David Rochberg; Jim Zelenka

By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability for existing distributed file systems (by removing the server as a bottleneck) and bandwidth for new parallel and distributed file systems (through network striping and more efficient data paths). Together, these advantages influence a large enough fraction of the storage market to make commodity network-attached storage feasible. Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues. This paper contrasts two network-attached storage architectures---(1) Networked SCSI disks (NetSCSI) are network-attached storage devices with minimal changes from the familiar SCSI interface, while (2) Network-Attached Secure Disks (NASD) are drives that support independent client access to drive object services. To estimate the potential performance benefits of these architectures, we develop an analytic model and perform trace-driven replay experiments based on AFS and NFS traces. Our results suggest that NetSCSI can reduce file server load during a burst of NFS or AFS activity by about 30%. With the NASD architecture, server load (during burst activity) can be reduced by a factor of up to five for AFS and up to ten for NFS.
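A toy version of the intuition behind the offload benefit (not the paper's analytic model) is sketched below: when data no longer flows through the file server, the server's per-request cost shrinks to metadata handling only. All costs in the example are hypothetical.

```python
# Toy sketch (not the paper's analytic model) of why moving data transfer
# off the file server reduces server load: the server's per-request cost
# drops from metadata plus data movement to metadata only.
def server_load(requests_per_s: float, metadata_cost_us: float,
                bytes_per_request: float, per_byte_cost_us: float,
                data_through_server: bool) -> float:
    """Fraction of one server CPU consumed, under the stated assumptions."""
    per_request_us = metadata_cost_us
    if data_through_server:
        per_request_us += bytes_per_request * per_byte_cost_us
    return requests_per_s * per_request_us / 1e6

# Hypothetical costs: 200 us of metadata work, 64 KB transfers at 0.05 us/byte.
traditional = server_load(1000, 200, 64 * 1024, 0.05, data_through_server=True)
nasd_style  = server_load(1000, 200, 64 * 1024, 0.05, data_through_server=False)
print(f"traditional server: {traditional:.2f} CPUs; NASD-style: {nasd_style:.2f} CPUs")
```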


IEEE Transactions on Dependable and Secure Computing | 2010

A Large-Scale Study of Failures in High-Performance Computing Systems

Bianca Schroeder; Garth A. Gibson

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.


ieee international conference on high performance computing data and analytics | 2009

PLFS: a checkpoint filesystem for parallel applications

John M. Bent; Garth A. Gibson; Gary Grider; Ben McClelland; Paul Nowoczynski; James Nunez; Milo Polte; Meghan Wingate

Parallel applications running across thousands of processors must protect themselves from inevitable system failures. Many applications insulate themselves from failures by checkpointing. For many applications, checkpointing into a shared single file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to non-shared files. To address this fundamental mismatch, we have developed a virtual parallel log-structured file system, PLFS. PLFS remaps an application's preferred data layout into one which is optimized for the underlying file system. Through testing on PanFS, Lustre, and GPFS, we have seen that this layer of indirection and reorganization can reduce checkpoint time by an order of magnitude for several important benchmarks and real applications without any application modification.
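The remapping idea can be sketched in a few lines: each process appends its writes to a private log and records an index entry mapping logical file offsets to log locations, so the shared logical file can be reassembled on read. The class below is a minimal, in-memory illustration, not PLFS source code.

```python
# Minimal sketch (not PLFS source code) of the remapping idea: each process
# appends its small, unaligned writes to a private log and records an index
# entry, so the shared logical file can be reconstructed on read.
class LogStructuredFile:
    def __init__(self):
        self.logs = {}    # rank -> bytearray holding that process's appends
        self.index = []   # (logical_offset, length, rank, log_offset)

    def write(self, rank: int, logical_offset: int, data: bytes) -> None:
        log = self.logs.setdefault(rank, bytearray())
        self.index.append((logical_offset, len(data), rank, len(log)))
        log.extend(data)  # always an append: fast on the underlying FS

    def read(self, logical_offset: int, length: int) -> bytes:
        out = bytearray(length)
        for off, n, rank, log_off in self.index:   # later entries win
            lo = max(off, logical_offset)
            hi = min(off + n, logical_offset + length)
            if lo < hi:
                chunk = self.logs[rank][log_off + (lo - off): log_off + (hi - off)]
                out[lo - logical_offset: hi - logical_offset] = chunk
        return bytes(out)

# Two "processes" checkpoint into one logical file with strided writes.
f = LogStructuredFile()
f.write(rank=0, logical_offset=0, data=b"aaaa")
f.write(rank=1, logical_offset=4, data=b"bbbb")
print(f.read(0, 8))   # b"aaaabbbb"
```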


ieee computer society international conference | 1989

Introduction to redundant arrays of inexpensive disks (RAID)

David A. Patterson; Peter M. Chen; Garth A. Gibson; Randy H. Katz

The authors discuss various types of RAIDs (redundant arrays of inexpensive disks), a cost-effective option to meet the challenge of exponential growth in processor and memory speeds. They argue that the size reduction of personal-computer (PC) disks is the key to the success of disk arrays. While large arrays of mainframe processors are possible, it is certainly easier to construct an array from the same number of microprocessors (or PC drives). With advantages in cost-performance, reliability, power consumption, and floor space, the authors expect RAIDs to replace large drives in future I/O systems.

Collaboration


Dive into Garth A. Gibson's collaborations.

Top Co-Authors

Randy H. Katz (University of California)
Gregory R. Ganger (Carnegie Mellon University)
Mark Holland (Carnegie Mellon University)
Jim Zelenka (Carnegie Mellon University)
Milo Polte (Carnegie Mellon University)
Khalil Amiri (Carnegie Mellon University)