Publications
Featured research published by Richard A. Golding.
symposium on reliable distributed systems | 1995
Darrell D. E. Long; Andrew Muir; Richard A. Golding
An accurate estimate of host reliability is important for correct analysis of many fault-tolerance and replication mechanisms. In a previous study, we estimated host system reliability by querying a large number of hosts to find how long they had been functioning, estimating the mean time-to-failure (MTTF) and availability from those measures, and in turn deriving an estimate of the mean time-to-repair (MTTR). However, this approach had a bias towards more reliable hosts that could result in overestimating MTTR and underestimating availability. To address this bias we have conducted a second experiment using a fault-tolerant replicated monitoring tool. This tool directly measures TTF, TTR, and availability by polling many sites frequently from several locations. We find that these more accurate results generally confirm and improve our earlier estimates, particularly for TTR. We also find that failure and repair are unlikely to follow Poisson processes.
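As a rough illustration of the estimate involved (a sketch, not the paper's actual monitoring tool), steady-state availability follows directly from measured time-to-failure and time-to-repair samples as MTTF / (MTTF + MTTR):

```python
# Illustrative sketch: estimating MTTF, MTTR, and availability from directly
# measured time-to-failure / time-to-repair samples (hours). Sample values are
# hypothetical, not measurements from the study.

def estimate_reliability(ttf_samples_hours, ttr_samples_hours):
    """Return (MTTF, MTTR, availability) from observed samples."""
    mttf = sum(ttf_samples_hours) / len(ttf_samples_hours)
    mttr = sum(ttr_samples_hours) / len(ttr_samples_hours)
    availability = mttf / (mttf + mttr)   # steady-state availability
    return mttf, mttr, availability

mttf, mttr, avail = estimate_reliability([700.0, 1200.0, 300.0], [2.0, 8.0, 1.5])
print(f"MTTF={mttf:.1f}h MTTR={mttr:.1f}h availability={avail:.4f}")
```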
european conference on computer systems | 2008
Anna Povzner; Tim Kaldewey; Scott A. Brandt; Richard A. Golding; Theodore M. Wong; Carlos Maltzahn
Guaranteed I/O performance is needed for a variety of applications ranging from real-time data collection to desktop multimedia to large-scale scientific simulations. Reservations on throughput, the standard measure of disk performance, fail to manage disk performance effectively: because best-, average-, and worst-case response times differ by orders of magnitude, worst-case-based reservations can commit less than 0.01% of the achievable bandwidth. We show that by reserving disk resources in terms of utilization it is possible to create a disk scheduler that supports reservation of nearly 100% of the disk resources, provides arbitrarily hard or soft guarantees depending upon application needs, and yields efficiency as good as or better than best-effort disk schedulers tuned for performance. We present the architecture of our scheduler, prove the correctness of its algorithms, and provide results demonstrating its effectiveness.
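A minimal sketch of the admission logic implied by utilization-based reservation (class and stream names are illustrative, not taken from the paper): each stream reserves a fraction of disk time per period, and admission succeeds as long as the reserved fractions fit within the whole disk.

```python
# Sketch of utilization-based disk reservation: each stream reserves a fraction
# of disk *time* per period; admission succeeds while reserved fractions fit.

class UtilizationReserver:
    def __init__(self, reservable_fraction=1.0):
        self.reservable = reservable_fraction   # fraction of disk time that may be reserved
        self.reservations = {}                  # stream id -> reserved utilization

    def admit(self, stream_id, utilization):
        """Reserve `utilization` (e.g. 0.25 = 25% of disk time) for a stream."""
        if utilization <= 0 or utilization > 1:
            raise ValueError("utilization must be in (0, 1]")
        if sum(self.reservations.values()) + utilization > self.reservable:
            return False                        # would overcommit disk time
        self.reservations[stream_id] = utilization
        return True

    def budget(self, stream_id, period_s):
        """Disk-time budget (seconds) a stream may consume per scheduling period."""
        return self.reservations[stream_id] * period_s

reserver = UtilizationReserver()
assert reserver.admit("video-capture", 0.30)
assert reserver.admit("simulation-checkpoint", 0.60)
assert not reserver.admit("backup", 0.20)       # 1.10 total would overcommit
print(reserver.budget("video-capture", period_s=1.0))  # 0.3 s of disk time per second
```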
real time technology and applications symposium | 2006
Theodore M. Wong; Richard A. Golding; Caixue Lin; Ralph A. Becker-Szendy
Large-scale storage systems often hold data for multiple applications and users. A problem in such systems is isolating applications and users from each other to prevent their workloads from interacting in unexpected ways. Another is ensuring that each application receives an appropriate level of performance. As part of the solution to these problems, we have designed a hierarchical I/O scheduling algorithm to manage performance resources on an underlying storage device. Our algorithm uses a simple allocation abstraction: an application or user has a corresponding pool of throughput, and manages throughput within its pool by opening sessions. The algorithm ensures that each pool and session receives at least a reserve rate of throughput and caps usage at a limit rate, using hierarchical token buckets and EDF I/O scheduling. Once it has fulfilled the reserves of all active sessions and pools, it shares unused throughput fairly among active sessions and pools such that they tend to receive the same amount. It thus combines deadline scheduling with proportional-style resource sharing in a novel way. We assume that the device performs its own low-level head scheduling, rather than modeling the device in detail. Our implementation shows the correctness of our algorithm, imposes little overhead on the system, and achieves throughput nearly equal to that of an unmanaged device.
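A compact sketch of the combination the abstract describes, token buckets capping each session at its limit rate while EDF ordering (with deadlines spaced by each session's reserve rate) protects the reserves; the class names and parameters are illustrative, not the paper's implementation.

```python
import heapq

# Sketch: a token bucket caps each session at its limit rate, while EDF ordering
# (deadlines derived from each session's reserve rate) protects the reserves.

class TokenBucket:
    def __init__(self, rate_iops, burst=1.0):
        self.rate, self.tokens, self.burst, self.last = rate_iops, burst, burst, 0.0

    def try_consume(self, now):
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class Session:
    def __init__(self, name, reserve_iops, limit_iops):
        self.name = name
        self.reserve_iops = reserve_iops
        self.limit_bucket = TokenBucket(limit_iops)
        self.next_deadline = 0.0

class EDFScheduler:
    def __init__(self):
        self.queue = []           # (deadline, seq, session, request)
        self.seq = 0

    def submit(self, session, request, now):
        if not session.limit_bucket.try_consume(now):
            return False          # over its limit rate; caller may retry later
        # Space deadlines by the reserved inter-arrival time so reserves are met.
        session.next_deadline = max(now, session.next_deadline) + 1.0 / session.reserve_iops
        heapq.heappush(self.queue, (session.next_deadline, self.seq, session, request))
        self.seq += 1
        return True

    def dispatch(self):
        return heapq.heappop(self.queue) if self.queue else None

sched = EDFScheduler()
s = Session("pool-A/session-1", reserve_iops=100, limit_iops=200)
sched.submit(s, request="read block 42", now=0.0)
print(sched.dispatch()[0])   # deadline 0.01 s, i.e. 1/reserve_iops after submission
```

A full hierarchy would nest one bucket per pool above the session buckets, but the reserve/limit and deadline mechanics are the same at each level.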
real time technology and applications symposium | 2008
Tim Kaldewey; Theodore M. Wong; Richard A. Golding; Anna Povzner; Scott A. Brandt; Carlos Maltzahn
Large- and small-scale storage systems frequently serve a mixture of workloads, an increasing number of which require some form of performance guarantee. Providing guaranteed disk performance - the equivalent of a virtual disk - is challenging because disk requests are non-preemptible and their execution times are stateful, partially non-deterministic, and can vary by orders of magnitude. Guaranteeing throughput, the standard measure of disk performance, requires worst-case I/O time assumptions orders of magnitude greater than average I/O times, with correspondingly low performance and poor control of the resource allocation. We show that disk time utilization - analogous to CPU utilization in CPU scheduling and the only fully provisionable aspect of disk performance - yields greater control, more efficient use of disk resources, and better isolation between request streams than bandwidth or I/O rate when used as the basis for disk reservation and scheduling.
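To make the contrast concrete, here is the back-of-the-envelope arithmetic the argument rests on; the specific numbers are illustrative, not figures from the paper.

```latex
% Illustrative figures only. Suppose the average I/O time is t_avg = 5 ms and the
% worst case (long seek, full rotation, error recovery) is t_wc = 500 ms. A
% throughput guarantee must budget every request at t_wc, so it can promise only
% 1 / t_wc = 2 IOPS against an average capability of 1 / t_avg = 200 IOPS,
% i.e. a guaranteeable fraction of t_avg / t_wc = 1%. Utilization reservations
% instead partition disk *time* directly and can sum to nearly the whole disk.
\[
  \text{guaranteeable fraction under throughput reservation} \;=\; \frac{t_{\mathrm{avg}}}{t_{\mathrm{wc}}},
  \qquad
  \text{utilization reservation: } \sum_i u_i \le 1 .
\]
```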
ieee conference on mass storage systems and technologies | 2007
Kristal T. Pollack; Darrell D. E. Long; Richard A. Golding; Ralph A. Becker-Szendy; Benjamin Reed
Storage systems manage quotas to ensure that no one user can use more than their share of storage, and that each user gets the storage they need. This is difficult for large, distributed systems, especially those used for high-performance computing applications, because resource allocation occurs on many nodes concurrently. While quota management is an important problem, no robust, scalable solutions have been proposed to date. We present a solution that has less than 0.2% performance overhead while the system is below saturation, compared with not enforcing quota at all. It provides byte-level accuracy at all times in the absence of failures and cheating. If nodes fail or cheat, we recover within a bounded period. In our scheme, quota is enforced asynchronously by intelligent storage servers: storage clients contact a shared management service to obtain vouchers, which the clients can spend like cash at participating storage servers to allocate storage space. Like a digital cash system, the system periodically reconciles voucher usage to ensure that clients do not cheat by spending the same voucher at multiple storage servers. We report on a simulation study that validates this approach and evaluates its performance.
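A toy sketch of the voucher idea (simplified: vouchers here are plain identifiers rather than signed tokens, and failure handling is omitted): the management service debits quota when it issues a voucher, servers accept vouchers without contacting the service, and periodic reconciliation catches a voucher spent at more than one server.

```python
import uuid

# Sketch of voucher-based quota enforcement. Names and the reconciliation step
# are simplified illustrations, not the paper's protocol.

class QuotaManager:
    def __init__(self, quota_bytes):
        self.remaining = quota_bytes
        self.issued = {}                      # voucher id -> bytes authorized

    def issue_voucher(self, nbytes):
        if nbytes > self.remaining:
            raise ValueError("quota exhausted")
        self.remaining -= nbytes
        vid = str(uuid.uuid4())
        self.issued[vid] = nbytes
        return vid, nbytes

    def reconcile(self, spend_reports):
        """spend_reports: iterable of (voucher id, server id). Detect double spends."""
        seen, double_spent = {}, []
        for vid, server in spend_reports:
            if vid in seen and seen[vid] != server:
                double_spent.append(vid)      # same voucher spent at two servers
            seen.setdefault(vid, server)
        return double_spent

class StorageServer:
    def __init__(self, name):
        self.name, self.spent = name, []

    def allocate(self, vid, nbytes):
        # Accept the voucher without contacting the manager; report it later.
        self.spent.append((vid, self.name))
        return f"{self.name}: allocated {nbytes} bytes against voucher {vid[:8]}"

manager = QuotaManager(quota_bytes=10 * 2**20)
server = StorageServer("oss-1")
vid, nbytes = manager.issue_voucher(1 * 2**20)
print(server.allocate(vid, nbytes))
print(manager.reconcile(server.spent))        # [] -> no double spending detected
```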
IEEE Transactions on Dependable and Secure Computing | 2011
K. K. Rao; James Lee Hafner; Richard A. Golding
High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems realized through networked storage nodes offer several advantages over monolithic systems such as lower cost and increased scalability. In order to achieve reliability goals associated with enterprise-class storage systems, redundancy will have to be distributed across the collection of nodes to tolerate both node and drive failures. In this paper, we present alternatives for distributing this redundancy, and models to determine the reliability of such systems. We specify a reliability target and determine the configurations that meet this target. Further, we perform sensitivity analyses, where selected parameters are varied to observe their effect on reliability.
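As one reference point for the kind of model involved, a standard first-order approximation (not the paper's specific node-and-drive model) for a redundancy group that tolerates a single failure is:

```latex
% Textbook first-order estimate of mean time to data loss (MTTDL) for a group of
% N components, each with mean time to failure MTTF and mean time to repair MTTR,
% where a second failure during a repair window loses data. Shown only as a
% reference point; the paper's models distinguish node and drive failures.
\[
  \mathrm{MTTDL} \;\approx\; \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
\]
```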
workshop on management of replicated data | 1992
Richard A. Golding
Replicated services can be implemented as process groups. Member processes use group communication protocols to communicate amongst themselves and group membership protocols to determine what processes are in the group. These protocols can provide various levels of consistency between members. The author investigates weak consistency protocols that guarantee that messages are delivered to all members, but do not guarantee when. He reports on a new family of communication protocols, an associated group membership mechanism, and current progress in evaluating their efficiency and utility for real applications.
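A toy sketch in the spirit of weak-consistency group communication, anti-entropy style: members log messages and exchange logs pairwise, so every message eventually reaches every member but delivery time is unbounded. The names and exchange policy are illustrative, not the paper's protocol.

```python
import random

# Toy anti-entropy style propagation: members hold a message log and repeatedly
# reconcile logs pairwise. Delivery is guaranteed eventually, not by a deadline.

class Member:
    def __init__(self, name):
        self.name = name
        self.log = {}                      # message id -> payload

    def originate(self, msg_id, payload):
        self.log[msg_id] = payload

    def anti_entropy_with(self, peer):
        # Exchange missing messages in both directions.
        for mid, payload in self.log.items():
            peer.log.setdefault(mid, payload)
        for mid, payload in peer.log.items():
            self.log.setdefault(mid, payload)

members = [Member(f"m{i}") for i in range(5)]
members[0].originate("msg-1", "update X")
for _ in range(20):                        # repeated random pairwise exchanges
    a, b = random.sample(members, 2)
    a.anti_entropy_with(b)
print(sum("msg-1" in m.log for m in members), "of", len(members), "members have msg-1")
```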
petascale data storage workshop | 2007
David O. Bigelow; Suresh Iyer; Tim Kaldewey; Roberto C. Pineiro; Anna Povzner; Scott A. Brandt; Richard A. Golding; Theodore M. Wong; Carlos Maltzahn
Many applications - for example, scientific simulation, real-time data acquisition, and distributed reservation systems - have I/O performance requirements, yet most large, distributed storage systems lack the ability to guarantee I/O performance. We are working on end-to-end performance management in scalable, distributed storage systems. The kinds of storage systems we are targeting include large high-performance computing (HPC) clusters, which require both large data volumes and high I/O rates, as well as large-scale general-purpose storage systems.
international conference on data engineering | 1992
Richard A. Golding; Darrell D. E. Long
A family of communication protocols, called quorum multicasts, is presented that provides efficient communication services for widely replicated data. Quorum multicasts are similar to ordinary multicasts, which deliver a message to a set of destinations. The protocols extend this model by allowing delivery to a subset of the destinations, selected according to distance or expected data currency. These protocols provide well-defined failure semantics, and can distinguish between communication failure and replica failure with high probability. The authors have evaluated their performance, taking measurements of communication latency and failure in the Internet. A simulation study of quorum multicasts showed that they provide low latency and require few messages. A second study that measured a test application running at several sites confirmed these results.
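A minimal sketch of the delivery policy this suggests, contacting the nearest replicas first and expanding outward only when some of them fail; the latency table, quorum size, and failure simulation are illustrative, not taken from the paper.

```python
# Sketch of a quorum multicast: deliver a message to the q "closest" replicas
# first, falling through to farther ones when nearer replicas fail to respond.

def quorum_multicast(message, replicas, latency_ms, quorum, send):
    """replicas: replica names; latency_ms: name -> estimated latency;
    send(replica, message) returns True on success. Returns acknowledging replicas."""
    ordered = sorted(replicas, key=lambda r: latency_ms[r])   # nearest first
    acked = []
    for replica in ordered:
        if len(acked) >= quorum:
            break                              # quorum reached; stop early
        if send(replica, message):
            acked.append(replica)
        # On failure, the loop simply tries the next-nearest replica.
    if len(acked) < quorum:
        raise RuntimeError("quorum not reached: communication or replica failure")
    return acked

latency = {"r1": 5, "r2": 12, "r3": 40, "r4": 90, "r5": 150}
ok = quorum_multicast("write v2", list(latency), latency, quorum=3,
                      send=lambda r, m: r != "r2")            # pretend r2 is down
print(ok)  # ['r1', 'r3', 'r4']: nearest available replicas forming a quorum
```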
Archive | 2005
Richard A. Golding; Theodore M. Wong; Omer Ahmed Zaki