Ron Brightwell
Sandia National Laboratories
Publications
Featured research published by Ron Brightwell.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Kurt B. Ferreira; Patrick G. Bridges; Ron Brightwell
Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale. In this paper, we examine the sensitivity of real-world, large-scale applications to a range of OS noise patterns using a kernel-based noise injection mechanism implemented in the Catamount lightweight kernel. Our results demonstrate the importance of how noise is generated, in terms of frequency and duration, and how this impact changes with application scale. For example, our results show that 2.5% net processor noise at 10,000 nodes can have no impact or can slow the same application down by more than a factor of 20, depending solely on how the noise is generated. We also discuss how the characteristics of the applications we studied, for example computation/communication ratios and collective communication sizes, relate to their tendency to amplify or absorb noise. Finally, we discuss the implications of our findings on the design of new operating systems, middleware, and other system services for high-end parallel systems.
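The central variable in this study is not how much noise there is but how it is shaped. The sketch below (plain C, with parameters chosen purely for illustration and not taken from the paper's Catamount injection mechanism) shows two patterns with an identical 2.5% net noise budget: one injected as frequent, short interruptions and one as rare, long ones. The paper's result is that these two shapes can produce anywhere from no slowdown to more than a 20x slowdown at scale.

    /* Illustrative sketch: two ways to spend the same 2.5% noise budget,
     * differing only in frequency and duration. Not the paper's code. */
    #include <stdio.h>

    int main(void)
    {
        double net_noise = 0.025;                 /* 2.5% of CPU time stolen */

        /* Pattern A: high-frequency, short-duration (timer-tick-like) */
        double period_a = 1e-3;                   /* one event every 1 ms    */
        double dur_a    = net_noise * period_a;   /* -> 25 us per event      */

        /* Pattern B: low-frequency, long-duration (daemon-like) */
        double period_b = 1.0;                    /* one event every 1 s     */
        double dur_b    = net_noise * period_b;   /* -> 25 ms per event      */

        printf("A: %.0f us every %.0f ms -> %.1f%% net noise\n",
               dur_a * 1e6, period_a * 1e3, 100.0 * dur_a / period_a);
        printf("B: %.0f ms every %.1f s  -> %.1f%% net noise\n",
               dur_b * 1e3, period_b, 100.0 * dur_b / period_b);
        return 0;
    }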
International Parallel and Distributed Processing Symposium | 2010
John R. Lange; Kevin Pedretti; Trammell Hudson; Peter A. Dinda; Zheng Cui; Lei Xia; Patrick G. Bridges; Andy Gocke; Steven Jaconette; Michael J. Levenhagen; Ron Brightwell
Palacios is a new open-source VMM under development at Northwestern University and the University of New Mexico that enables applications executing in a virtualized environment to achieve scalable high performance on large machines. Palacios functions as a modularized extension to Kitten, a high performance operating system being developed at Sandia National Laboratories to support large-scale supercomputing applications. Together, Palacios and Kitten provide a thin layer over the hardware to support full-featured virtualized environments alongside Kitten's lightweight native environment. Palacios supports existing, unmodified applications and operating systems by using the hardware virtualization technologies in recent AMD and Intel processors. Additionally, Palacios leverages Kitten's simple memory management scheme to enable low-overhead pass-through of native devices to a virtualized environment. We describe the design, implementation, and integration of Palacios and Kitten. Our benchmarks show that Palacios provides near-native (within 5%), scalable performance for virtualized environments running important parallel applications. This new architecture provides an incremental path for applications to use supercomputers running specialized lightweight host operating systems, without significant performance compromise.
International Parallel and Distributed Processing Symposium | 2002
Ron Brightwell; Rolf Riesen; Bill Lawry; Arthur B. Maccabe
This paper describes the evolution of the Portals message passing architecture and programming interface from its initial development on tightly-coupled massively parallel platforms to the current implementation running on a 1792-node commodity PC Linux cluster. Portals provides the basic building blocks needed for higher-level protocols to implement scalable, low-overhead communication. Portals has several unique characteristics that differentiate it from other high-performance system-area data movement layers. This paper discusses several of these features and illustrates how they can impact the scalability and performance of higher-level message passing protocols.
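The building-block character of Portals is easier to see in miniature. The toy, single-process simulation below captures only the matching idea: the receiver publishes match entries up front, and an incoming put is steered directly into the matching user buffer, which is what lets higher-level protocols such as MPI avoid extra copies and per-message kernel involvement. The names and structures are illustrative pseudocode, not the actual Portals API.

    /* Toy simulation of match-driven data delivery. NOT the real Portals API. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define MAX_ENTRIES 8

    struct match_entry {
        uint64_t match_bits;   /* could encode, e.g., rank + tag + context */
        void    *buffer;       /* where a matching message is deposited    */
        size_t   length;
        int      valid;
    };

    static struct match_entry table[MAX_ENTRIES];

    /* "Target" side: pre-publish a buffer for messages matching 'bits'. */
    static void portal_append(uint64_t bits, void *buf, size_t len)
    {
        for (int i = 0; i < MAX_ENTRIES; i++)
            if (!table[i].valid) {
                table[i] = (struct match_entry){ bits, buf, len, 1 };
                return;
            }
    }

    /* "Initiator" side: one-sided put; matching happens at the target. */
    static void portal_put(uint64_t bits, const void *data, size_t len)
    {
        for (int i = 0; i < MAX_ENTRIES; i++)
            if (table[i].valid && table[i].match_bits == bits) {
                size_t n = len < table[i].length ? len : table[i].length;
                memcpy(table[i].buffer, data, n);   /* lands in the user buffer */
                table[i].valid = 0;
                return;
            }
        /* No match: a real implementation buffers or drops per its policy. */
    }

    int main(void)
    {
        char inbox[32] = {0};
        portal_append(0x2a, inbox, sizeof inbox);   /* like posting a receive */
        portal_put(0x2a, "hello via portals", 18);  /* the "send" matches it  */
        printf("%s\n", inbox);
        return 0;
    }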
Parallel Computing | 2000
Ron Brightwell; Lee Ann Fisk; David S. Greenberg; Trammell Hudson; Michael J. Levenhagen; Arthur B. Maccabe; Rolf Riesen
The Computational Plant (Cplant) project at Sandia National Laboratories is developing a large-scale, massively parallel computing resource from a cluster of commodity computing and networking components. We are combining the benefits of commodity cluster computing with our expertise in designing, developing, using, and maintaining large-scale, massively parallel processing (MPP) machines. In this paper, we present the design goals of the cluster and an approach to developing a commodity-based computational resource capable of delivering performance comparable to production-level MPP machines. We provide a description of the hardware components of a 96-node Phase I prototype machine and discuss the experiences with the prototype that led to the hardware choices for a 400-node Phase II production machine. We give a detailed description of the management and runtime software components of the cluster and offer computational performance data as well as performance measurements of functions that are critical to the management of large systems.
International Conference on Cluster Computing | 2002
William Lawry; Christopher Wilson; Arthur B. Maccabe; Ron Brightwell
This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two methods to characterize the ability of messages to make progress concurrently with computational processing on the host processor(s). COMB measures the relationship between MPI communication bandwidth and host CPU availability.
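A rough sense of what COMB measures can be conveyed with a few lines of MPI. The sketch below is not the COMB code itself; the message size, the amount of compute, and the simple availability ratio are all arbitrary choices. It times a fixed compute loop with and without a large receive in flight: if the network hardware and MPI library can progress the transfer on their own, the two times are close and host CPU availability stays high.

    /* Sketch only: crude estimate of host availability during a transfer. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (8 * 1024 * 1024)    /* arbitrary large message */

    static double compute_loop(long iters)
    {
        volatile double x = 1.0;
        for (long i = 0; i < iters; i++)
            x = x * 1.0000001 + 1e-9;   /* stand-in for application work */
        return x;
    }

    int main(int argc, char **argv)
    {
        int rank, go = 1;
        long iters = 50L * 1000 * 1000;
        char *buf = malloc(NBYTES);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Baseline: compute loop with no communication in flight. */
            double t0 = MPI_Wtime();
            compute_loop(iters);
            double base = MPI_Wtime() - t0;

            /* Same loop while a large receive is pending. */
            MPI_Request req;
            MPI_Irecv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Send(&go, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);  /* tell rank 1 to start */
            t0 = MPI_Wtime();
            compute_loop(iters);
            double overlapped = MPI_Wtime() - t0;
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            printf("approx. host CPU availability during transfer: %.1f%%\n",
                   100.0 * base / overlapped);
        } else if (rank == 1) {
            MPI_Recv(&go, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Run with two ranks (for example, mpirun -np 2 ./overlap_sketch).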
International Conference on Parallel Processing | 2004
Keith D. Underwood; Ron Brightwell
It is well known that traditional microbenchmarks do not fully capture the salient architectural features that impact application performance. Even worse, microbenchmarks that target MPI and the communications subsystem do not accurately represent the way that applications use MPI. For example, traditional MPI latency benchmarks time a ping-pong communication with one send and one receive on each of two nodes. The time to post the receive is never counted as part of the latency. This scenario is not even marginally representative of most applications. Two new microbenchmarks are presented here that analyze network latency in a way that more realistically represents the way that MPI is typically used. These benchmarks are used to evaluate modern high-performance networks, including Quadrics, InfiniBand, and Myrinet.
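The distinction the paper draws can be illustrated with a small two-rank MPI program. The sketch below is not the paper's benchmark, and the two timings it prints are only roughly comparable, but it contrasts the case most microbenchmarks report (the receive already posted when the message arrives) with the case many applications actually hit (the message arrives first, so the timed path includes posting the receive and matching against the unexpected-message queue).

    /* Sketch only: pre-posted vs. posted-late receive on two MPI ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, msg = 42, out;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Case 1: the receive is posted before the message arrives --
         * the scenario traditional latency microbenchmarks measure. */
        if (rank == 1) {
            MPI_Request req;
            MPI_Irecv(&out, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Barrier(MPI_COMM_WORLD);        /* sender fires after this */
            double t0 = MPI_Wtime();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("pre-posted receive completed in %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6);
        } else {
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == 0)
                MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }

        /* Case 2: the message is sent before the receive exists, so it is
         * queued as unexpected; the timed receive then includes posting
         * and matching against the unexpected queue. */
        if (rank == 0)
            MPI_Send(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);            /* small eager message has
                                                   very likely arrived */
        if (rank == 1) {
            double t0 = MPI_Wtime();
            MPI_Recv(&out, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("late-posted receive completed in %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6);
        }

        MPI_Finalize();
        return 0;
    }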
International Parallel and Distributed Processing Symposium | 2004
Ron Brightwell; Keith D. Underwood
Modern cluster interconnection networks rely on processing on the network interface to deliver higher bandwidth and lower latency than what could be achieved otherwise. These processors are relatively slow, but they provide adequate capabilities to accelerate some portion of the protocol stack in a cluster computing environment. This offload capability is conceptually appealing, but the standard evaluation of NIC-based protocol implementations relies on simplistic microbenchmarks that create idealized usage scenarios. We evaluate characteristics of MPI usage scenarios using application benchmarks to help define the parameter space that protocol offload implementations should target. Specifically, we analyze characteristics that we expect to have an impact on NIC resource allocation and management strategies, including the length of the MPI posted receive and unexpected message queues, the number of entries in these queues that are examined for a typical operation, and the number of unexpected and expected messages.
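The two queues the study measures come directly from MPI's matching rules, and a toy program makes them concrete. In the sketch below (illustrative counts and tags, not taken from the paper), phase 1 sends before the matching receives exist, so the messages accumulate on the receiver's unexpected-message queue; phase 2 posts the receives first, so each arriving message is matched against a long posted-receive queue.

    /* Sketch only: producing long unexpected and posted-receive queues. */
    #include <mpi.h>

    #define N 64

    int main(int argc, char **argv)
    {
        int rank, v, bufs[N];
        MPI_Request reqs[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Phase 1: messages arrive before receives are posted, so they sit
         * on rank 1's unexpected queue until the matching MPI_Recv appears. */
        if (rank == 0)
            for (int i = 0; i < N; i++) {
                v = i;
                MPI_Send(&v, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
            }
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1)
            for (int i = 0; i < N; i++)
                MPI_Recv(&bufs[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);

        /* Phase 2: receives are posted first, so rank 1's posted-receive
         * queue holds N entries; sending in reverse tag order forces a
         * typical list-based implementation to search deep into it. */
        if (rank == 1)
            for (int i = 0; i < N; i++)
                MPI_Irecv(&bufs[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD, &reqs[i]);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0)
            for (int i = N - 1; i >= 0; i--) {
                v = i;
                MPI_Send(&v, 1, MPI_INT, 1, i, MPI_COMM_WORLD);
            }
        if (rank == 1)
            MPI_Waitall(N, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }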
International Conference on Supercomputing | 2004
Ron Brightwell; Keith D. Underwood
The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of MPI to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have been poorly studied at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. This paper extends this previous work by further qualifying the source of the performance advantage (offload, overlap, or independent progress).
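The independent-progress question is easy to pose in code even if it is hard to answer in general. In the rough sketch below (message size, compute time, and the yes/no test are arbitrary choices, and the outcome depends entirely on the MPI implementation and NIC), rank 1 posts a large receive, computes without making any MPI calls, and then checks whether the transfer completed on its own. With offload and true independent progress it often will have; otherwise it tends to complete only after the library is re-entered.

    /* Sketch only: does a large transfer progress while the host computes? */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (16 * 1024 * 1024)   /* large enough for a rendezvous protocol on most MPIs */

    int main(int argc, char **argv)
    {
        int rank, ready = 1, flag = 0;
        char *buf = malloc(NBYTES);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            MPI_Request req;
            MPI_Irecv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Send(&ready, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

            /* Compute for a while, deliberately making no MPI calls. */
            volatile double x = 1.0;
            for (long i = 0; i < 200L * 1000 * 1000; i++)
                x = x * 1.0000001 + 1e-9;

            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
            printf("transfer finished before re-entering MPI? %s\n",
                   flag ? "yes" : "no");
            if (!flag)
                MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 0) {
            MPI_Recv(&ready, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }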
Conference on High Performance Computing (Supercomputing) | 1997
David S. Greenberg; Ron Brightwell; Lee Ann Fisk; Arthur Maccabe; Rolf Riesen
MPP systems can neither solve Grand Challenge scientific problems nor enable large-scale industrial and governmental simulations if they rely on extensions to workstation system software. We present a new system architecture used at Sandia. Highest performance is achieved through a lightweight applications interface to a collection of processing nodes. Usability is provided by creating node partitions specialized for user access, networking, and I/O. The system is glued together by a data movement interface called portals. Portals allow data to flow between processing nodes with minimal system overhead while maintaining a suitable degree of protection and reconfigurability.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Ron Brightwell; Kevin Pedretti; Trammell Hudson
This paper describes SMARTMAP, an operating system technique that implements fixed-offset virtual memory addressing. SMARTMAP allows the application processes on a multi-core processor to directly access each other's memory without the overhead of kernel involvement. When used to implement MPI, SMARTMAP eliminates all extraneous memory-to-memory copies imposed by UNIX-based shared memory strategies. In addition, SMARTMAP can easily support operations that UNIX-based shared memory cannot, such as direct, in-place MPI reduction operations and one-sided get/put operations. We have implemented SMARTMAP in the Catamount lightweight kernel for the Cray XT and modified MPI and Cray SHMEM libraries to use it. Micro-benchmark performance results show that SMARTMAP allows for significant improvements in latency, bandwidth, and small message rate on a quad-core processor.
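The core of the technique is address arithmetic rather than data movement. The sketch below illustrates the fixed-offset idea; the 2^39-byte slot size and the "slot 0 is self" convention are assumptions made for illustration, not necessarily the layout Catamount uses. Under such a kernel, a put into a peer's buffer becomes an ordinary store or memcpy to the translated address, with no system call and no bounce through an intermediate shared-memory region; in a normal process the translated address is of course not actually mapped.

    /* Illustrative fixed-offset translation in the SMARTMAP style. */
    #include <stdint.h>
    #include <stdio.h>

    #define SLOT_SHIFT 39ULL   /* assumed: one top-level page-table slot per core */

    /* Turn a virtual address valid in peer 'core' into an address this
     * process could dereference directly, assuming the kernel aliases each
     * core's address space at a fixed per-core offset. */
    static inline void *remote_addr(const void *peer_va, unsigned core)
    {
        uintptr_t va = (uintptr_t)peer_va;
        return (void *)(((uintptr_t)(core + 1) << SLOT_SHIFT) | va);
    }

    int main(void)
    {
        int x = 42;   /* pretend peer core 3 holds this at the same virtual address */
        printf("local %p -> view of core 3's copy at %p\n",
               (void *)&x, remote_addr(&x, 3));
        return 0;
    }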