Publication


Featured research published by Barry Rountree.


International Conference on Supercomputing | 2009

Adagio: making DVS practical for complex HPC applications

Barry Rountree; David K. Lowenthal; Bronis R. de Supinski; Martin Schulz; Vincent W. Freeh; Tyler K. Bletsch

Power and energy are first-order design constraints in high performance computing. Current research using dynamic voltage scaling (DVS) relies on trading increased execution time for energy savings, which is unacceptable for most high performance computing applications. We present Adagio, a novel runtime system that makes DVS practical for complex, real-world scientific applications by incurring only negligible delay while achieving significant energy savings. Adagio improves and extends previous state-of-the-art algorithms by combining the lessons learned from static energy-reducing CPU scheduling with a novel runtime mechanism for slack prediction. We present results using Adagio for two real-world programs, UMT2K and ParaDiS, along with the NAS Parallel Benchmark suite. While requiring no modification to the application source code, Adagio provides total system energy savings of 8% and 20% for UMT2K and ParaDiS, respectively, with less than 1% increase in execution time.
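The slack-directed frequency choice at the heart of Adagio can be sketched as follows; the frequency set, power draws, and task timings below are invented for illustration and are not Adagio's actual mechanism or data:

```python
# Rough sketch of slack-directed frequency selection in the spirit of Adagio.
# All numbers (frequencies, power draws, task times) are made-up illustrations.

FREQS_GHZ = [2.6, 2.2, 1.8, 1.4]                          # available DVS states (assumed)
POWER_W = {2.6: 95.0, 2.2: 78.0, 1.8: 64.0, 1.4: 53.0}    # assumed CPU power per state

def pick_frequency(compute_time_s, predicted_slack_s):
    """Choose the slowest frequency whose slowdown still fits in the slack.

    At frequency f, CPU-bound work scaled from the top frequency f0 takes
    roughly compute_time * f0 / f; the extra time must not exceed the slack
    predicted for the upcoming blocking MPI call.
    """
    f0 = FREQS_GHZ[0]
    best = f0
    for f in FREQS_GHZ:
        stretched = compute_time_s * f0 / f
        if stretched - compute_time_s <= predicted_slack_s:
            best = f  # list is sorted descending, so the last fit is the slowest
    return best

def energy_j(compute_time_s, freq):
    f0 = FREQS_GHZ[0]
    return POWER_W[freq] * compute_time_s * f0 / freq

# A rank with 1.0 s of compute and 0.6 s of predicted slack before a blocking receive:
chosen = pick_frequency(1.0, 0.6)
saved = energy_j(1.0, FREQS_GHZ[0]) - energy_j(1.0, chosen)
```

Because the slack is absorbed by slower execution rather than idle waiting, energy drops while the critical path is (ideally) unchanged.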


IEEE Transactions on Parallel and Distributed Systems | 2007

Analyzing the Energy-Time Trade-Off in High-Performance Computing Applications

Vincent W. Freeh; David K. Lowenthal; Feng Pan; Nandini Kappiah; Robert Springer; Barry Rountree; Mark Edward Femal

Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. One approach to lowering energy and power is to use high-performance cluster nodes that have several power-performance states, so that the energy-time trade-off can be adjusted dynamically. This paper analyzes the energy-time trade-off of a wide range of applications, serial and parallel, on a power-scalable cluster. We use a cluster of frequency- and voltage-scalable AMD-64 nodes, each equipped with a power meter. We study the effects of memory and communication bottlenecks via direct measurement of time and energy. We also investigate metrics that can, at runtime, predict when each type of bottleneck occurs. Our results show that, for programs that have a memory or communication bottleneck, a power-scalable cluster can save significant energy with only a small time penalty. Furthermore, we find that, for some programs, it is possible to both consume less energy and execute in less time by increasing the number of nodes while reducing the frequency-voltage setting of each node.
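The trade-off the paper measures can be illustrated with a toy timing model in which only CPU cycles scale with frequency; all frequencies and power numbers below are assumptions, not measurements from the paper:

```python
# Toy model of the energy-time trade-off under frequency scaling.
# time(f) = t_cpu * f0/f + t_mem   (memory stalls do not scale with frequency)
# power(f) is an assumed, illustrative per-node power curve.

F0, F1 = 2.0, 1.4                      # GHz: top and reduced frequency (assumed)
POWER = {2.0: 90.0, 1.4: 60.0}         # watts (assumed)

def time_s(t_cpu, t_mem, f):
    return t_cpu * F0 / f + t_mem

def energy(t_cpu, t_mem, f):
    return POWER[f] * time_s(t_cpu, t_mem, f)

# Memory-bound program: 2 s of CPU work, 8 s of memory stalls.
slowdown_membound = time_s(2, 8, F1) / time_s(2, 8, F0)
saving_membound = 1 - energy(2, 8, F1) / energy(2, 8, F0)

# CPU-bound program: 8 s of CPU work, 2 s of memory stalls.
slowdown_cpubound = time_s(8, 2, F1) / time_s(8, 2, F0)
```

The memory-bound case gets a large energy saving for a small slowdown, while the CPU-bound case pays a much larger time penalty, which mirrors the bottleneck-dependent behavior the paper measures.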


Conference on High Performance Computing (Supercomputing) | 2007

Bounding energy consumption in large-scale MPI programs

Barry Rountree; David K. Lowenthal; Shelby Funk; Vincent W. Freeh; Bronis R. de Supinski; Martin Schulz

Power is now a first-order design constraint in large-scale parallel computing. Used carefully, dynamic voltage scaling can execute parts of a program at a slower CPU speed to achieve energy savings with a relatively small (possibly zero) time delay. However, the problem of when to change frequencies in order to optimize energy savings is NP-complete, which has led to many heuristic energy-saving algorithms. To determine how closely these algorithms approach optimal savings, we developed a system that determines a bound on the energy savings for an application. Our system uses a linear programming solver that takes as inputs the application communication trace and the cluster power characteristics and then outputs a schedule that realizes this bound. We apply our system to three scientific programs, two of which exhibit load imbalance---particle simulation and UMT2K. Results from our bounding technique show particle simulation is more amenable to energy savings than UMT2K.
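The bound the paper computes with a linear programming solver can be illustrated, for a toy three-task trace small enough to enumerate, by exhaustive search over per-task frequency assignments; the frequencies, power draws, and deadline below are invented:

```python
# The paper bounds energy savings by solving an LP over a communication trace.
# For a toy trace, exhaustive search over per-task frequency assignments gives
# the same bound; all numbers here are illustrative, not from the paper.
import itertools

FREQS = [2.4, 1.8, 1.2]                     # GHz (assumed)
POWER = {2.4: 100.0, 1.8: 60.0, 1.2: 40.0}  # watts (assumed)
TASKS = [3.0, 1.0, 2.0]                     # seconds of CPU work at the top frequency
DEADLINE = 8.1                              # allowed total time (slack available)

def schedule_cost(assignment):
    t = sum(w * FREQS[0] / f for w, f in zip(TASKS, assignment))
    e = sum(POWER[f] * w * FREQS[0] / f for w, f in zip(TASKS, assignment))
    return t, e

# Minimum energy over all frequency assignments that meet the deadline:
best_energy = min(
    e
    for fs in itertools.product(FREQS, repeat=len(TASKS))
    for t, e in [schedule_cost(fs)]
    if t <= DEADLINE
)
baseline = schedule_cost((FREQS[0],) * len(TASKS))[1]
```

Any heuristic runtime algorithm can then be judged by how close it comes to `best_energy`, which is exactly how the paper uses its LP-derived bound.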


International Conference on Supercomputing | 2008

A regression-based approach to scalability prediction

Bradley J. Barnes; Barry Rountree; David K. Lowenthal; Jaxk Reeves; Bronis R. de Supinski; Martin Schulz

Many applied scientific domains are increasingly relying on large-scale parallel computation. Consequently, many large clusters now have thousands of processors. However, the ideal number of processors to use for these scientific applications varies with both the input variables and the machine under consideration, and predicting this processor count is rarely straightforward. Accurate prediction mechanisms would provide many benefits, including improving cluster efficiency and identifying system configuration or hardware issues that impede performance. We explore novel regression-based approaches to predict parallel program scalability. We use several program executions on a small subset of the processors to predict execution time on larger numbers of processors. We compare three different regression-based techniques: one based on execution time only; another that uses per-processor information only; and a third one based on the global critical path. These techniques provide accurate scaling predictions, with median prediction errors between 6.2% and 17.3% for seven applications.
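The simplest of the three techniques, predicting from execution time alone, can be sketched as a power-law fit in log-log space; the timings below are synthetic, not the paper's data:

```python
# Sketch of execution-time-only scalability prediction: fit t(p) = a * p^b
# on small runs by least squares in log-log space, then extrapolate.
import math

def fit_powerlaw(counts, times):
    xs = [math.log(p) for p in counts]
    ys = [math.log(t) for t in times]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = math.exp(ybar - b * xbar)
    return a, b

# Synthetic strong-scaling data with perfect scaling: t = 1000 / p.
counts = [4, 8, 16, 32]
times = [1000 / p for p in counts]
a, b = fit_powerlaw(counts, times)
predicted_256 = a * 256 ** b    # extrapolate to a processor count never run
```

Real applications deviate from a clean power law, which is why the paper also evaluates per-processor and critical-path techniques and reports median errors rather than exact fits.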


International Parallel and Distributed Processing Symposium | 2012

Beyond DVFS: A First Look at Performance under a Hardware-Enforced Power Bound

Barry Rountree; Dong H. Ahn; Bronis R. de Supinski; David K. Lowenthal; Martin Schulz

Dynamic Voltage Frequency Scaling (DVFS) has been the tool of choice for balancing power and performance in high-performance computing (HPC). With the introduction of Intel's Sandy Bridge family of processors, researchers now have a far more attractive option: user-specified, dynamic, hardware-enforced processor power bounds. In this paper we provide a first look at this technology in the HPC environment and detail both the opportunities and potential pitfalls of using this technique to control processor power. As part of this evaluation we measure power and performance for single-processor instances of several of the NAS Parallel Benchmarks. Additionally, we focus on the behavior of a single benchmark, MG, under several different power bounds. We quantify the well-known manufacturing variation in processor power efficiency and show that, in the absence of a power bound, this variation has no correlation with performance. We then show that execution under a power bound translates this variation in efficiency into variation in performance.
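The paper's central observation, that a power bound converts efficiency variation into performance variation, can be illustrated with invented per-part efficiency numbers:

```python
# Without a power bound, manufacturing variation in efficiency is invisible:
# every part reaches the same frequency and finishes in the same time.
# Under a hard power bound, throughput = bound / (energy per unit of work),
# so less efficient parts run slower. All numbers below are invented.

WORK_JOULES_PER_UNIT = {            # energy per unit of work, varies by part
    "cpu_a": 1.00, "cpu_b": 1.12, "cpu_c": 0.95,
}
UNBOUNDED_TIME_S = 10.0             # identical for all parts when unbounded
POWER_BOUND_W = 50.0
WORK_UNITS = 600.0

bounded_time = {
    cpu: WORK_UNITS * epu / POWER_BOUND_W
    for cpu, epu in WORK_JOULES_PER_UNIT.items()
}
```

The most efficient part finishes first and the least efficient last, even though all three were indistinguishable without the bound.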


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006

Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster

Robert Springer; David K. Lowenthal; Barry Rountree; Vincent W. Freeh

Recently, the high-performance computing community has realized that power is a performance-limiting factor. One reason for this is that supercomputing centers have limited power capacity and machines are starting to hit that limit. In addition, the cost of energy has become increasingly significant, and the heat produced by higher-energy components tends to reduce their reliability. One way to reduce power (and therefore energy) requirements is to use high-performance cluster nodes that are frequency- and voltage-scalable (e.g., AMD-64 processors). The problem we address in this paper is: given a target program, a power-scalable cluster, and an upper limit for energy consumption, choose a schedule (number of nodes and CPU frequency) that simultaneously (1) satisfies that energy limit and (2) minimizes execution time. There are too many schedules for an exhaustive search. Therefore, we find a schedule through a novel combination of performance modeling, performance prediction, and program execution. Using our technique, we are able to find a near-optimal schedule for all of our benchmarks in just a handful of partial program executions.
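The schedule-selection problem can be sketched as a search over (node count, frequency) pairs under an energy cap; the speedup and power models below are illustrative assumptions, and the paper's contribution is precisely avoiding this kind of exhaustive search by predicting from partial executions:

```python
# Toy version of the scheduling problem: among (nodes, frequency) pairs,
# find the fastest schedule whose energy stays under the cap.
# Work, efficiency, power, and the cap are invented for illustration.

FREQS = [2.0, 1.8, 1.4]                          # GHz (assumed)
NODE_POWER = {2.0: 90.0, 1.8: 70.0, 1.4: 50.0}   # watts per node (assumed)
WORK = 1200.0                                    # node-seconds of work at 2.0 GHz
ENERGY_CAP = 100000.0                            # joules

def runtime(nodes, f, efficiency=0.9):
    # Fixed total work, imperfect parallel efficiency, CPU-bound scaling.
    return WORK * FREQS[0] / f / (nodes * efficiency)

best = min(
    ((runtime(n, f), n, f)
     for n in (4, 8, 16)
     for f in FREQS
     if runtime(n, f) * n * NODE_POWER[f] <= ENERGY_CAP),
    default=None,
)
```

In this toy instance only the lowest frequency fits under the cap, and the fastest feasible schedule uses the most nodes at that frequency, the same "more nodes, lower gear" shape the paper finds for several benchmarks.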


International Conference on Supercomputing | 2013

Exploring hardware overprovisioning in power-constrained, high performance computing

Tapasya Patki; David K. Lowenthal; Barry Rountree; Martin Schulz; Bronis R. de Supinski

Most recent research in power-aware supercomputing has focused on making individual nodes more efficient and measuring the results in terms of flops per watt. While this work is vital in order to reach exascale computing at 20 megawatts, there has been a dearth of work that explores efficiency at the whole system level. Traditional approaches in supercomputer design use worst-case power provisioning: the total power allocated to the system is determined by the maximum power draw possible per node. In a world where power is plentiful and nodes are scarce, this solution is optimal. However, as power becomes the limiting factor in supercomputer design, worst-case provisioning becomes a drag on performance. In this paper we demonstrate how a policy of overprovisioning hardware with respect to power combined with intelligent, hardware-enforced power bounds consistently leads to greater performance across a range of standard benchmarks. In particular, leveraging overprovisioning requires that applications use effective configurations; the best configuration depends on application scalability and memory contention. We show that using overprovisioning leads to an average speedup of more than 50% over worst-case provisioning.
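A toy comparison of worst-case provisioning against overprovisioned, power-capped nodes; the budget, TDP, and the sublinear power-to-performance response are invented for illustration:

```python
# With a fixed machine-level power budget, compare (a) fewer nodes at full
# power against (b) more nodes, each capped below TDP. The performance
# response to a per-node power cap is an assumed sublinear curve.

BUDGET_W = 3200.0
TDP_W = 400.0       # assumed worst-case per-node draw
MIN_CAP_W = 200.0   # assumed lowest usable per-node cap

def node_perf(cap_w):
    # Diminishing returns: performance grows sublinearly with per-node power.
    return (cap_w / TDP_W) ** 0.5

def system_perf(nodes, cap_w, efficiency=0.85):
    return nodes * efficiency * node_perf(cap_w)

worst_case = system_perf(int(BUDGET_W // TDP_W), TDP_W)   # 8 nodes at full TDP
overprov = max(
    system_perf(n, BUDGET_W / n)
    for n in range(8, 17)
    if MIN_CAP_W <= BUDGET_W / n <= TDP_W
)
```

Because the last watts into a node buy the least performance, spreading the budget over more capped nodes wins, which is the intuition behind the paper's measured speedups.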


International Parallel and Distributed Processing Symposium | 2013

A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures

Shuaiwen Song; Chun-Yi Su; Barry Rountree; Kirk W. Cameron

Emergent heterogeneous systems must be optimized for both power and performance at exascale. Massive parallelism combined with complex memory hierarchies forms a barrier to efficient application and architecture design. These challenges are exacerbated with GPUs as parallelism increases by orders of magnitude and power consumption can easily double. Models have been proposed to isolate power and performance bottlenecks and identify their root causes. However, no current models combine simplicity, accuracy, and support for emergent GPU architectures (e.g. NVIDIA Fermi). We combine hardware performance counter data with machine learning and advanced analytics to model power-performance efficiency for modern GPU-based systems. Our performance counter based approach is simpler than previous approaches and does not require detailed understanding of the underlying architecture. The resulting model is accurate for predicting power (within 2.1%) and performance (within 6.7%) for application kernels on modern GPUs. Our model can identify power-performance bottlenecks and their root causes for various complex computation and memory access patterns (e.g. global, shared, texture). We measure the accuracy of our power and performance models on an NVIDIA Fermi C2075 GPU for more than a dozen CUDA applications. We show our power model is more accurate and robust than the best available GPU power models, the multiple linear regression models MLR and MLR+. We demonstrate how to use our models to identify power-performance bottlenecks and suggest optimization strategies for high-performance codes such as GEM, a biomolecular electrostatic analysis application. We verify that our power-performance model is accurate on clusters of NVIDIA Fermi M2090s and useful for suggesting optimal runtime configurations on the Keeneland supercomputer at Georgia Tech.
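A bare-bones counter-based power model, here an ordinary least-squares fit rather than the paper's machine-learning approach, with synthetic counter rates and power readings:

```python
# Fit power as a linear function of a few counter rates via the normal
# equations. Counter names, rates, and power readings are synthetic.

def solve3(A, b):
    """Gauss-Jordan elimination for a 3x3 system (two counters + intercept)."""
    m = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))   # partial pivoting
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [a - f * c for a, c in zip(m[r], m[i])]
    return [m[i][3] / m[i][i] for i in range(3)]

# rows: [1, flops_rate, dram_rate]; target: measured board power in watts
X = [[1, 0.8, 0.2], [1, 0.5, 0.6], [1, 0.2, 0.9], [1, 0.9, 0.1]]
y = [174.0, 170.0, 161.0, 177.0]

# Normal equations: (X^T X) w = X^T y
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
w = solve3(XtX, Xty)
predicted = w[0] + w[1] * 0.7 + w[2] * 0.3   # power for an unseen counter mix
```

This is the MLR baseline the paper compares against; its point is that such linear models break down on complex access patterns, motivating the richer model.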


IEEE Transactions on Power Systems | 2015

Applying High Performance Computing to Transmission-Constrained Stochastic Unit Commitment for Renewable Energy Integration

Anthony Papavasiliou; Shmuel S. Oren; Barry Rountree

We present a parallel implementation of Lagrangian relaxation for solving stochastic unit commitment subject to uncertainty in renewable power supply and in generator and transmission line failures. We describe a scenario selection algorithm inspired by importance sampling in order to formulate the stochastic unit commitment problem, and we validate its performance by comparing it to a stochastic formulation with a very large number of scenarios, which we are able to solve through parallelization. We examine the impact of narrowing the duality gap on the performance of stochastic unit commitment and compare it to the impact of increasing the number of scenarios in the model. We report results on the running time of the model and discuss the applicability of the method in an operational setting.
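The idea behind importance-sampling-based scenario selection can be sketched with invented scenario probabilities and costs; sampling proportionally to probability times cost and reweighting gives, in this toy setting, a zero-variance estimate of expected cost:

```python
# Sketch of importance-sampling scenario selection: sample scenarios from a
# proposal biased toward high-cost outcomes, then correct with weights p/q.
# Scenario probabilities and costs are invented for illustration.
import random

scenarios = {f"s{i}": {"p": 0.1, "cost": c}
             for i, c in enumerate([1, 2, 3, 5, 8, 13, 21, 34, 55, 90])}

total = sum(s["p"] * s["cost"] for s in scenarios.values())
q = {k: s["p"] * s["cost"] / total for k, s in scenarios.items()}  # proposal

rng = random.Random(0)
names = list(scenarios)
picks = rng.choices(names, weights=[q[k] for k in names], k=2000)

# Importance-weighted estimate of expected cost: E_p[cost] = E_q[(p/q) * cost].
est = sum(scenarios[k]["p"] / q[k] * scenarios[k]["cost"] for k in picks) / len(picks)
exact = sum(s["p"] * s["cost"] for s in scenarios.values())
```

With the proposal exactly proportional to probability times cost, every weighted sample equals the true expectation, which is why cost-aware proposals let a small scenario set stand in for a very large one.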


International Green Computing Conference and Workshops | 2011

Practical performance prediction under Dynamic Voltage Frequency Scaling

Barry Rountree; David K. Lowenthal; Martin Schulz; Bronis R. de Supinski

Predicting performance under Dynamic Voltage Frequency Scaling (DVFS) remains an open problem. Current best practice explores available performance counters to serve as input to linear regression models that predict performance. However, the inaccuracies of these models require that large-scale DVFS runtime algorithms predict performance conservatively in order to avoid significant consequences of mispredictions. Recent theoretical work based on interval analysis advocates a more accurate and reliable solution based on a single new performance counter, Leading Loads. In this paper, we evaluate a processor-independent analytic framework for existing performance counters based on this interval analysis model. We begin with an analysis of the counters used in many published models. We then briefly describe the Leading Loads architectural model and describe how we can use Leading Loads Cycles to predict performance under DVFS. We validate this approach for the NAS Parallel Benchmarks and SPEC CPU 2006 benchmarks, demonstrating an order of magnitude improvement in both error and standard deviation compared to the best existing approaches.
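The Leading Loads model splits cycles into core cycles, which scale with frequency, and leading-load (memory) cycles, which do not; a minimal sketch with invented counter values:

```python
# Minimal Leading Loads-style DVFS performance prediction.
# Core cycles stretch as frequency drops; leading-load (memory) time is fixed.
# Counter values below are invented for illustration.

F_REF = 2.4e9   # Hz, frequency at which the counters were collected (assumed)

def predict_time(total_cycles, leading_load_cycles, f_target):
    core = total_cycles - leading_load_cycles
    return core / f_target + leading_load_cycles / F_REF

# Measured at 2.4 GHz: 24e9 total cycles, 12e9 of them leading-load cycles.
t_ref = predict_time(24e9, 12e9, F_REF)     # time at the reference frequency
t_slow = predict_time(24e9, 12e9, 1.2e9)    # core time doubles, memory time fixed
```

A naive model that scales all cycles would predict 20 s at 1.2 GHz; separating out the memory component predicts 15 s, illustrating why the Leading Loads counter yields far smaller DVFS prediction errors.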

Collaboration

Top co-authors of Barry Rountree:

Martin Schulz, Lawrence Livermore National Laboratory
Bronis R. de Supinski, Lawrence Livermore National Laboratory
Todd Gamblin, Lawrence Livermore National Laboratory
Vincent W. Freeh, North Carolina State University