Stephen W. Poole | Researchain

Publication

Featured research published by Stephen W. Poole.

Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model | 2010

Introducing OpenSHMEM: SHMEM for the PGAS community

Barbara M. Chapman; Tony Curtis; Swaroop Pophale; Stephen W. Poole; Jeffery A. Kuehn; Chuck Koelbel; Lauren Smith

The OpenSHMEM community would like to announce a new effort to standardize SHMEM, a communications library that uses one-sided communication and utilizes a partitioned global address space. OpenSHMEM is an effort to bring together a variety of SHMEM and SHMEM-like implementations into an open standard using a community-driven model. By creating an open-source specification and reference implementation of OpenSHMEM, there will be a wider availability of a PGAS library model on current and future architectures. In addition, the availability of an OpenSHMEM model will enable the development of performance and validation tools. We propose an OpenSHMEM specification to help tie together a number of divergent implementations of SHMEM that are currently available. To support an existing and growing user community, we will develop the OpenSHMEM web presence, including a community wiki and training material, and face-to-face interaction, including workshops and conference participation.

Journal of Computational Physics | 2010

Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors

Ali Khajeh-Saeed; Stephen W. Poole; J. Blair Perot

Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the best-known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith-Waterman algorithm is constrained by memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speedup on up to 4 GPUs.
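For readers unfamiliar with the algorithm, the following is a minimal sketch of the textbook Smith-Waterman scoring recurrence on a CPU. It is not the GPU reformulation described in the paper. The function name, the linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Scoring values and the linear gap penalty are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Extend the diagonal with a match or mismatch.
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Open or extend a gap in either sequence.
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            # Clamping at zero lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Each cell depends on its upper, left, and upper-left neighbors, which is the data dependency that any parallel formulation has to work around.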

international conference on green computing | 2010

Energy-efficient application-aware online provisioning for virtualized clouds and data centers

Ivan Rodero; Juan Jaramillo; Andres Quiroz; Manish Parashar; Francesc Guim; Stephen W. Poole

As energy efficiency and associated costs become key concerns, consolidated and virtualized data centers and clouds are attractive computing platforms for data- and compute-intensive applications. These platforms provide an abstraction of nearly unlimited computing resources through the elastic use of pools of consolidated resources, and provide opportunities for higher utilization and energy savings. Recently, these platforms are also being considered for more traditional high-performance computing (HPC) applications that have typically targeted Grids and similar conventional HPC platforms. However, maximizing energy efficiency, cost-effectiveness, and utilization for these applications while ensuring performance and other Quality of Service (QoS) guarantees requires leveraging important and extremely challenging tradeoffs. These include, for example, the tradeoff between the need to efficiently create and provision Virtual Machines (VMs) on data center resources and the need to accommodate the heterogeneous resource demands and runtimes of these applications. In this paper we present an energy-aware online provisioning approach for HPC applications on consolidated and virtualized computing platforms. Energy efficiency is achieved using a workload-aware, just-right dynamic provisioning mechanism and the ability to power down subsystems of a host system that are not required by the VMs mapped to it. We evaluate the presented approach using real HPC workload traces from widely distributed production systems. The results demonstrate that, compared to typical reactive or predefined provisioning, our approach achieves significant improvements in energy efficiency with an acceptable QoS penalty.

ieee high performance extreme computing conference | 2013

Standards for graph algorithm primitives

Tim Mattson; David A. Bader; Jonathan W. Berry; Aydin Buluç; Jack J. Dongarra; Christos Faloutsos; John Feo; John R. Gilbert; Joseph E. Gonzalez; Bruce Hendrickson; Jeremy Kepner; Charles E. Leiserson; Andrew Lumsdaine; David A. Padua; Stephen W. Poole; Steven P. Reinhardt; Michael Stonebraker; Steve Wallach; Andrew Yoo

It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.
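As an illustration of what "graph algorithms in terms of linear algebraic operations" can look like, here is a minimal sketch of breadth-first search written as a repeated matrix-vector product over a Boolean (OR, AND) semiring. It does not reproduce any proposed standard; the function name `bfs_levels` and the dense adjacency-matrix representation are assumptions made for readability.

```python
def bfs_levels(adj, source):
    """Return the BFS level of every vertex, or -1 if unreachable.

    adj[i][j] is True when there is an edge i -> j (dense Boolean matrix).
    """
    n = len(adj)
    levels = [-1] * n
    frontier = [False] * n
    frontier[source] = True
    levels[source] = 0
    depth = 0
    while any(frontier):
        depth += 1
        # One step is a matrix-vector product over the (OR, AND) semiring:
        # nxt[j] = OR_i (frontier[i] AND adj[i][j]), masked by unvisited vertices.
        nxt = [
            levels[j] == -1 and any(frontier[i] and adj[i][j] for i in range(n))
            for j in range(n)
        ]
        for j in range(n):
            if nxt[j]:
                levels[j] = depth
        frontier = nxt
    return levels
```

Swapping the semiring (for example, using min and plus instead of OR and AND) turns the same loop structure into other graph computations, which is the kind of reuse a set of primitives is meant to enable.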

international symposium on performance analysis of systems and software | 2011

Power signature analysis of the SPECpower_ssj2008 benchmark

Chung-Hsing Hsu; Stephen W. Poole

As the power consumption of a server system becomes a mainstream concern in enterprise environments, understanding the system's power behavior at varying utilization levels provides us a key to selecting appropriate energy-efficiency optimizations. In this work, we present an in-depth analysis of 177 SPECpower_ssj2008 results published between 2007 and 2010 to understand the changes in servers' power behavior over time. In particular, we identified simple nonlinear functions appropriate for modeling the power behavior of today's aggressively power-managed machines. We consider this work an important first step towards developing a capability for power signature analysis of a high-end computer system.

international conference on parallel processing | 2011

Reducing energy usage with memory and computation-aware dynamic frequency scaling

Michael A. Laurenzano; Mitesh R. Meswani; Laura Carrington; Allan Snavely; Mustafa M. Tikir; Stephen W. Poole

Over the life of a modern supercomputer, the energy cost of running the system can exceed the cost of the original hardware purchase. This has driven the community to attempt to understand and minimize energy costs wherever possible. Towards these ends, we present an automated, fine-grained approach to selecting per-loop processor clock frequencies. The clock frequency selection criteria are established through a combination of lightweight static analysis and runtime tracing that automatically acquires application signatures: characterizations of the patterns of execution of each loop in an application. This application characterization is matched with one of a series of benchmark loops, which have been run on the target system and probe it in various ways. These benchmarks form a covering set, a machine characterization of the expected power consumption and performance traits of the machine over the space of execution patterns and clock frequencies. The frequency that confers the optimal behavior in terms of power-delay product for the benchmark that most closely resembles each application loop is the one chosen for that loop. The set of tools that implement this scheme is fully automated, built on top of freely available open source software, and uses an inexpensive power measurement apparatus. We use these tools to show measured, system-wide energy savings of up to 7.6% on an 8-core Intel Xeon E5530 and 10.6% on a 32-core AMD Opteron 8380 (a Sun X4600 Node) across a range of workloads.
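The selection step described in this abstract can be sketched roughly as follows. This is a hypothetical illustration, not the authors' tooling: the two-component signatures, the Euclidean distance used for matching, the function names, and all numbers are assumptions. The power-delay product is taken here as average power multiplied by runtime.

```python
import math

def nearest_benchmark(signature, benchmarks):
    """Return the name of the benchmark whose signature is closest (Euclidean)."""
    return min(
        benchmarks,
        key=lambda name: math.dist(signature, benchmarks[name]["signature"]),
    )

def best_frequency(measurements):
    """Pick the frequency minimizing power * delay.

    measurements maps frequency (GHz) -> (average power in W, runtime in s).
    """
    return min(measurements, key=lambda f: measurements[f][0] * measurements[f][1])

def select_frequencies(loop_signatures, benchmarks):
    """Map each application loop to the best frequency of its nearest benchmark."""
    return {
        loop: best_frequency(benchmarks[nearest_benchmark(sig, benchmarks)]["measurements"])
        for loop, sig in loop_signatures.items()
    }

# Hypothetical covering set: signature = (memory intensity, compute intensity).
example_benchmarks = {
    "memory_bound": {
        "signature": (0.9, 0.1),
        "measurements": {1.6: (150.0, 10.5), 2.4: (200.0, 10.0)},
    },
    "compute_bound": {
        "signature": (0.1, 0.9),
        "measurements": {1.6: (150.0, 15.0), 2.4: (200.0, 10.0)},
    },
}
example_loops = {"loop_a": (0.8, 0.2), "loop_b": (0.2, 0.7)}
print(select_frequencies(example_loops, example_benchmarks))
```

In this made-up data the memory-bound benchmark barely slows down at the lower clock, so its power-delay product favors 1.6 GHz, while the compute-bound benchmark favors 2.4 GHz.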

grid computing | 2010

ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer

This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network task data flow dependencies and to progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA, and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. This provides a means for overlapping collective communications managed by the HCA and computation on the Central Processing Unit (CPU), thus making it possible to reduce the impact of system noise on parallel applications using collective operations. This paper further describes how this new capability can be used to implement scalable Message Passing Interface (MPI) collective operations, describing the high-level details of how this new capability is used to implement the MPI Barrier collective operation, focusing on the latency-sensitive performance aspects of this new capability. This paper concludes with small-scale benchmark experiments comparing implementations of the barrier collective operation, using the new network offload capabilities, with established point-to-point based implementations of these same algorithms, which manage the data flow using the central processing unit. These early results demonstrate the promise this new capability provides to improve the scalability of high-performance applications using collective communications. The latency of the HCA-based implementation of the barrier is similar to that of the best-performing point-to-point based implementation managed by the central processing unit, and it starts to outperform these as the number of processes involved in the collective operation increases.

ieee international symposium on parallel distributed processing workshops and phd forum | 2010

Overlapping computation and communication: Barrier algorithms and ConnectX-2 CORE-Direct capabilities

Richard L. Graham; Stephen W. Poole; Pavel Shamis; Gil Bloch; Noam Bloch; Hillel Chapman; Michael Kagan; Ariel Shahar; Ishai Rabinovitz; Gilad Shainer

This paper explores the computation and communication overlap capabilities enabled by the new CORE-Direct hardware capabilities introduced in the InfiniBand Network Interface Card (NIC) ConnectX-2. We use the latency-dominated nonblocking barrier algorithm in this study, and find that at a process count of 64, a contiguous time slot of about 80% of the nonblocking barrier time is available for computation. This time slot increases as the number of processes participating increases. In contrast, Central Processing Unit (CPU) based implementations provide a time slot of up to 30% of the nonblocking barrier time. This bodes well for the scalability of simulations employing offloaded collective operations. These capabilities can be used to reduce the effects of system noise, and when using nonblocking collective operations may also be used to hide the effects of application load imbalance.

ieee international conference on high performance computing data and analytics | 2013

Modeling and predicting performance of high performance computing applications on hardware accelerators

Mitesh R. Meswani; Laura Carrington; Didem Unat; Allan Snavely; Scott B. Baden; Stephen W. Poole

Hybrid-core systems speed up applications by offloading certain compute operations that can run faster on hardware accelerators. However, such systems require significant programming and porting effort to gain a performance benefit from the accelerators. Therefore, prior to porting it is prudent to investigate the predicted performance benefit of accelerators for a given workload. To address this problem we present a performance-modeling framework that predicts the application performance rapidly and accurately for hybrid-core systems. We present predictions for two full-scale HPC applications, HYCOM and Milc. Our results for two accelerators (GPU and FPGA) show that gather/scatter and stream operations can speed up by as much as a factor of 15, and that the overall compute times of Milc and HYCOM improve by 3.4% and 20%, respectively. We also show that in order to benefit from the accelerators, 70% of the latency of data transfer time between the CPU and the accelerators needs to be overcome.

ieee international conference on high performance computing, data, and analytics | 2010

Investigating the potential of application-centric aggressive power management for HPC workloads

Ivan Rodero; Sharat Chandra; Manish Parashar; Rajeev Muralidhar; Harinarayanan Seshadri; Stephen W. Poole

Energy efficiency of large-scale data centers is becoming a major concern not only for reasons of energy conservation, failures, and cost reduction, but also because such systems are soon reaching the limits of power available to them. Like High Performance Computing (HPC) systems, large-scale cluster-based data centers can consume power in megawatts, and of all the power consumed by such a system, only a fraction is used for actual computations. In this paper, we study the potential of application-centric aggressive power management of data center resources for HPC workloads. Specifically, we consider power management mechanisms and controls (currently or soon to be) available at different levels and for different subsystems, and examine whether several innovative approaches that have been taken to tackle this problem in the last few years can be effectively used in an application-aware manner for HPC workloads. To do this, we first profile standard HPC benchmarks with respect to behaviors, resource usage, and power impact on individual computing nodes. Based on a power and latency model and the workload profiles, we develop an algorithm that can improve energy efficiency with little or no performance loss. We then evaluate our proposed algorithm through simulations using empirical power characterization and quantification. Finally, we validate the simulation results with actual executions on real hardware. The obtained results show that by using application-aware power management, we can reduce the average energy consumption without significant penalty in performance. This motivates us to investigate autonomic approaches for application-aware aggressive power management and cross-layer and cross-function predictive subsystem-level power management for large-scale data centers.
