Thomas R. W. Scogland

Publication

Featured research published by Thomas R. W. Scogland.

international parallel and distributed processing symposium | 2012

Heterogeneous Task Scheduling for Accelerated OpenMP

Thomas R. W. Scogland; Barry Rountree; Wu-chun Feng; Bronis R. de Supinski

Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also all available accelerators. In spite of this fact, the majority of programming models for heterogeneous computing focus on only one of these. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. The extensions are geared toward switching from CPU parallelism to GPU parallelism. However, they do not preserve the former while adding the latter. Thus, computational potential is wasted since either the CPU cores or the GPU cores are left idle. Our goal is to create a runtime system that can intelligently divide an accelerated OpenMP region across all available resources automatically. This paper presents our proof-of-concept runtime system for dynamic task scheduling across CPUs and GPUs. Further, we motivate the addition of this system into the proposed OpenMP for Accelerators standard. Finally, we show that this option can produce as much as a two-fold performance improvement over using either the CPU or GPU alone.

international conference on parallel and distributed systems | 2011

StreamMR: An Optimized MapReduce Framework for AMD GPUs

Marwa Elteir; Heshan Lin; Wu-chun Feng; Thomas R. W. Scogland

MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally focusing on the NVIDIA GPU. Our investigation reveals that the design and mapping of the MapReduce framework need to be revisited for AMD GPUs due to their notable architectural differences from NVIDIA GPUs. For instance, current state-of-the-art MapReduce implementations employ atomic operations to coordinate the execution of different threads. However, atomic operations can implicitly cause inefficient memory access, and in turn, severely impact performance. In this paper, we propose StreamMR, an OpenCL MapReduce framework optimized for AMD GPUs. With efficient atomic-free algorithms for output handling and intermediate result shuffling, StreamMR is superior to atomic-based MapReduce designs and can outperform existing atomic-free MapReduce implementations by nearly five-fold on an AMD Radeon HD 5870.

international parallel and distributed processing symposium | 2009

The Green500 List: Year one

Wu-chun Feng; Thomas R. W. Scogland

The latest release of the Green500 List in November 2008 marked its one-year anniversary. As such, this paper aims to provide an analysis and retrospective examination of the Green500 List in order to understand how the list has evolved and what trends have emerged. In addition, we present community feedback on the Green500 List, particularly from two Green500 birds-of-a-feather (BoF) sessions at the International Supercomputing Conference in June 2008 and SC|08 in November 2008, respectively.

2013 International Green Computing Conference Proceedings | 2013

Trends in energy-efficient computing: A perspective from the Green500

Balaji Subramaniam; Winston A. Saunders; Thomas R. W. Scogland; Wu-chun Feng

A recent study shows that computation per kilowatt-hour has doubled every 1.57 years, akin to Moore's Law. While this trend is encouraging, its implications for high-performance computing (HPC) are not yet clear. For instance, DARPA's target of a 20-MW exaflop system will require a 56.8-fold performance improvement with only a 2.4-fold increase in power consumption, which seems unachievable in light of the above trend. To provide a more comprehensive perspective, we analyze current trends in energy efficiency from the Green500 and project expectations for the near future. Specifically, we first provide an analysis of energy efficiency trends in HPC systems from the Green500. We then model and forecast the energy efficiency of future HPC systems. Next, we present exascalar, a holistic metric to measure the distance from the exaflop goal. Finally, we discuss our efforts to standardize power measurement methodologies in order to provide the community with reliable and accurate efficiency data.
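
The figures quoted in this abstract can be combined in a back-of-the-envelope calculation. The sketch below is an illustration only, not the forecasting model from the paper: it assumes that efficiency keeps doubling at a constant 1.57-year period, and the function names are hypothetical.

```python
import math


def required_efficiency_gain(performance_gain: float, power_gain: float) -> float:
    """Efficiency (performance per watt) must grow by performance gain / power gain."""
    return performance_gain / power_gain


def years_to_reach(gain: float, doubling_period_years: float) -> float:
    """Years needed to reach `gain` if efficiency doubles every `doubling_period_years`.

    Assumption: the doubling trend continues unchanged (pure exponential growth).
    """
    return doubling_period_years * math.log2(gain)


# Figures taken from the abstract: 56.8-fold performance, 2.4-fold power, 1.57-year doubling.
gain = required_efficiency_gain(56.8, 2.4)
years = years_to_reach(gain, 1.57)
print(f"required efficiency gain: {gain:.1f}x")
print(f"years at the stated doubling rate: {years:.1f}")
```

Under these assumptions the required efficiency gain is roughly 23.7-fold, which corresponds to a little over seven years of uninterrupted doubling. Whether that is achievable depends on the target date, which the abstract does not state.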

Journal of Molecular Graphics & Modelling | 2010

Accelerating electrostatic surface potential calculation with multi-scale approximation on graphics processing units

Ramu Anandakrishnan; Thomas R. W. Scogland; Andrew T. Fenley; John C. Gordon; Wu-chun Feng; Alexey V. Onufriev

Tools that compute and visualize biomolecular electrostatic surface potential have been used extensively for studying biomolecular function. However, determining the surface potential for large biomolecules on a typical desktop computer can take days or longer using currently available tools and methods. Two commonly used techniques to speed-up these types of electrostatic computations are approximations based on multi-scale coarse-graining and parallelization across multiple processors. This paper demonstrates that for the computation of electrostatic surface potential, these two techniques can be combined to deliver significantly greater speed-up than either one separately, something that is in general not always possible. Specifically, the electrostatic potential computation, using an analytical linearized Poisson-Boltzmann (ALPB) method, is approximated using the hierarchical charge partitioning (HCP) multi-scale method, and parallelized on an ATI Radeon 4870 graphical processing unit (GPU). The implementation delivers a combined 934-fold speed-up for a 476,040 atom viral capsid, compared to an equivalent non-parallel implementation on an Intel E6550 CPU without the approximation. This speed-up is significantly greater than the 42-fold speed-up for the HCP approximation alone or the 182-fold speed-up for the GPU alone.

international conference on parallel and distributed systems | 2011

Architecture-Aware Mapping and Optimization on a 1600-Core GPU

Mayank Daga; Thomas R. W. Scogland; Wu-chun Feng

The graphics processing unit (GPU) continues to make inroads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-dimensional problem that requires deep technical knowledge of GPU architecture. Although substantial literature exists on how to map and optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on an AMD GPU, such as the 1600-core AMD Radeon HD 5870 GPU. Consequently, we present and evaluate architecture-aware mapping and optimizations for the AMD GPU. The most prominent of these include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU mapping and optimizations by applying each in isolation as well as in concert to a large-scale, molecular modeling application called GEM. Via these AMD-specific GPU optimizations, our optimized OpenCL implementation on an AMD Radeon HD 5870 delivers more than a four-fold improvement in performance over the basic OpenCL implementation. In addition, it outperforms our optimized CUDA version on an NVIDIA GTX280 by 12%. Overall, we achieve a speedup of 371-fold over a serial but hand-tuned SSE version of our molecular modeling application, and in turn, a 46-fold speedup over an ideal scaling on an 8-core CPU.

international conference on performance engineering | 2014

A power-measurement methodology for large-scale, high-performance computing

Thomas R. W. Scogland; Craig P. Steffen; Torsten Wilde; Florent Parent; Susan Coghlan; Natalie J. Bates; Wu-chun Feng; Erich Strohmaier

Improvement in the energy efficiency of supercomputers can be accelerated by improving the quality and comparability of efficiency measurements. The ability to generate accurate measurements at extreme scale is just now emerging. The realization of system-level measurement capabilities can be accelerated with a commonly adopted, high-quality measurement methodology for use while running a workload, typically a benchmark. This paper describes a methodology that has been developed collaboratively through the Energy Efficient HPC Working Group to support architectural analysis and comparative measurements for rankings, such as the Top500 and Green500. To support measurements with varying amounts of effort and equipment required, we present three distinct levels of measurement, which provide increasing levels of accuracy. Level 1 is similar to the Green500 run rules today: a single average power measurement extrapolated from a subset of a machine. Level 2 is more comprehensive, but still widely achievable. Level 3 is the most rigorous of the three methodologies but is only possible at a few sites. However, the Level 3 methodology generates a high-quality result that exposes details that the other methodologies may miss. In addition, we present case studies from the Leibniz Supercomputing Centre (LRZ), Argonne National Laboratory (ANL) and Calcul Québec Université Laval that explore the benefits and difficulties of gathering high-quality, system-level measurements on large-scale machines.
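
The Level 1 idea, a single average power figure extrapolated from a subset of the machine, can be sketched as follows. This is a simplified illustration and not the methodology's actual run rules: it assumes that all nodes draw similar power, so the subset average scales linearly with node count. The function name and the sample values are hypothetical.

```python
def extrapolate_average_power(subset_samples_watts, subset_nodes, total_nodes):
    """Estimate whole-machine average power from samples taken on a subset of nodes.

    Assumption: power scales linearly with node count (homogeneous nodes).
    """
    if not subset_samples_watts:
        raise ValueError("at least one power sample is required")
    if subset_nodes <= 0 or total_nodes < subset_nodes:
        raise ValueError("need 0 < subset_nodes <= total_nodes")
    subset_average = sum(subset_samples_watts) / len(subset_samples_watts)
    per_node_average = subset_average / subset_nodes
    return per_node_average * total_nodes


# Invented example: three readings from a 4-node subset of a 64-node machine.
estimate = extrapolate_average_power([1000.0, 1200.0, 1100.0], subset_nodes=4, total_nodes=64)
print(f"estimated machine power: {estimate:.0f} W")
```

The linear-scaling assumption is exactly what a more rigorous level would relax, for example by measuring a larger fraction of the machine.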

Computer Science - Research and Development | 2010

A first look at integrated GPUs for green high-performance computing

Thomas R. W. Scogland; Heshan Lin; Wu-chun Feng

The graphics processing unit (GPU) has evolved from a single-purpose graphics accelerator to a tool that can greatly accelerate the performance of high-performance computing (HPC) applications. Previous studies have shown that discrete GPUs, while energy efficient for compute-intensive scientific applications, consume very high power. In fact, a compute-capable discrete GPU can draw more than 200 watts by itself, which can be as much as an entire compute node (without a GPU). This massive power draw presents a serious roadblock to the adoption of GPUs in low-power environments, such as embedded systems. Even when being considered for data centers, the power draw of a GPU presents a problem as it increases the demand placed on support infrastructure such as cooling and available supplies of power, driving up cost. With the advent of compute-capable integrated GPUs with power consumption in the tens of watts, we believe it is time to re-evaluate the notion of GPUs being power-hungry. In this paper, we present the first evaluation of the energy efficiency of integrated GPUs for green HPC. We make use of four specific workloads, each representative of a different computational dwarf, and evaluate them across three different platforms: a multicore system, a high-performance discrete GPU, and a low-power integrated GPU. We find that the integrated GPU delivers superior energy savings and a comparable energy-delay product (EDP) when compared to its discrete counterpart, and it can still outperform the CPUs of a multicore system at a fraction of the power.
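
For readers unfamiliar with the energy-delay product, the sketch below shows how it is commonly computed: energy is average power times runtime, and EDP multiplies that energy by the runtime again. The power and runtime values are invented for illustration and are not measurements from the paper.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed over a run: average power times runtime."""
    return avg_power_watts * runtime_seconds


def energy_delay_product(avg_power_watts: float, runtime_seconds: float) -> float:
    """EDP = energy * delay, which penalizes slow runs as well as power-hungry ones."""
    return energy_joules(avg_power_watts, runtime_seconds) * runtime_seconds


# Hypothetical platforms: (average power in watts, runtime in seconds).
platforms = {
    "discrete GPU": (200.0, 10.0),
    "integrated GPU": (20.0, 40.0),
}

for name, (power, runtime) in platforms.items():
    print(
        f"{name}: energy = {energy_joules(power, runtime):.0f} J, "
        f"EDP = {energy_delay_product(power, runtime):.0f} J*s"
    )
```

With these made-up numbers the slower, lower-power device uses less energy but has the higher EDP, which is why the two metrics are reported separately.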

international parallel and distributed processing symposium | 2009

Multi-dimensional characterization of temporal data mining on graphics processors

Jeremy S. Archuleta; Yong Cao; Thomas R. W. Scogland; Wu-chun Feng

Through the algorithmic design patterns of data parallelism and task parallelism, the graphics processing unit (GPU) offers the potential to vastly accelerate discovery and innovation across a multitude of disciplines. For example, the exponential growth in data volume now presents an obstacle for high-throughput data mining in fields such as neuroscience and bioinformatics. As such, we present a characterization of a MapReduce-based data-mining application on a general-purpose GPU (GPGPU). Using neuroscience as the application vehicle, the results of our multi-dimensional performance evaluation show that a “one-size-fits-all” approach maps poorly across different GPGPU cards. Rather, a high-performance implementation on the GPGPU should factor in the 1) problem size, 2) type of GPU, 3) type of algorithm, and 4) data-access method when determining the type and level of parallelism. To guide the GPGPU programmer towards optimal performance within such a broad design space, we provide eight general performance characterizations of our data-mining application.

international conference on supercomputing | 2014

CoreTSAR: Adaptive Worksharing for Heterogeneous Systems

Thomas R. W. Scogland; Wu-chun Feng; Barry Rountree; Bronis R. de Supinski

The popularity of heterogeneous computing continues to increase rapidly due to the high peak performance, favorable energy efficiency, and comparatively low cost of accelerators. However, heterogeneous programming models still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP models, including OpenMP 4.0 and OpenACC, ease the migration of code from CPUs to GPUs but lack much of OpenMP's flexibility: OpenMP applications can run on any number of CPUs without extra user effort, but GPU implementations do not offer similar adaptive worksharing across GPUs in a node, nor do they employ a mix of CPUs and GPUs. To address these shortcomings, we present CoreTSAR, our library for scheduling cores via a task-size adapting runtime system by supporting worksharing of loop nests across arbitrary heterogeneous resources. Beyond scheduling the computational load across devices, CoreTSAR includes a memory-management system that operates based on task association, enabling the runtime to dynamically manage memory movement and task granularity. Our evaluation shows that CoreTSAR can provide nearly linear scaling to four GPUs and all cores in a node without modifying the code within the parallel region. Furthermore, CoreTSAR provides portable performance across a variety of system configurations.
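
The general idea of adapting task sizes to device speed can be illustrated with a small sketch. This is not CoreTSAR's algorithm or API: it is a generic proportional split that assumes each device's throughput is estimated from the previous pass, and all names and timings are hypothetical.

```python
def split_iterations(total, rates):
    """Divide `total` loop iterations among devices in proportion to `rates`."""
    if total < 0 or not rates or any(r <= 0 for r in rates):
        raise ValueError("total must be >= 0 and every rate must be positive")
    rate_sum = sum(rates)
    exact = [total * r / rate_sum for r in rates]
    chunks = [int(x) for x in exact]  # round each share down
    leftover = total - sum(chunks)
    # Hand the remaining iterations to the devices with the largest fractional share.
    by_fraction = sorted(range(len(rates)), key=lambda i: exact[i] - chunks[i], reverse=True)
    for i in by_fraction[:leftover]:
        chunks[i] += 1
    return chunks


def update_rates(previous_rates, chunks, elapsed_seconds):
    """Re-estimate throughput (iterations per second) from the last pass.

    A device that received no work, or has no valid timing, keeps its previous rate.
    """
    return [
        c / t if c > 0 and t > 0 else prev
        for prev, c, t in zip(previous_rates, chunks, elapsed_seconds)
    ]


# Hypothetical two-device example: start with an even split, then adapt.
rates = [1.0, 1.0]                       # initial guess: CPU and GPU equally fast
first = split_iterations(1000, rates)    # even split on the first pass
rates = update_rates(rates, first, [2.0, 0.5])  # invented timings in seconds
second = split_iterations(1000, rates)   # faster device now gets more work
print(first, second)
```

In this made-up run the second device finishes its half four times faster, so the next pass assigns it four times as many iterations. A real runtime would also have to account for data movement, which this sketch ignores.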