Publications


Featured research published by Shuaiwen Song.


IEEE Transactions on Parallel and Distributed Systems | 2010

PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications

Rong Ge; Xizhou Feng; Shuaiwen Song; Hung-Ching Chang; Dong Li; Kirk W. Cameron

Energy efficiency is a major concern in modern high-performance computing system design. In the past few years, there has been mounting evidence that power usage limits system scale and computing density, and thus, ultimately, system performance. However, despite the impact of power and energy on the computer systems community, few studies provide insight into where and how power is consumed on high-performance systems and applications. In previous work, we designed a framework called PowerPack that was the first tool to isolate the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and correlate these measurements to application functions. In this work, we extend our framework to support systems with multicore, multiprocessor-based nodes, and then provide in-depth analyses of the energy consumption of parallel applications on clusters of these systems. These analyses include the impacts of chip multiprocessing on power and energy efficiency, and its interaction with application executions. In addition, we use PowerPack to study the power dynamics and energy efficiencies of dynamic voltage and frequency scaling (DVFS) techniques on clusters. Our experiments reveal conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance.
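
The central idea of attributing per-component power to application functions can be illustrated with a small sketch: sample power channels for each component while the application marks function boundaries, then integrate power over each function's window. This is a minimal illustration under assumed interfaces, not the PowerPack tooling; read_power_sample, mark, and the component names are hypothetical placeholders.

```python
# Minimal sketch of PowerPack-style attribution: correlate per-component
# power samples (CPU, memory, disk, NIC) with application function markers.
# All names here (read_power_sample, mark, COMPONENTS) are hypothetical.
import time

COMPONENTS = ["cpu", "memory", "disk", "nic"]

def read_power_sample():
    """Hypothetical: return {component: watts} from external power meters."""
    return {c: 0.0 for c in COMPONENTS}  # replace with a real meter readout

samples = []   # (timestamp, {component: watts})
markers = []   # (timestamp, "enter"/"exit", function_name)

def mark(event, func_name):
    markers.append((time.time(), event, func_name))

def sampler_loop(period_s=0.01, duration_s=1.0):
    end = time.time() + duration_s
    while time.time() < end:
        samples.append((time.time(), read_power_sample()))
        time.sleep(period_s)

def energy_by_function():
    """Integrate power over each function's [enter, exit) window per component."""
    energy = {}
    stack = []
    for ts, event, name in sorted(markers):
        if event == "enter":
            stack.append((name, ts))
        else:
            fname, t0 = stack.pop()
            window = [(t, p) for t, p in samples if t0 <= t < ts]
            for i in range(1, len(window)):
                dt = window[i][0] - window[i - 1][0]
                for c in COMPONENTS:
                    energy.setdefault(fname, {}).setdefault(c, 0.0)
                    energy[fname][c] += window[i - 1][1][c] * dt
    return energy
```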


International Parallel and Distributed Processing Symposium | 2013

A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures

Shuaiwen Song; Chun-Yi Su; Barry Rountree; Kirk W. Cameron

Emergent heterogeneous systems must be optimized for both power and performance at exascale. Massive parallelism combined with complex memory hierarchies form a barrier to efficient application and architecture design. These challenges are exacerbated with GPUs as parallelism increases by orders of magnitude and power consumption can easily double. Models have been proposed to isolate power and performance bottlenecks and identify their root causes. However, no current models combine simplicity, accuracy, and support for emergent GPU architectures (e.g., NVIDIA Fermi). We combine hardware performance counter data with machine learning and advanced analytics to model power-performance efficiency for modern GPU-based systems. Our performance counter based approach is simpler than previous approaches and does not require detailed understanding of the underlying architecture. The resulting model is accurate for predicting power (within 2.1%) and performance (within 6.7%) for application kernels on modern GPUs. Our model can identify power-performance bottlenecks and their root causes for various complex computation and memory access patterns (e.g., global, shared, texture). We measure the accuracy of our power and performance models on an NVIDIA Fermi C2075 GPU for more than a dozen CUDA applications. We show our power model is more accurate and robust than the best available GPU power models, the multiple linear regression models MLR and MLR+. We demonstrate how to use our models to identify power-performance bottlenecks and suggest optimization strategies for high-performance codes such as GEM, a biomolecular electrostatic analysis application. We verify that our power-performance model is accurate on clusters of NVIDIA Fermi M2090s and useful for suggesting optimal runtime configurations on the Keeneland supercomputer at Georgia Tech.
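
As a rough illustration of the counter-driven idea, the sketch below fits a plain multiple-linear-regression baseline (the kind of MLR model the paper compares against, not the authors' model) from GPU hardware-counter features to measured kernel power. The counter names and data are fabricated placeholders; with a mild nonlinear interaction in the data, a linear fit leaves residual error, which is the kind of gap a more expressive model targets.

```python
# Sketch: predict GPU kernel power from hardware performance counters using
# a stand-in multiple linear regression (MLR) baseline. Features and data
# are fake placeholders, not measurements from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical normalized per-kernel counters: instructions, global loads,
# shared-memory accesses, texture accesses, achieved occupancy.
X = rng.random((50, 5))
# Fake "measured" power with a mild nonlinearity that a linear model misses.
y = 90 + 60 * X[:, 0] + 25 * X[:, 1] * X[:, 4] + 5 * rng.random(50)

mlr = LinearRegression().fit(X, y)
pred = mlr.predict(X)
print("mean abs. error (watts):", np.mean(np.abs(pred - y)))
```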


International Parallel and Distributed Processing Symposium | 2011

Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation

Shuaiwen Song; Chun-Yi Su; Rong Ge; Abhinav Vishnu; Kirk W. Cameron

Future large-scale high-performance supercomputer systems require high energy efficiency to achieve exaflop computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate energy efficiency at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate, and predict the energy-performance of data-intensive parallel applications with various execution patterns running on large-scale power-aware clusters. Our analytical model can help users explore the effects of machine- and application-dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g., processor count, CPU power/frequency, workload size, and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making.
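
To make the shape of such a model concrete, here is a generic iso-energy-efficiency style formulation: write total energy as compute, communication, and idle terms, define efficiency as useful work per joule, and ask how the workload must scale to hold efficiency constant as processor count and frequency change. This is an illustrative sketch only; the power and time terms below are assumed notation, not the paper's actual model.

```latex
% Illustrative iso-energy-efficiency style formulation (assumed notation).
\begin{align*}
E_{\text{total}}(p, f, W) &= P_{\text{comp}}(f)\, T_{\text{comp}}(W, p, f)
  + P_{\text{net}}\, T_{\text{comm}}(W, p)
  + P_{\text{idle}}\, T_{\text{total}}(W, p, f) \\
EE(p, f, W) &= \frac{W}{E_{\text{total}}(p, f, W)} \\
\text{Iso-energy-efficiency:}\quad EE(p, f, W) &= \text{const.}
  \;\Rightarrow\; W \text{ must grow with } p \text{ and } f \text{ along a constraint curve.}
\end{align*}
```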


Green Computing and Communications | 2010

Designing Energy Efficient Communication Runtime Systems for Data Centric Programming Models

Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji

The insatiable demand for high-performance computing is being driven by the most computationally intensive applications, such as computational chemistry, climate modeling, and nuclear physics. The last couple of decades have seen a tremendous rise in supercomputers with architectures ranging from traditional clusters to system-on-a-chip designs in the push to break the petaflop computing barrier. However, with the advent of petaflop-plus computing, we have ushered in an era where a power-efficient system software stack is imperative for execution on exascale systems and beyond. At the same time, computationally intensive applications are exploring programming models beyond traditional message passing, such as a combination of Partitioned Global Address Space (PGAS) languages and libraries that provide a one-sided communication paradigm with put, get, and accumulate primitives. To support the PGAS models, it is critical to design power-efficient and high-performance one-sided communication runtime systems. In this paper, we design and implement PASCoL, a high-performance, power-aware one-sided communication library using the Aggregate Remote Memory Copy Interface (ARMCI), the communication runtime system of Global Arrays. For the various communication primitives provided by ARMCI, we study the impact of Dynamic Voltage/Frequency Scaling (DVFS) and a combination of interrupt-based (blocking) and polling-based mechanisms provided by most modern interconnects. We implement our design and evaluate it with synthetic benchmarks on an InfiniBand cluster. Our results indicate that PASCoL can achieve a significant reduction in energy consumed per byte transferred without additional penalty for various one-sided communication primitives, message sizes, and data transfer patterns.
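
The core mechanism, lowering the CPU frequency while a one-sided transfer is in flight and the CPU would otherwise wait, can be sketched as below. The DVFS path shown uses the Linux cpufreq sysfs interface with the userspace governor (requires appropriate permissions); the wait_for_transfer callback is a hypothetical stand-in, not the ARMCI or PASCoL API.

```python
# Sketch: drop CPU frequency during communication slack, restore afterwards.
# Assumes the Linux "userspace" cpufreq governor and write permission on the
# sysfs file; wait_for_transfer() is a hypothetical completion callback.
CPUFREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed"

def set_frequency_khz(freq_khz):
    with open(CPUFREQ, "w") as f:
        f.write(str(freq_khz))

def communicate_with_dvfs(wait_for_transfer, low_khz=1_200_000, high_khz=2_400_000):
    set_frequency_khz(low_khz)       # enter slack: CPU mostly waits on the NIC
    try:
        wait_for_transfer()          # blocking (interrupt-driven) completion
    finally:
        set_frequency_khz(high_khz)  # restore full speed for computation
```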


The Journal of Supercomputing | 2013

Designing energy efficient communication runtime systems: a view from PGAS models

Abhinav Vishnu; Shuaiwen Song; Andres Marquez; Kevin J. Barker; Darren J. Kerbyson; Kirk W. Cameron; Pavan Balaji

As the march to exascale computing gains momentum, the energy consumption of supercomputers has emerged as the critical roadblock. While architectural innovations are imperative in achieving computing of this scale, it is largely up to the systems software to leverage these architectural innovations. Parallel applications in many computationally intensive domains have been designed to leverage these supercomputers, with legacy two-sided communication semantics using the Message Passing Interface. At the same time, Partitioned Global Address Space (PGAS) models are being designed which provide global address space abstractions and one-sided communication for exploiting data locality and communication optimizations. PGAS models rely on one-sided communication runtime systems to leverage high-speed networks and achieve the best possible performance. In this paper, we present a design for a Power Aware One-Sided Communication Library (PASCoL). The proposed design detects communication slack and leverages Dynamic Voltage and Frequency Scaling (DVFS) and interrupt-driven execution to exploit the detected slack for energy efficiency. We implement our design and evaluate it using synthetic benchmarks for the one-sided communication primitives Put, Get, and Accumulate and for uniformly noncontiguous data transfers. Our performance evaluation indicates that we can achieve a significant reduction in energy consumption without performance loss for multiple one-sided communication primitives. The achieved results are close to the theoretical peak available with the experimental test bed.
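
Slack detection, the other half of the approach, amounts to timing how long each transfer actually waits and only paying the cost of a frequency transition when the wait is long enough to amortize it. The sketch below shows that decision under assumed thresholds; block_until_complete is a hypothetical stand-in for the runtime's blocking completion call, not the actual PASCoL or ARMCI interface.

```python
# Sketch: measure communication slack and decide whether DVFS is worthwhile.
# Threshold and completion primitive are hypothetical placeholders.
import time

SLACK_THRESHOLD_S = 1e-4   # below this, frequency-switch overhead dominates

def timed_wait(block_until_complete):
    t0 = time.perf_counter()
    block_until_complete()             # interrupt-driven (blocking) completion
    return time.perf_counter() - t0    # observed slack for this transfer

def should_downscale(recent_slacks):
    # Downscale only if recent transfers consistently show enough slack to
    # amortize the cost of a frequency transition.
    return len(recent_slacks) >= 3 and min(recent_slacks) > SLACK_THRESHOLD_S
```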


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

Fault-tolerant communication runtime support for data-centric programming models

Abhinav Vishnu; Hubertus J. J. van Dam; Wibe A. de Jong; Pavan Balaji; Shuaiwen Song

The largest supercomputers in the world today consist of hundreds of thousands of processing cores and many more other hardware components. At such scales, hardware faults are commonplace, necessitating fault-resilient software systems. While different fault-resilience models are available, most focus on allowing the computational processes to survive faults. On the other hand, we have recently started investigating fault-resilience techniques for data-centric programming models such as the partitioned global address space (PGAS) models. The primary difference in data-centric models is the decoupling of computation and data locality. That is, data placement is decoupled from the executing processes, allowing us to view process failure (a physical node hosting a process is dead) separately from data failure (a physical node hosting data is dead). In this paper, we take a first step toward data-centric fault resilience by designing and implementing a fault-resilient, one-sided communication runtime framework using Global Arrays and its communication system, ARMCI. The framework consists of a fault-resilient process manager; a low-overhead, network-assisted remote-node fault detection module; non-data-moving collective communication primitives; and failure semantics and error codes for one-sided communication runtime systems. Our performance evaluation indicates that the framework incurs little overhead compared to state-of-the-art designs and provides a fundamental framework of fault resiliency for PGAS models.
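
The decoupling of process failure from data failure can be pictured as two independent recovery paths triggered by the same node-failure event. The sketch below is only an illustration of that separation under assumed bookkeeping; every helper name is hypothetical and none of it reflects the actual Global Arrays or ARMCI interfaces.

```python
# Sketch of the decoupling idea: a node failure is handled differently
# depending on whether the node hosted executing processes, data shards,
# or both. All helper names below are hypothetical placeholders.
def on_node_failure(node, process_map, data_map):
    if node in process_map:
        # Process failure: restart or exclude the failed ranks; surviving
        # processes keep running because data placement is decoupled.
        respawn_or_exclude(process_map[node])
    if node in data_map:
        # Data failure: shards hosted on this node are lost; recover them
        # from replicas or recompute, without moving unrelated data.
        for shard in data_map[node]:
            restore_from_replica(shard)

def respawn_or_exclude(ranks):        # hypothetical recovery hook
    pass

def restore_from_replica(shard):      # hypothetical recovery hook
    pass
```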


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2012

Energy-Aware Replica Selection for Data-Intensive Services in Cloud

Bo Li; Shuaiwen Song; Ivona Bezáková; Kirk W. Cameron

With the increasing energy cost in data centers, an energy-efficient approach to providing data-intensive services in the cloud is in high demand. This paper addresses the energy cost reduction problem of data centers by formulating an energy-aware replica selection problem to guide the distribution of workload among data centers. The currently popular centralized replica selection approaches address this problem, but they lack scalability and are vulnerable to a crash of the central coordinator. They also do not take total data center energy cost as the primary optimization target. We propose a simple decentralized replica selection system, implemented with two distributed optimization algorithms (the consensus-based distributed projected subgradient method and the Lagrangian dual decomposition method), that works with clients as a decentralized coordinator. We also compare our energy-aware replica selection approach with replica selection using a round-robin algorithm. A prototype of the decentralized replica selection system is designed and developed to collect energy consumption information from data centers. The results show that the total energy cost can be effectively reduced by using our decentralized replica selection system compared with a round-robin method. The system also has low computation and communication overhead and can be easily adapted to real-world cloud environments.
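
For intuition, one round of a consensus-based distributed projected subgradient method has each client average its decision with its neighbors, step against a subgradient of its local (energy) cost, and project back onto the feasible set. The sketch below shows that generic structure; the cost function, mixing weights, and the simplex feasible set (fractions of requests routed to each replica) are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of one round of the consensus-based distributed projected
# subgradient method for decentralized replica selection.
# Costs, mixing weights, and the feasible set are illustrative placeholders.
import numpy as np

def project_to_simplex(x):
    """Project onto {x >= 0, sum(x) = 1}: fractions of requests per replica."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(x)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(x + theta, 0)

def consensus_subgradient_step(x, W, subgradients, alpha):
    """x: (n_clients, n_replicas) local decisions; W: doubly stochastic
    mixing matrix over the client communication graph; subgradients(x)
    returns a subgradient of each client's local energy cost."""
    mixed = W @ x                                   # consensus with neighbors
    stepped = mixed - alpha * subgradients(mixed)   # move against energy cost
    return np.vstack([project_to_simplex(row) for row in stepped])
```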


International Conference on Parallel Architectures and Compilation Techniques | 2012

System-level power-performance efficiency modeling for emergent GPU architectures

Shuaiwen Song; Kirk W. Cameron

System and application design choices have a profound impact on energy efficiency. Consider the rise of GPUs in emergent supercomputer design. Massive parallelism and high throughput offer the potential for improving performance. However, for systems with multiple GPUs, power usage can easily triple. Therefore, we need to quantitatively understand power, performance, and their interactive effects to determine when the increased power use of a GPU is warranted. In order to help application and system designers find power-performance bottlenecks and their causes, we need accurate models that balance abstraction and detail on emergent GPU architectures. The best available GPU power models are statistical- or simulation-based. Related statistical models use multiple linear regression (MLR) and hardware counter data to predict GPU power consumption in isolation. These models provide a good balance of abstraction and application/system detail but potentially ignore nonlinear relationships between power and performance. This can lead to inaccuracy for individual kernel power prediction. Related simulation models use emulated performance data combined with detailed architectural information to predict integrated GPU power-performance efficiency. These models provide enough detail to explore architectural design but ignore application/system-level information and require architectural schematics and expertise. Furthermore, both statistical- and simulation-based models require significant adaptation to support modern, complex GPU architectures with true cache hierarchies. Therefore, we propose separate system-level power and performance models of modern GPU-based systems that can be integrated to better understand the relationships between power and performance at scale. Our runtime hardware-counter-driven approach combines an analytical performance model for GPU-based systems with a machine-learning-based artificial neural network (ANN) power model.
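
To contrast with the linear MLR baseline sketched earlier, the snippet below fits a small neural network from counter-style features to power, which can capture the nonlinear interactions an MLR misses. It is only a stand-in illustration of the ANN half of such a model; the features, data, and network size are fabricated placeholders, not the paper's configuration.

```python
# Sketch: fit a small ANN from counter features to measured power.
# Features, data, and hyperparameters are placeholders.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((200, 6))                                   # fake counter features
y = 80 + 120 * X[:, 0] * X[:, 3] + 10 * rng.random(200)    # fake nonlinear power

ann = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=5000,
                                 random_state=0))
ann.fit(X, y)
print("training mean abs. error (watts):", np.mean(np.abs(ann.predict(X) - y)))
```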


Parallel Processing Letters | 2014

Extending PowerPack for Profiling and Analysis of High Performance Accelerator-Based Systems

Bo Li; Hung-Ching Chang; Shuaiwen Song; Chun-Yi Su; Timmy Meyer; John Mooring; Kirk W. Cameron

Accelerators offer a substantial increase in efficiency for high-performance systems, providing speedups for computational applications that leverage hardware support for highly parallel codes. However, the power use of some accelerators exceeds 200 watts at idle, which means their use at exascale comes with a significant increase in power at a time when we face a power ceiling of about 20 megawatts. Despite the growing domination of accelerator-based systems in the Top500 and Green500 lists of the fastest and most efficient supercomputers, there are few detailed studies comparing the power and energy use of common accelerators. In this work, we conduct detailed experimental studies of the power usage and distribution of Xeon-Phi-based systems in comparison to the NVIDIA Tesla and an Intel Sandy Bridge multicore host processor. In contrast to previous work, we focus on separating individual component power and correlating power use to code behavior. Our results help explain the causes of power-performance scalability for a set of HPC applications.
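
As a rough idea of component-level measurement around a code region on the host side, the sketch below differences Linux RAPL powercap energy counters (CPU package vs. DRAM domains) before and after a region. Domain paths vary by system and reading them may require elevated permissions; accelerator power (Xeon Phi, Tesla) would come from vendor tools instead. This is an illustration only, not the PowerPack instrumentation.

```python
# Sketch: per-component host energy around a code region via Linux RAPL.
# Domain paths are system-dependent examples; counter wraparound is ignored.
import time

RAPL_DOMAINS = {
    "package": "/sys/class/powercap/intel-rapl:0/energy_uj",
    "dram":    "/sys/class/powercap/intel-rapl:0:0/energy_uj",  # may differ
}

def read_energy_uj():
    return {name: int(open(path).read()) for name, path in RAPL_DOMAINS.items()}

def measure_region(fn):
    before, t0 = read_energy_uj(), time.time()
    fn()
    after, dt = read_energy_uj(), time.time() - t0
    return {name: (after[name] - before[name]) / 1e6 / dt   # average watts
            for name in RAPL_DOMAINS}
```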


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Energy Profiling and Analysis of the HPC Challenge Benchmarks

Shuaiwen Song; Rong Ge; Xizhou Feng; Kirk W. Cameron

Collaboration


Dive into Shuaiwen Song's collaborations.

Top Co-Authors

Abhinav Vishnu, Pacific Northwest National Laboratory
Andres Marquez, Pacific Northwest National Laboratory
Pavan Balaji, Argonne National Laboratory
Rong Ge, Marquette University
Darren J. Kerbyson, Pacific Northwest National Laboratory
Kevin J. Barker, Pacific Northwest National Laboratory