

Publication


Featured research published by Shuaiwen Leon Song.


International Conference on Supercomputing | 2015

Locality-Driven Dynamic GPU Cache Bypassing

Chao Li; Shuaiwen Leon Song; Hongwen Dai; Albert Sidelnik; Siva Kumar Sastry Hari; Huiyang Zhou

This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for many applications contain a significant amount of requests with low reuse, which greatly reduce the cache efficacy. Existing GPU cache management schemes are either based on conditional/reactive solutions or hit-rate based designs specifically developed for CPU last level caches, which can limit overall performance. To overcome these challenges, we propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. Specifically, we present a design that integrates locality filtering based on reuse characteristics of GPU workloads into the decoupled tag store of the existing L1 D-cache through simple and cost-effective hardware extensions. Results show that our proposed design can dramatically reduce cache contention and achieve up to 56.8% and an average of 30.3% performance improvement over the baseline architecture, for a range of highly-optimized cache-unfriendly applications with minor area overhead and better energy efficiency. Our design also significantly outperforms the state-of-the-art CPU and GPU bypassing schemes (especially for irregular applications), without generating extra L2 and DRAM level contention.
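The core mechanism, admitting a line into the L1 only after its tag has demonstrated reuse, can be illustrated with a toy simulation. The class, thresholds, and access stream below are illustrative assumptions, not the paper's hardware design:

```python
from collections import OrderedDict

class BypassingCache:
    """Toy LRU cache with a reuse-based insertion filter.

    A decoupled 'tag store' remembers recently observed addresses; a line
    is inserted into the data store only after its tag has been seen
    before (i.e., it demonstrated reuse). Low-reuse streaming accesses
    bypass the cache instead of thrashing it.
    """
    def __init__(self, capacity, tag_capacity):
        self.capacity = capacity
        self.tag_capacity = tag_capacity
        self.data = OrderedDict()   # cached lines, LRU order
        self.tags = OrderedDict()   # recently observed tags
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.data:
            self.data.move_to_end(addr)
            self.hits += 1
            return
        self.misses += 1
        if addr in self.tags:       # reuse observed: admit to the cache
            self.data[addr] = True
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)
        else:                       # first touch: bypass, record the tag
            self.tags[addr] = True
            if len(self.tags) > self.tag_capacity:
                self.tags.popitem(last=False)

# A stream mixing a small hot working set with a long streaming scan.
hot = [0, 1, 2, 3]
stream = [a for i in range(100) for a in (hot + [1000 + i])]
c = BypassingCache(capacity=4, tag_capacity=64)
for a in stream:
    c.access(a)
print(c.hits, c.misses)  # prints 392 108
```

A plain 4-entry LRU cache would thrash to a 0% hit rate on this stream; the filter keeps the hot set resident by refusing to insert the single-use streaming addresses.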


International Parallel and Distributed Processing Symposium | 2014

MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures

Yang You; Shuaiwen Leon Song; Haohuan Fu; Andres Marquez; Maryam Mehri Dehnavi; Kevin J. Barker; Kirk W. Cameron; Amanda Randles; Guangwen Yang

Support Vector Machines (SVM) have been widely used in data-mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In recent years, SVM has been adapted to High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford offline model training, to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies that can make runtime training possible, but they also form a barrier to efficient parallel SVM design. To address these challenges, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as Intel Ivy Bridge CPUs and the Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures; these also serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups over the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM running on a top-of-the-line NVIDIA K20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison focusing on Ivy Bridge CPUs, MIC, and GPUs, and provide insights on how to select the most suitable architecture for specific algorithms and input data patterns.


International Parallel and Distributed Processing Symposium | 2014

The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications

Bo Li; Hung-Ching Chang; Shuaiwen Leon Song; Chun-Yi Su; Timmy Meyer; John Mooring; Kirk W. Cameron

Accelerators are used in about 13% of the current Top500 List. Supercomputers leveraging accelerators grew by a factor of 2.2x in 2012 and are expected to completely dominate the Top500 by 2015. Though most of these deployments use NVIDIA GPGPU accelerators, Intel's Xeon Phi architecture will likely grow in popularity in the coming years. Unfortunately, there are few studies analyzing the performance and energy efficiency of systems leveraging the Intel Xeon Phi. We extend our systemic measurement methodology to isolate system power by component, including accelerators. We use this methodology to present a detailed study of the performance-energy tradeoffs of the Xeon Phi architecture. We demonstrate the portability of our approach by comparing our Xeon Phi results to the Intel multicore Sandy Bridge host processor and the NVIDIA Tesla GPU for a wide range of HPC applications. Our results help explain limitations in the power-performance scalability of HPC applications on the current Intel Xeon Phi architecture.


IEEE International Conference on High Performance Computing, Data and Analytics | 2015

GraphReduce: processing large-scale graphs on accelerator-based systems

Dipanjan Sengupta; Shuaiwen Leon Song; Kapil Agarwal; Karsten Schwan

Recent work on real-world graph analytics has sought to leverage the massive amount of parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and limitations in GPU-resident memory for storing large graphs. We present GraphReduce, a highly efficient and scalable GPU-based framework that operates on graphs that exceed the device's internal memory capacity. GraphReduce adopts a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model and operates on multiple asynchronous GPU streams to fully exploit the high degrees of parallelism in GPUs with efficient graph data movement between the host and device. GraphReduce-based programming is performed via device functions that include gatherMap, gatherReduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Extensive experimental evaluations for a wide variety of graph inputs and algorithms demonstrate that GraphReduce significantly outperforms other competing out-of-memory approaches.
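The Gather-Apply-Scatter flow behind those device functions can be sketched in a minimal host-side Python analogue. The function names mirror those listed in the abstract, but the signatures and the BFS-style example are illustrative assumptions, not GraphReduce's actual API:

```python
# Minimal host-side sketch of the Gather-Apply-Scatter (GAS) flow.
def gas_step(edges, state, gather_map, gather_reduce, apply_fn,
             scatter_fn, frontier):
    # Gather: each active edge produces a partial value for its destination.
    partials = {}
    for (src, dst) in edges:
        if src in frontier:
            v = gather_map(src, dst, state)
            partials[dst] = gather_reduce(partials.get(dst), v)
    # Apply: fold the reduced value into each destination vertex's state.
    new_frontier = set()
    for dst, v in partials.items():
        changed = apply_fn(dst, v, state)
        # Scatter: activate changed vertices for the next superstep.
        if changed and scatter_fn(dst, state):
            new_frontier.add(dst)
    return new_frontier

# BFS depth labeling expressed as a GAS computation.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
state = {v: float("inf") for v in range(5)}
state[0] = 0
frontier = {0}
while frontier:
    frontier = gas_step(
        edges, state,
        gather_map=lambda s, d, st: st[s] + 1,
        gather_reduce=lambda acc, v: v if acc is None else min(acc, v),
        apply_fn=lambda d, v, st:
            (st.__setitem__(d, v) or True) if v < st[d] else False,
        scatter_fn=lambda d, st: True,
        frontier=frontier,
    )
print(state)  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The real framework partitions the edge list into device-sized chunks and overlaps their transfer and processing on asynchronous streams; this sketch only shows the per-superstep programming model.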


International Workshop on Energy Efficient Supercomputing | 2013

Unified performance and power modeling of scientific workloads

Shuaiwen Leon Song; Kevin J. Barker; Darren J. Kerbyson

It is expected that scientific applications executing on future large-scale HPC systems must be optimized not only in terms of performance, but also in terms of power consumption. As power and energy become increasingly constrained resources, researchers and developers must have access to tools that allow for accurate prediction of both performance and power consumption. Reasoning about performance and power consumption in concert will be critical for achieving maximum utilization of limited resources on future HPC systems. To this end, we present a unified performance and power model for the Nek-Bone mini-application developed as part of the DOE's CESAR Exascale Co-Design Center. Our models consider the impact of computation, point-to-point communication, and collective communication individually, and quantitatively predict their impact on both performance and energy efficiency. Further, these models are demonstrated to be accurate on currently available HPC system architectures. In this paper, we present our modeling methodology and the performance and power models for the Nek-Bone mini-application, along with validation results that indicate the accuracy of these models.
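The general shape of such a model, separate terms for computation, point-to-point, and collective communication, each charged at a phase-specific power draw, can be sketched as follows. All coefficients are made-up placeholders for illustration, not the paper's calibrated values for Nek-Bone:

```python
import math

def predict(n_ranks, work_per_rank, msg_bytes, msgs_per_rank,
            t_flop=1e-9, bw=5e9, lat=2e-6, p_comp=120.0, p_comm=80.0):
    """Toy unified model: time and energy from three separate phases."""
    t_comp = work_per_rank * t_flop                 # computation time (s)
    per_msg = lat + msg_bytes / bw                  # latency + transfer
    t_p2p = msgs_per_rank * per_msg                 # point-to-point phase
    t_coll = math.log2(max(n_ranks, 2)) * per_msg   # tree-style allreduce
    t_total = t_comp + t_p2p + t_coll
    # Energy: each phase charged at a phase-specific average power.
    energy = p_comp * t_comp + p_comm * (t_p2p + t_coll)
    return t_total, energy

t, e = predict(n_ranks=64, work_per_rank=1e9, msg_bytes=8192, msgs_per_rank=6)
print(f"time={t:.6f}s energy={e:.2f}J")
```

Splitting the phases is what lets a single model answer both questions at once: a change that shortens communication shows up in the energy term at the communication power draw, not the (typically higher) computation power draw.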


International Parallel and Distributed Processing Symposium | 2015

Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing

Li Tan; Shuaiwen Leon Song; Panruo Wu; Zizhong Chen; Rong Ge; Darren J. Kerbyson

Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between them for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy-saving undervolting approach that leverages mainstream resilience techniques to tolerate the increased failures caused by undervolting. Our strategy is directed by analytic models, which capture the impact of undervolting and the interplay between energy efficiency and resilience. Experimental results on a power-aware cluster demonstrate that our approach can save up to 12.1% energy compared to the baseline, and conserve up to 9.1% more energy than a state-of-the-art DVFS solution.
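The tradeoff can be made concrete with a toy analytic model: dynamic power falls roughly with the square of supply voltage at fixed frequency, while the soft-error rate rises as voltage drops, inflating expected runtime through recovery work. The failure-rate curve and recovery cost below are illustrative assumptions, not the paper's measured model:

```python
import math

def expected_energy(v, t_base=100.0, p_nominal=200.0, v_nominal=1.0,
                    lam0=1e-4, k=40.0, t_recover=5.0):
    """Toy expected energy at supply voltage v (fixed frequency)."""
    power = p_nominal * (v / v_nominal) ** 2        # dynamic power ~ V^2
    lam = lam0 * math.exp(k * (v_nominal - v))      # failures/s rise as V drops
    # Expected time: base runtime plus mean recovery work for failures
    # occurring during it (first-order approximation).
    t_expected = t_base + lam * t_base * t_recover
    return power * t_expected

# Sweep candidate voltages and pick the energy-optimal point.
candidates = [round(0.80 + 0.01 * i, 2) for i in range(21)]
best_v = min(candidates, key=expected_energy)
print(best_v, expected_energy(best_v))
```

Under these placeholder parameters the optimum sits strictly between nominal voltage (wasting power) and the lowest voltage (drowning in recoveries), which is exactly the sweet spot an analytic model lets the runtime system target.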


Journal of Parallel and Distributed Computing | 2015

Scaling Support Vector Machines on modern HPC platforms

Yang You; Haohuan Fu; Shuaiwen Leon Song; Amanda Randles; Darren J. Kerbyson; Andres Marquez; Guangwen Yang; Adolfy Hoisie

Support Vector Machines (SVM) have been widely used in data-mining and Big Data applications as modern commercial databases attach increasing importance to analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies which can make runtime training possible, but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86-based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, running on the NVIDIA K20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
Highlights: an efficient parallel support vector machine for x86-based multi-core platforms; novel optimization techniques to fully utilize the multi-level parallelism; improvements addressing the deficiencies of current SVM tools; selection of the best architecture for given input data patterns to achieve the best performance; and a large-scale distributed algorithm and power-efficient approach.


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil

Yang You; Haohuan Fu; Shuaiwen Leon Song; Maryam Mehri Dehnavi; Lin Gan; Xiaomeng Huang; Guangwen Yang

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits their performance and power efficiency. In this paper, we accelerate the forward-modeling technique on the latest multi-core and many-core architectures such as Intel® Sandy Bridge CPUs, NVIDIA Fermi C2070 GPUs, NVIDIA Kepler K20x GPUs, and the Intel® Xeon Phi co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance. Although our stencil with 114 component variables poses several great challenges for performance optimization, and its low ratio of computation to memory access makes it difficult to fully exploit the evaluated architectures, we manage to achieve performance efficiencies ranging from 4.73% to 20.02% of the theoretical peak. We also conduct cross-platform performance and power analysis (focusing on the Kepler GPU and MIC), and the results can serve as insights for users selecting the most suitable accelerators for their targeted applications.
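The access pattern at the heart of such kernels is a neighborhood sweep over a 3-D grid. A minimal 7-point sketch makes the low compute-to-memory ratio visible: seven loads feed roughly eight arithmetic operations per output point. The coefficients are illustrative; the paper's Lax-Wendroff kernel carries far more component variables:

```python
def stencil_sweep(grid, n, c0=0.5, c1=1.0 / 12.0):
    """One 7-point stencil sweep over an n x n x n grid (boundaries fixed)."""
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[i][j][k] = (c0 * grid[i][j][k]
                                + c1 * (grid[i-1][j][k] + grid[i+1][j][k]
                                        + grid[i][j-1][k] + grid[i][j+1][k]
                                        + grid[i][j][k-1] + grid[i][j][k+1]))
    return out

n = 8
grid = [[[1.0] * n for _ in range(n)] for _ in range(n)]
out = stencil_sweep(grid, n)
print(out[4][4][4])  # interior point on a constant field: 0.5 + 6/12 = 1.0
```

Each output point touches six neighbors along three axes, so effective blocking in all three dimensions, the kind of tuning the paper performs per platform, is what keeps those neighbors in fast memory between uses.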


High Performance Distributed Computing | 2016

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Dingwen Tao; Shuaiwen Leon Song; Sriram Krishnamoorthy; Panruo Wu; Xin Liang; Eddy Z. Zhang; Darren J. Kerbyson; Zizhong Chen

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from the UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in general iterative methods.
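For intuition, the textbook checksum encoding behind ABFT matrix-vector multiplication maintains the invariant that the sum of the output vector equals the column-checksum row applied to the input. The paper's scheme goes further (it decouples checksum updates and also covers memory errors); this is the classic version:

```python
def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def checked_matvec(A, x, tol=1e-9):
    """Matvec with a checksum test: sum(y) must equal (e^T A) x."""
    col_checksum = [sum(col) for col in zip(*A)]   # c = e^T A
    y = matvec(A, x)
    expected = sum(c * xi for c, xi in zip(col_checksum, x))
    ok = abs(sum(y) - expected) <= tol             # invariant holds iff no error
    return y, ok, expected

A = [[2.0, 0.0], [1.0, 3.0]]
x = [1.0, 2.0]
y, ok, expected = checked_matvec(A, x)
print(y, ok)  # [2.0, 7.0] True

# Inject an arithmetic fault and watch the check fire.
y[0] += 5.0
print(abs(sum(y) - expected) > 1e-9)  # True: error detected
```

Detection costs one extra dot product per matvec, which is why ABFT overheads in the low single-digit percentages, as reported above, are achievable for solvers dominated by matrix-vector products.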


Architectural Support for Programming Languages and Operating Systems | 2017

Locality-Aware CTA Clustering for Modern GPUs

Ang Li; Shuaiwen Leon Song; Weifeng Liu; Xu Liu; Akash Kumar; Henk Corporaal

Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: the inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on L1 or L1/Tex unified cache. Through a further quantification process, we demonstrate the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. By leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling in order to group the CTAs with potential reuse together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance through reducing L2 cache transactions by 55%, 65%, 29%, 28% on average for Fermi, Kepler, Maxwell and Pascal, respectively, leading to an average of 1.46x, 1.48x, 1.45x, 1.41x (up to 3.8x, 3.6x, 3.1x, 3.3x) performance speedups for applications with algorithm-related inter-CTA reuse.
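Because the hardware CTA scheduler cannot be programmed directly, software clustering works by remapping the launched CTA index inside the kernel so that CTAs expected to share data become co-resident. The round-robin scheduler model and sizes below are simplifying assumptions for illustration, not NVIDIA's documented behavior or the paper's exact remapping:

```python
def remap(launched, n_ctas, n_sms):
    """Remap a launched CTA id to a logical id so that, under a
    round-robin scheduler, each SM receives a contiguous logical range."""
    per_sm = n_ctas // n_sms
    return (launched % n_sms) * per_sm + launched // n_sms

n_ctas, n_sms = 64, 8

def co_resident_pairs(mapping):
    """Count logical neighbor pairs (i, i+1) landing on the same SM."""
    sm_of = {}
    for launched in range(n_ctas):
        sm_of[mapping(launched)] = launched % n_sms  # round-robin assumption
    return sum(sm_of[i] == sm_of[i + 1] for i in range(n_ctas - 1))

identity = co_resident_pairs(lambda l: l)                       # default launch
clustered = co_resident_pairs(lambda l: remap(l, n_ctas, n_sms))  # remapped
print(identity, clustered)  # prints 0 56
```

Under the default round-robin assignment no logical neighbors share an SM; after remapping, 56 of the 63 neighbor pairs do, which is the co-residency that lets inter-CTA reuse hit in the per-SM L1 instead of the shared L2.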

Collaboration


Dive into Shuaiwen Leon Song's collaborations.

Top Co-Authors

Ang Li
Pacific Northwest National Laboratory

Darren J. Kerbyson
Pacific Northwest National Laboratory

Andres Marquez
Pacific Northwest National Laboratory

Akash Kumar
Dresden University of Technology

Henk Corporaal
Eindhoven University of Technology

Barry Rountree
Lawrence Livermore National Laboratory