Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Huiyang Zhou is active.

Publication


Featured research published by Huiyang Zhou.


programming language design and implementation | 2010

A GPGPU compiler for memory optimization and parallelism management

Yi Yang; Ping Xiang; Jingfei Kong; Huiyang Zhou

This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of the GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but written without any consideration for performance. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. Experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to that of the highly tuned NVIDIA CUBLAS 2.2 library, with speedups of up to 128x over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.
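
The pair of kernels below is a minimal sketch of one optimization the compiler automates, memory coalescing; the kernel names and the row-sum computation are illustrative, not the compiler's actual input or output. In the naive version, consecutive threads read addresses n floats apart; the rewritten version maps threadIdx.x to consecutive columns so each warp issues one coalesced transaction, then reduces the per-thread partial sums in shared memory.

```cuda
// Naive: thread t sums row t, so at every iteration consecutive threads
// touch addresses n floats apart (uncoalesced, one transaction per thread).
__global__ void row_sum_naive(const float* a, float* out, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float s = 0.0f;
    for (int col = 0; col < n; ++col)
        s += a[row * n + col];
    out[row] = s;
}

#define TILE 32

// Coalesced: a 32x32 block assigns threadIdx.x to consecutive columns, so
// each warp reads 32 consecutive floats per transaction; the partial sums
// are then reduced across threadIdx.x in shared memory.
__global__ void row_sum_coalesced(const float* a, float* out, int n) {
    __shared__ float partial[TILE][TILE + 1];   // +1 pad avoids bank conflicts
    int row = blockIdx.x * TILE + threadIdx.y;
    float s = 0.0f;
    if (row < n)
        for (int col = threadIdx.x; col < n; col += TILE)
            s += a[row * n + col];              // coalesced across the warp
    partial[threadIdx.y][threadIdx.x] = s;
    __syncthreads();
    for (int w = TILE / 2; w > 0; w >>= 1) {    // intra-row tree reduction
        if (threadIdx.x < w)
            partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y][threadIdx.x + w];
        __syncthreads();
    }
    if (threadIdx.x == 0 && row < n)
        out[row] = partial[threadIdx.y][0];
}
// launch: row_sum_coalesced<<<(n + TILE - 1) / TILE, dim3(TILE, TILE)>>>(a, out, n);
```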


international conference on parallel architectures and compilation techniques | 2001

Adaptive Mode Control: A Static-Power-Efficient Cache Design

Huiyang Zhou; Mark C. Toburen; Eric Rotenberg; Thomas M. Conte

Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation, but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information) and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active, and the actual miss rate can be made to precisely track it. Simulations show that an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64 KB caches.
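
As a rough illustration of the control loop described above, the host-side sketch below (assumed structures and tuning constants, not the paper's hardware) keeps tags always powered, so a tag hit on a sleeping line can be counted as a "sleep miss", i.e., a miss an all-active cache would have avoided; the decay interval is then adjusted so sleep misses stay a small fraction of the ideal misses.

```cpp
#include <cstdint>
#include <vector>

struct Line { uint64_t tag = 0; bool valid = false, awake = false; int idle = 0; };

class AmcCache {                       // direct-mapped, 64-byte lines, for brevity
    std::vector<Line> lines;
    int decay_interval = 2048;         // idle accesses before a line sleeps
    uint64_t ideal_misses = 0, sleep_misses = 0, accesses = 0;
public:
    explicit AmcCache(size_t n) : lines(n) {}
    bool access(uint64_t addr) {
        Line& l = lines[(addr / 64) % lines.size()];
        uint64_t tag = addr / 64 / lines.size();
        bool tag_hit  = l.valid && l.tag == tag;   // tags never sleep
        bool data_hit = tag_hit && l.awake;
        if (!tag_hit)      ++ideal_misses;         // would miss even if all-active
        else if (!l.awake) ++sleep_misses;         // extra miss caused by sleep mode
        l = {tag, true, true, 0};                  // (re)fill and wake the line
        tick();
        return data_hit;
    }
private:
    void tick() {
        for (Line& l : lines)
            if (l.awake && ++l.idle >= decay_interval) { l.awake = false; l.idle = 0; }
        if (++accesses % 4096 == 0) {              // periodically retune the interval
            if (sleep_misses * 64 > ideal_misses) decay_interval *= 2;  // too aggressive
            else if (decay_interval > 64)         decay_interval /= 2;  // sleep sooner
        }
    }
};
```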


international conference on parallel architectures and compilation techniques | 2005

Dual-core execution: building a highly scalable single-thread instruction window

Huiyang Zhou

Current integration trends favor single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and a back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
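
A toy sketch of the front processor's key rule follows (assumed instruction format; the real design is a superscalar pipeline, not an interpreter): a cache-missing load retires an INV value instead of blocking, INV poisons its dependents, and every processed instruction retires into the queue that feeds the back core.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

using Val = std::optional<int64_t>;   // std::nullopt models the INV value

struct Insn { enum Op { LOAD, ADD } op; int dst, src1, src2; uint64_t addr; };

struct FrontCore {
    std::vector<Val> regs = std::vector<Val>(32, Val{0});
    std::deque<Insn> queue;            // retired instructions for the back core

    void execute(const Insn& in, bool cache_hit, int64_t mem_val) {
        switch (in.op) {
        case Insn::LOAD:
            // A missing load does NOT block: it yields INV, and the miss
            // itself warms the cache so the back core will hit later.
            regs[in.dst] = cache_hit ? Val{mem_val} : std::nullopt;
            break;
        case Insn::ADD:
            // INV propagates through dependents instead of stalling.
            regs[in.dst] = (regs[in.src1] && regs[in.src2])
                         ? Val{*regs[in.src1] + *regs[in.src2]}
                         : std::nullopt;
            break;
        }
        queue.push_back(in);           // retire into the inter-core queue
    }
};
```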


high-performance computer architecture | 2009

Hardware-software integrated approaches to defend against software cache-based side channel attacks

Jingfei Kong; Onur Aciicmez; Jean-Pierre Seifert; Huiyang Zhou

Software cache-based side channel attacks present serious threats to modern computer systems. Using caches as a side channel, these attacks are able to derive secret keys used in cryptographic operations through legitimate activities. Among existing countermeasures, software solutions are typically application specific and incur substantial performance overhead. Recent hardware proposals including the Partition-Locked cache (PLcache) and Random-Permutation cache (RPcache) [23], although very effective in reducing performance overhead while enhancing the security level, may still be vulnerable to advanced cache attacks. In this paper, we propose three hardware-software approaches to defend against software cache-based attacks - they present different tradeoffs between hardware complexity and performance overhead. First, we propose to use preloading to secure the PLcache. Second, we leverage informing loads, which is a lightweight architectural support originally proposed to improve memory performance, to protect the RPcache. Third, we propose novel software permutation to replace the random permutation hardware in the RPcache. This way, regular caches can be protected with hardware support for informing loads. In our experiments, we analyze various processor models for their vulnerability to cache attacks and demonstrate that even for the processor model most vulnerable to cache attacks, our proposed software-hardware integrated schemes provide strong security protection.
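
The preloading idea for the PLcache can be pictured with the short sketch below; lock_line() stands in for the PLcache's locking support (not a real ISA primitive), and the table name is illustrative. Touching and locking every line of a lookup table before any key-dependent access means later lookups cannot miss, so their indices cannot leak through the cache.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kLineBytes = 64;

extern const uint8_t sbox[256];       // AES-style key-dependent lookup table

inline void lock_line(const void*) { /* PLcache lock request would go here */ }

volatile uint8_t sink;                // keeps the touch loads from being elided

// Call before entering any key-dependent code path.
void preload_table(const uint8_t* table, size_t bytes) {
    for (size_t off = 0; off < bytes; off += kLineBytes) {
        sink = table[off];            // bring the line into the cache...
        lock_line(table + off);       // ...and pin it in the locked partition
    }
}
// usage: preload_table(sbox, sizeof(sbox));
```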


acm sigplan symposium on principles and practice of parallel programming | 2014

yaSpMV: yet another SpMV framework on GPUs

Shengen Yan; Chao Li; Yunquan Zhang; Huiyang Zhou

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory bandwidth demand remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor-tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs; up to 150% and 42% on average on GTX480 GPUs) and over several recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs; up to 162% and 40% on average over clSpMV on GTX480 GPUs).
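
The bit-flag idea behind BCCOO can be seen in the simplified serial sketch below (illustrative field names; the real format is blocked, sliced into vertical strips, and GPU-parallel, and this sketch ignores empty rows): one bit per nonzero marks a row start, and a segmented sum over the element-wise products reconstructs y without storing full row indices.

```cpp
#include <cstdint>
#include <vector>

struct BitFlagCoo {                   // simplified, unblocked stand-in for BCCOO
    std::vector<uint8_t>  new_row;    // 1 bit per nonzero in the paper; a byte here
    std::vector<uint32_t> col;
    std::vector<float>    val;
};

// Segmented sum over the products: a set flag closes the current row segment.
// Assumes new_row[0] == 1 and no empty rows; y is sized to the row count.
void spmv(const BitFlagCoo& m, const std::vector<float>& x, std::vector<float>& y) {
    int row = -1;
    float acc = 0.0f;
    for (size_t i = 0; i < m.val.size(); ++i) {
        if (m.new_row[i]) {
            if (row >= 0) y[row] = acc;   // flush the finished row
            ++row;
            acc = 0.0f;
        }
        acc += m.val[i] * x[m.col[i]];
    }
    if (row >= 0) y[row] = acc;           // flush the last row
}
```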


high-performance computer architecture | 2012

CPU-assisted GPGPU on fused CPU-GPU architectures

Yi Yang; Ping Xiang; Mike Mantor; Huiyang Zhou

This paper presents a novel approach to utilize the CPU resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures, the GPU and the CPU are integrated on the same die and share the on-chip L3 cache and off-chip memory, similar to the latest Intel Sandy Bridge and AMD accelerated processing unit (APU) platforms. In our proposed CPU-assisted GPGPU, after the CPU launches a GPU program, it executes a pre-execution program, which is generated automatically from the GPU kernel using our proposed compiler algorithms and contains memory access instructions of the GPU kernel for multiple thread-blocks. The CPU pre-execution program runs ahead of GPU threads because (1) the CPU pre-execution thread only contains memory fetch instructions from GPU kernels and not floating-point computations, and (2) the CPU runs at higher frequencies and exploits higher degrees of instruction-level parallelism than GPU scalar cores. We also leverage the prefetcher at the L2 cache on the CPU side to increase the memory traffic from the CPU. As a result, the memory accesses of GPU threads hit in the L3 cache and their latency can be drastically reduced. Since our pre-execution is directly controlled by user-level applications, it enjoys both high accuracy and flexibility. Our experiments on a set of benchmarks show that our proposed pre-execution improves the performance by up to 113% and 21.4% on average.
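
A hedged sketch of what a generated pre-execution loop could look like (array and function names are illustrative, not actual compiler output): it replays only the kernel's reads, one touch per cache line, so the data is resident in the shared L3 before the corresponding GPU thread-blocks execute.

```cpp
#include <atomic>
#include <cstddef>

constexpr size_t kLineBytes = 64;

std::atomic<float> dummy;   // keeps the touch loads from being optimized away

// Pre-execute the reads of thread-blocks [first_tb, last_tb) of a kernel
// that streams through `in`, elems_per_tb floats per thread-block.
void cpu_pre_execute(const float* in, size_t elems_per_tb,
                     size_t first_tb, size_t last_tb) {
    float acc = 0.0f;
    for (size_t tb = first_tb; tb < last_tb; ++tb) {
        const float* base = in + tb * elems_per_tb;
        for (size_t i = 0; i < elems_per_tb; i += kLineBytes / sizeof(float))
            acc += base[i];                        // one touch per cache line
    }
    dummy.store(acc, std::memory_order_relaxed);   // make acc observable
}
```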


ACM Transactions on Embedded Computing Systems | 2003

Adaptive mode control: A static-power-efficient cache design

Huiyang Zhou; Mark C. Toburen; Eric Rotenberg; Thomas M. Conte

Lower threshold voltages in deep submicron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation, but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance. Current approaches dynamically control the operating mode of large groups of cache lines or even individual cache lines. However, the performance monitoring mechanism that controls the percentage of sleep-mode lines, and identifies particular lines for sleep mode, is somewhat arbitrary. There is no way to know what the performance could be with all cache lines active, so arbitrary miss rate targets are set (perhaps on a per-benchmark basis using profile information), and the control mechanism tracks these targets. We propose applying sleep mode only to the data store and not the tag store. By keeping the entire tag store active, the hardware knows what the hypothetical miss rate would be if all data lines were active, and the actual miss rate can be made to precisely track it. Simulations show that an average of 73% of I-cache lines and 54% of D-cache lines are put in sleep mode with an average IPC impact of only 1.7%, for 64 KB caches.


IEEE Transactions on Computers | 2005

Enhancing memory-level parallelism via recovery-free value prediction

Huiyang Zhou; Thomas M. Conte

The ever-increasing computational power of contemporary microprocessors significantly reduces the execution time spent on arithmetic computations (i.e., computations not involving slow memory operations such as cache misses). Therefore, for memory-intensive workloads, it becomes more important to overlap multiple cache misses than to overlap slow memory operations with other computations. In this paper, we propose a novel technique to parallelize sequential cache misses, thereby increasing memory-level parallelism (MLP). Our idea is based on value prediction, which was originally proposed as an instruction-level parallelism (ILP) optimization to break true data dependencies. Here, we advocate value prediction for its capability to enhance MLP rather than ILP. We propose using value prediction and value-speculative execution only for prefetching, so that the complex prediction validation and misprediction recovery mechanisms are avoided while better performance is achieved for memory-intensive workloads. The minor hardware modifications that are required also enable aggressive memory disambiguation for prefetching. The experimental results show that our technique enhances MLP effectively and achieves significant speedups, even with a simple stride value predictor.
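
The sketch below (assumed table layout, not the paper's exact predictor) shows why the scheme is recovery-free: a stride predictor guesses a missing load's value, and the guess is used only to compute addresses to prefetch, so a wrong guess wastes a prefetch instead of requiring validation or rollback.

```cpp
#include <cstdint>
#include <unordered_map>

struct StrideEntry { uint64_t last = 0; int64_t stride = 0; bool stable = false; };

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table;   // keyed by load PC
public:
    // Train on each committed load value; return a guess for the next one.
    uint64_t train_and_predict(uint64_t pc, uint64_t value) {
        StrideEntry& e = table[pc];
        int64_t s = static_cast<int64_t>(value - e.last);
        e.stable = (s == e.stride);      // confident only on a repeated stride
        e.stride = s;
        e.last   = value;
        return value + static_cast<uint64_t>(s);
    }
    bool confident(uint64_t pc) { return table[pc].stable; }
};

// On a cache-missing pointer load, a confident prediction lets the dependent
// load be issued as a prefetch instead of waiting for the miss to return:
//   if (pred.confident(pc)) prefetch(predicted_value + field_offset);
// No validation is needed: a bad guess only prefetched a useless line.
```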


computer and communications security | 2008

Deconstructing new cache designs for thwarting software cache-based side channel attacks

Jingfei Kong; Onur Aciicmez; Jean-Pierre Seifert; Huiyang Zhou

Software cache-based side channel attacks present a serious threat to computer systems. Previously proposed countermeasures were either too costly for practical use or effective only against particular attacks. A recent work therefore identified cache interference in general as the root cause and proposed two new cache designs, namely the partition-locked cache (PLcache) and the random permutation cache (RPcache), to defeat cache-based side channel attacks by eliminating or obfuscating cache interference. In this paper, we analyze these new cache designs, identify significant vulnerabilities and shortcomings, and propose solutions and improvements over the original designs to overcome them.


dependable systems and networks | 2010

Improving privacy and lifetime of PCM-based main memory

Jingfei Kong; Huiyang Zhou

Phase change memory (PCM) is a promising technology for computer memory systems. However, the non-volatile nature of PCM poses serious threats to computer privacy. The low programming endurance of PCM devices also limits the lifetime of PCM-based main memory (PRAM). In this paper, we first adopt counter-mode encryption for privacy protection and show that encryption significantly reduces the effectiveness of some previously proposed wear-leveling techniques for PRAM. To mitigate such adverse impact, we propose simple, yet effective extensions to the encryption scheme. In addition, we propose to reuse the encryption counters as age counters and to dynamically adjust the strength of error correction code (ECC) to extend the lifetime of PRAM. Our experiments show that our mechanisms effectively achieve privacy protection and lifetime extension for PRAM with very low performance overhead.
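
To make the counter reuse concrete, here is a hedged sketch (aes_block() is a toy keyed mixer standing in for a real AES call, and the threshold is an assumed endurance knee, not a figure from the paper): each line's write counter provides the fresh counter-mode pad and, because it increments on every write, doubles as the age estimate that triggers stronger ECC.

```cpp
#include <cstdint>

// Toy keyed mixer standing in for one AES block encryption.
uint64_t aes_block(uint64_t key, uint64_t tweak) {
    uint64_t x = key ^ (tweak * 0x9E3779B97F4A7C15ull);
    x ^= x >> 32; x *= 0xD6E8FEB86659FD93ull; x ^= x >> 32;
    return x;
}

struct PcmLine {
    uint64_t counter = 0;    // write counter: counter-mode IV *and* age
    uint64_t cipher  = 0;    // stored ciphertext (one word, for brevity)
    bool strong_ecc  = false;
};

constexpr uint64_t kStrongEccThreshold = 1ull << 26;  // assumed endurance knee

void write_line(PcmLine& line, uint64_t addr, uint64_t plaintext, uint64_t key) {
    ++line.counter;                                    // fresh pad every write
    line.cipher = plaintext ^ aes_block(key, addr ^ line.counter);
    if (line.counter > kStrongEccThreshold)
        line.strong_ecc = true;    // same counter, reused as an age estimate
}

uint64_t read_line(const PcmLine& line, uint64_t addr, uint64_t key) {
    return line.cipher ^ aes_block(key, addr ^ line.counter);
}
```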

Collaboration


Dive into Huiyang Zhou's collaborations.

Top Co-Authors

Yi Yang (Princeton University)
Ping Xiang (North Carolina State University)
Thomas M. Conte (Georgia Institute of Technology)
Chao Li (North Carolina State University)
Jingfei Kong (University of Central Florida)
Mike Mantor (Advanced Micro Devices)
Hongliang Gao (University of Central Florida)
Eric Rotenberg (North Carolina State University)
Hongwen Dai (North Carolina State University)