Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hsiang-Yun Cheng is active.

Publication


Featured research published by Hsiang-Yun Cheng.


Design Automation Conference | 2015

Core vs. uncore: the heart of darkness

Hsiang-Yun Cheng; Jia Zhan; Jishen Zhao; Yuan Xie; Jack Sampson; Mary Jane Irwin

Even though Moore's Law continues to provide increasing transistor counts, the rise of the utilization wall limits the number of transistors that can be powered on and results in a large region of dark silicon. Prior studies have proposed energy-efficient core designs to address the “dark silicon” problem. Nevertheless, research on addressing dark silicon challenges in uncore components, such as the shared cache and on-chip interconnect, that contribute significant on-chip power consumption is largely unexplored. In this paper, we first illustrate that the power consumption of uncore components cannot be ignored if the chip's power constraint is to be met. We then introduce techniques to design energy-efficient uncore components, including the shared cache and on-chip interconnect. The design challenges and opportunities to exploit 3D techniques and non-volatile memory (NVM) in dark-silicon-aware architecture are also discussed.


International Symposium on Computer Architecture | 2016

LAP: loop-block aware inclusion properties for energy-efficient asymmetric last level caches

Hsiang-Yun Cheng; Jishen Zhao; Jack Sampson; Mary Jane Irwin; Aamer Jaleel; Yu Lu; Yuan Xie

Emerging non-volatile memory (NVM) technologies, such as spin-transfer torque RAM (STT-RAM), are attractive options for replacing or augmenting SRAM in implementing last-level caches (LLCs). However, the asymmetric read/write energy and latency associated with NVM introduces new challenges in designing caches where, in contrast to SRAM, dynamic energy from write operations can be responsible for a larger fraction of total cache energy than leakage. These properties mean that no single traditional inclusion policy is dominant in terms of LLC energy consumption for asymmetric LLCs. We propose a novel selective inclusion policy, Loop-block-Aware Policy (LAP), to reduce energy consumption in LLCs with asymmetric read/write properties. In order to eliminate redundant writes to the LLC, LAP incorporates advantages from both non-inclusive and exclusive designs to selectively cache only part of upper-level data in the LLC. Results show that LAP outperforms other variants of selective inclusion policies and consumes 20% and 12% less energy than non-inclusive and exclusive STT-RAM-based LLCs, respectively. We extend LAP to a system with SRAM/STT-RAM hybrid LLCs to achieve energy-efficient data placement, reducing the energy consumption by 22% and 15% over non-inclusion and exclusion on average, with average-case performance improvements, small worst-case performance loss, and minimal hardware overheads.
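The core idea of selective insertion under asymmetric write cost can be illustrated with a toy model. This is a sketch, not the paper's mechanism: the reuse heuristic, energy constants, and function names below are hypothetical.

```python
# Illustrative energy constants: STT-RAM writes cost far more than reads.
WRITE_ENERGY = 5.0
READ_ENERGY = 1.0

def llc_write_energy(evictions, reuse_counter, selective=True, threshold=2):
    """Sum the LLC write energy for blocks evicted from the upper-level
    cache. With selective insertion, blocks whose estimated reuse falls
    below the threshold bypass the LLC (as an exclusive design would for
    some data), eliminating their write energy; with selective=False,
    every eviction is written back to the LLC."""
    total = 0.0
    for block in evictions:
        if not selective or reuse_counter.get(block, 0) >= threshold:
            total += WRITE_ENERGY
    return total
```

Filtering out low-reuse blocks before they reach the LLC is what eliminates redundant writes, which matters precisely because write energy dominates in an asymmetric cache.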


International Conference on Human-Computer Interaction | 2010

An analytical model to exploit memory task scheduling

Hsiang-Yun Cheng; Jian Li; Chia-Lin Yang

The Memory Wall has been a well-known obstacle to processor performance improvement, and the dawn of many-core processors will further exacerbate the problem. As a result, efficient memory task scheduling has become an important means of sustaining performance growth. In this paper, we first develop an analytical model to capture the essence of on-chip compute and off-chip communication as expressed in the stream programming model. It estimates the potential speedup that can be achieved by restricting the number of simultaneous memory tasks to reduce memory bandwidth contention. We then corroborate the analytical model with experimental results from task scheduling on real hardware. Correlation between the analytical and experimental results offers both insight into the benchmarks running on the hardware and opportunities to extend the analytical model. Our results show that restricting the number of simultaneous memory tasks achieves up to 60% performance improvement on a pool of synthetic workloads.
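The intuition that restricting concurrency can help can be sketched with a minimal contention model. This is an illustration under assumed parameters, not the model from the paper: the interference factor and bandwidth numbers are hypothetical.

```python
def effective_bandwidth(n_tasks, peak_bw, interference=0.1):
    """Deliverable bandwidth shrinks as concurrent memory tasks interfere
    (e.g., row-buffer conflicts, bus turnarounds). The linear interference
    factor is an illustrative assumption."""
    return peak_bw / (1.0 + interference * (n_tasks - 1))

def estimated_speedup(n_tasks, per_task_bw, peak_bw, interference=0.1):
    """Speedup over a single memory task running alone: each task gets its
    demanded bandwidth until the shared (and shrinking) budget saturates."""
    bw = effective_bandwidth(n_tasks, peak_bw, interference)
    delivered = min(n_tasks * per_task_bw, bw)
    return delivered / per_task_bw
```

Sweeping `n_tasks` shows a sweet spot: with a peak of 10 units and 2 units demanded per task, speedup peaks at four simultaneous tasks and then declines, which is the behavior that motivates capping the number of concurrent memory tasks.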


International Symposium on Low Power Electronics and Design | 2014

EECache: exploiting design choices in energy-efficient last-level caches for chip multiprocessors

Hsiang-Yun Cheng; Matthew Poremba; Narges Shahidi; Ivan Stalev; Mary Jane Irwin; Mahmut T. Kandemir; Jack Sampson; Yuan Xie

Power management for large last-level caches (LLCs) is important in chip multiprocessors (CMPs), as the leakage power of LLCs accounts for a significant fraction of the limited on-chip power budget. Since not all workloads need the entire cache, portions of a shared LLC can be disabled to save energy. In this paper, we explore different design choices, from circuit-level cache organization to microarchitectural management policies, to propose a low-overhead run-time mechanism for energy reduction in the shared LLC. Results show that our design (EECache) provides 14.1% energy savings at only 1.2% performance degradation on average, with negligible hardware overhead.


International Symposium on Circuits and Systems | 2016

Designs of emerging memory based non-volatile TCAM for Internet-of-Things (IoT) and big-data processing: A 5T2R universal cell

Meng-Fan Chang; Ching-Hao Chuang; Yen-Ning Chiang; Shyh-Shyuan Sheu; Chia-Chen Kuo; Hsiang-Yun Cheng; Jack Sampson; Mary Jane Irwin

Many search engines and filters for the Internet of Things and big data employ ternary content-addressable memory (TCAM) to suppress power consumption in the transmission of data between end devices and servers. Nonvolatile TCAMs (nvTCAMs) are designed to achieve zero standby power with smaller area overhead and faster power-off/on operations than conventional TCAM+NVM two-macro schemes. In this paper, we discuss the challenges involved in the design of nvTCAMs and propose a universal 5T2R nvTCAM cell with tolerance for the various R-ratios and write parameters associated with emerging memory devices. A 128×64b nvTCAM macro was fabricated using HfO ReRAM and a 90nm CMOS process for concept verification.


ACM Transactions on Architecture and Code Optimization | 2015

EECache: A Comprehensive Study on the Architectural Design for Energy-Efficient Last-Level Caches in Chip Multiprocessors

Hsiang-Yun Cheng; Matt Poremba; Narges Shahidi; Ivan Stalev; Mary Jane Irwin; Mahmut T. Kandemir; Jack Sampson; Yuan Xie

Power management for large last-level caches (LLCs) is important in chip multiprocessors (CMPs), as the leakage power of LLCs accounts for a significant fraction of the limited on-chip power budget. Since not all workloads running on CMPs need the entire cache, portions of a large, shared LLC can be disabled to save energy. In this article, we explore different design choices, from circuit-level cache organization to microarchitectural management policies, to propose a low-overhead runtime mechanism for energy reduction in the large, shared LLC. We first introduce a slice-based cache organization that can shut down parts of the shared LLC with minimal circuit overhead. Based on this slice-based organization, part of the shared LLC can be turned off according to the spatial and temporal cache access behavior captured by low-overhead sampling-based hardware. In order to eliminate the performance penalties caused by flushing data before powering off a cache slice, we propose data migration policies to prevent the loss of useful data in the LLC. Results show that our energy-efficient cache design (EECache) provides 14.1% energy savings at only 1.2% performance degradation and incurs negligible hardware overhead compared to prior work.
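The slice power-off decision described above can be sketched as a simple utility test over sampled access counts. This is a toy policy under assumed names and thresholds, not the paper's actual hardware mechanism.

```python
def choose_slices_to_disable(slice_hits, total_accesses, min_hit_share=0.05):
    """Toy runtime policy: keep LLC slices whose sampled hit share exceeds
    a threshold; slices below it become power-off candidates. In a design
    like EECache, hot data in a candidate slice would be migrated to the
    surviving slices rather than flushed, to avoid a performance penalty."""
    return [s for s, hits in slice_hits.items()
            if hits / total_accesses < min_hit_share]
```

The key trade-off is that the sampling hardware must be cheap enough that the decision logic itself does not erode the leakage savings it enables.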


International Symposium on Performance Analysis of Systems and Software | 2017

Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator

Li Wang; Ren-Wei Tsai; Shao-Chung Wang; Kun-Chih Chen; Po-Han Wang; Hsiang-Yun Cheng; Yi-Chung Lee; Sheng-Jie Shu; Chun-Chieh Yang; Min-Yih Hsu; Li-Chen Kan; Chao-Lin Lee; Tzu-Chieh Yu; Rih-Ding Peng; Chia-Lin Yang; Yuan-Shin Hwang; Jenq Kuen Lee; Shiao-Li Tsao; Ming Ouhyoung

Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate research in this direction. While a few integrated CPU-GPU simulators are available, similar tools that support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features, are currently missing. In this paper, we extend an existing integrated CPU-GPU simulator, gem5-gpu, to support OpenCL 2.0. In addition, we conduct experiments on the extended simulator to assess the impact of the new features introduced by OpenCL 2.0. Our OpenCL 2.0 compatible simulator is successfully validated against a state-of-the-art commercial product and is expected to help boost future studies of heterogeneous CPU-GPU systems.


IEEE Computer Architecture Letters | 2017

Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling

Li-Jhan Chen; Hsiang-Yun Cheng; Po-Han Wang; Chia-Lin Yang

Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multi-threading of GPGPUs incurs serious cache contention, as the cache lines brought by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides 25 and 9 percent performance improvement over the commonly-employed round-robin scheduler and the state-of-the-art scheduler, respectively.
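The scheduling idea above, picking the next thread block by its data overlap with blocks already resident on a core, can be sketched as follows. This is an illustration, not the paper's scheduler: the page-set representation and function names are hypothetical stand-ins for the dynamic locality estimation.

```python
def pick_next_block(pending, resident, pages_touched):
    """Toy locality-aware pick: choose the pending thread block that shares
    the most pages with blocks already resident on the core. Ties fall back
    to the pending (round-robin) order, since max() keeps the first maximal
    element. pages_touched maps a block id to the set of pages it accesses."""
    resident_pages = (set().union(*(pages_touched[b] for b in resident))
                      if resident else set())
    return max(pending, key=lambda b: len(pages_touched[b] & resident_pages))
```

A round-robin scheduler would simply take the head of `pending`; the locality-aware pick instead favors a block whose working set is already partially cached, reducing contention in the small shared cache.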


ACM Transactions on Design Automation of Electronic Systems | 2015

Adaptive Burst-Writes (ABW): Memory Requests Scheduling to Reduce Write-Induced Interference

Hsiang-Yun Cheng; Mary Jane Irwin; Yuan Xie

Main memory latencies have become a major performance bottleneck for chip multiprocessors (CMPs). Since reads are on the critical path, existing memory controllers prioritize reads over writes. However, writes must eventually be processed when the write queue is full. These writes are serviced in a burst to reduce the bus turnaround delay and increase the row-buffer locality. Unfortunately, a large number of reads may suffer long queuing delay while the burst-writes are serviced. The long write latency of future nonvolatile memory will further exacerbate the long queuing delay of reads during burst-writes. In this article, we propose a run-time mechanism, Adaptive Burst-Writes (ABW), to reduce the queuing delay of reads. Based on the row-buffer hit rate of writes and the arrival rate of reads, we dynamically control the number of writes serviced in a burst to trade off the write service time and the queuing latency of reads. For prompt adjustment, our history-based mechanism further terminates the burst-writes early when the row-buffer hit rate of writes in the previous burst-writes is low. As a result, our policy improves system throughput by up to 28% (average 10%) and 43% (average 14%) in CMPs with DRAM-based and PCM-based main memory, respectively.
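The adaptive trade-off described above can be sketched as a small budget function. This is a toy illustration, not ABW itself: the scaling formula, constants, and names are hypothetical.

```python
def burst_write_budget(write_hit_rate, read_arrival_rate,
                       base_budget=32, min_budget=8):
    """Toy adaptive drain policy: service many writes per burst when they
    hit the row buffer (cheap writes) and few reads are arriving, but
    shrink the burst when writes miss the row buffer or reads are queuing
    up, so pending reads are not stalled behind a long write drain."""
    budget = base_budget * write_hit_rate / (1.0 + read_arrival_rate)
    return max(min_budget, int(budget))
```

With perfect write locality and no incoming reads the full budget is drained; with poor locality and heavy read traffic the burst collapses to its floor, approximating the early-termination behavior the abstract describes.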


IEEE Internet Computing | 2010

Memory Latency Reduction via Thread Throttling

Hsiang-Yun Cheng; Chung-Hsiang Lin; Jian Li; Chia-Lin Yang

Collaboration


Dive into Hsiang-Yun Cheng's collaborations.

Top Co-Authors

Mary Jane Irwin (Pennsylvania State University)
Jack Sampson (Pennsylvania State University)
Yuan Xie (University of California)
Chia-Lin Yang (National Taiwan University)
Ivan Stalev (Pennsylvania State University)
Jishen Zhao (University of California)
Mahmut T. Kandemir (Pennsylvania State University)
Narges Shahidi (Pennsylvania State University)
Po-Han Wang (National Taiwan University)