Publication


Featured research published by Chunyue Liu.


international symposium on microarchitecture | 2008

Power reduction of CMP communication networks via RF-interconnects

M-C. Frank Chang; Jason Cong; Adam Kaplan; Chunyue Liu; Mishali Naik; Jagannath Premkumar; Glenn Reinman; Eran Socher; Sai-Wang Tam

As chip multiprocessors scale to a greater number of processing cores, on-chip interconnection networks will experience dramatic increases in both bandwidth demand and power dissipation. Fortunately, promising gains can be realized via integration of radio frequency interconnect (RF-I) through on-chip transmission lines with traditional interconnects implemented with RC wires. While prior work has considered the latency advantage of RF-I, we demonstrate three further advantages of RF-I: (1) RF-I bandwidth can be flexibly allocated to provide an adaptive NoC, (2) RF-I can enable a dramatic power and area reduction by simplification of NoC topology, and (3) RF-I provides natural and efficient support for multicast. In this paper, we propose a novel interconnect design, exploiting dynamic RF-I bandwidth allocation to realize a reconfigurable network-on-chip architecture. We find that our adaptive RF-I architecture on top of a mesh with 4B links can even outperform the baseline with 16B mesh links by about 1%, and reduces NoC power by approximately 65% including the overhead incurred for supporting RF-I.
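To make the dynamic bandwidth allocation idea concrete, the sketch below (not the paper's actual allocator) periodically reassigns a small pool of RF-I channels to the source-destination pairs with the heaviest recent traffic; the channel count and traffic figures are illustrative assumptions.

```python
# Minimal sketch (not the paper's allocator): each sampling interval, the
# hottest source->destination pairs are mapped onto RF-I channels; any pairs
# beyond the channel count keep using the baseline mesh links.
from collections import Counter

NUM_RF_CHANNELS = 2  # assumed small pool of RF-I carrier channels for the example

def allocate_rf_channels(traffic_counts):
    """traffic_counts: Counter mapping (src, dst) -> flits observed this interval."""
    hottest = traffic_counts.most_common(NUM_RF_CHANNELS)
    return {pair: ch for ch, (pair, _count) in enumerate(hottest)}

# Example interval: (2, 14) and (0, 9) dominate, so they receive RF-I shortcuts.
traffic = Counter({(2, 14): 5400, (0, 9): 3100, (7, 3): 120, (5, 5): 40})
print(allocate_rf_channels(traffic))   # -> {(2, 14): 0, (0, 9): 1}
```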


design, automation, and test in europe | 2012

Dynamically reconfigurable hybrid cache: an energy-efficient last-level cache design

Yu-Ting Chen; Jason Cong; Hui Huang; Bin Liu; Chunyue Liu; Miodrag Potkonjak; Glenn Reinman

The recent development of non-volatile memory (NVM), such as spin-torque transfer magnetoresistive RAM (STT-RAM) and phase-change RAM (PRAM), with the advantages of low leakage and high density, provides an energy-efficient alternative to traditional SRAM in cache systems. We propose a novel reconfigurable hybrid cache architecture (RHC), in which NVM is incorporated into the last-level cache together with SRAM. RHC can be reconfigured by powering SRAM/NVM arrays on or off in a way-based manner. In this work, we discuss both the architecture and circuit design issues for RHC. Furthermore, we provide hardware-based mechanisms to dynamically reconfigure RHC on the fly based on cache demand. Experimental results on a wide range of benchmarks show that the proposed RHC achieves average energy savings of 63%, 48%, and 25% over a non-reconfigurable SRAM-based cache, a non-reconfigurable hybrid cache, and a reconfigurable SRAM-based cache, respectively, while maintaining system performance (at most 4% performance overhead).
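As a rough illustration of way-based reconfiguration (not the paper's hardware mechanism), the sketch below powers an additional way on when the observed miss rate is high and powers one off when it is low; the thresholds and way counts are assumed for the example.

```python
# Minimal sketch of a way-based reconfiguration heuristic. Thresholds and the
# minimum/maximum way counts are illustrative assumptions, not the RHC design.
def reconfigure_ways(active_ways, misses, accesses,
                     max_ways=16, min_ways=2,
                     high_miss=0.05, low_miss=0.01):
    miss_rate = misses / max(accesses, 1)
    if miss_rate > high_miss and active_ways < max_ways:
        return active_ways + 1          # demand is high: power one more way on
    if miss_rate < low_miss and active_ways > min_ways:
        return active_ways - 1          # demand is low: power one way off to cut leakage
    return active_ways

print(reconfigure_ways(active_ways=8, misses=700, accesses=10000))  # -> 9
```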


international symposium on low power electronics and design | 2011

An energy-efficient adaptive hybrid cache

Jason Cong; Karthik Gururaj; Hui Huang; Chunyue Liu; Glenn Reinman; Yi Zou

By reconfiguring part of the cache as software-managed scratchpad memory (SPM), hybrid caches can handle both unknown and predictable memory access patterns. However, existing hybrid caches provide a flexible partitioning of cache and SPM without adapting to run-time cache behavior, and previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access. In this paper an adaptive hybrid cache is proposed to dynamically remap SPM blocks from high-demand cache sets to low-demand cache sets. This achieves 19%, 25%, 18%, and 18% energy-runtime-product reductions over four previous representative techniques on a wide range of benchmarks.
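The remapping idea can be pictured with the sketch below; this is an illustrative software model, not the proposed hardware, and the per-set miss counters and threshold are assumptions.

```python
# Minimal sketch: SPM blocks are carved out of cache sets, and a block is moved
# away from a set showing high conflict-miss demand toward the least-demanded set.
def remap_spm_blocks(spm_home_set, set_miss_counts, hot_threshold=1000):
    """spm_home_set: dict spm_block_id -> cache set currently holding it.
       set_miss_counts: dict cache_set -> misses observed this interval."""
    new_mapping = dict(spm_home_set)
    for block, s in spm_home_set.items():
        if set_miss_counts.get(s, 0) > hot_threshold:
            coldest = min(set_miss_counts, key=set_miss_counts.get)
            new_mapping[block] = coldest      # free the hot set for demand caching
    return new_mapping

print(remap_spm_blocks({0: 5, 1: 12}, {5: 4200, 12: 30, 40: 3}))  # -> {0: 40, 1: 12}
```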


design automation conference | 2011

A reuse-aware prefetching scheme for scratchpad memory

Jason Cong; Hui Huang; Chunyue Liu; Yi Zou

Scratchpad memory (SPM) has been utilized as a prefetch buffer in embedded systems and parallel architectures to hide memory access latency. However, the impact of reuse patterns on SPM prefetching has not been fully investigated. In this paper we quantify the impact of reuse on SPM prefetching efficiency and propose a reuse-aware SPM prefetching (RASP) scheme. The average performance and energy improvements are 15.9% and 22.0% over cache prefetching, 12.9% and 31.2% over prefetch-only SPM management, and 18.5% and 10% over DRDU [1] with SPM prefetching support.
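A minimal sketch of the reuse-aware selection idea, assuming per-block reuse estimates are available (the RASP scheme itself is not reproduced here):

```python
# Only prefetch a block into the SPM buffer when its expected reuse justifies
# the transfer; low-reuse blocks are left to the cache. The reuse estimates,
# slot count, and threshold are illustrative assumptions.
def select_prefetch_candidates(blocks, reuse_estimate, spm_slots, min_reuse=2):
    """blocks: iterable of block ids; reuse_estimate: dict block -> expected reuses."""
    reusable = [b for b in blocks if reuse_estimate.get(b, 0) >= min_reuse]
    # Prefer the most-reused blocks when SPM space is limited.
    reusable.sort(key=lambda b: reuse_estimate[b], reverse=True)
    return reusable[:spm_slots]

print(select_prefetch_candidates([10, 11, 12, 13],
                                 {10: 8, 11: 1, 12: 4, 13: 0},
                                 spm_slots=2))   # -> [10, 12]
```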


international symposium on low power electronics and design | 2012

Static and dynamic co-optimizations for blocks mapping in hybrid caches

Yu-Ting Chen; Jason Cong; Hui Huang; Chunyue Liu; Raghu Prabhakar; Glenn Reinman

In this paper, a combined static and dynamic scheme is proposed to optimize the block placement for endurance and energy-efficiency in a hybrid SRAM and STT-RAM cache. With the proposed scheme, STT-RAM endurance is maximized while performance is maintained. We use the compiler to provide static hints to guide initial data placement, and use the hardware to correct the hints based on the run-time cache behavior. Experimental results show that the combined scheme improves the endurance by 23.9x and 5.9x compared to pure static and pure dynamic optimizations respectively. Furthermore, the system energy can be reduced by 17% compared to pure dynamic optimization through minimizing STT-RAM writes.
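One way to picture the combined scheme, with the hint encoding and write threshold as assumptions rather than the paper's actual design:

```python
# Minimal sketch: the compiler hints write-intensive blocks toward SRAM ways to
# protect STT-RAM endurance, and the hardware overrides a hint when runtime
# write counts contradict it.
def choose_partition(static_hint, runtime_writes, write_threshold=64):
    """static_hint: 'SRAM' or 'STT-RAM' from the compiler; runtime_writes:
       writes observed for this block in the current interval."""
    if runtime_writes > write_threshold:
        return 'SRAM'       # write-hot at runtime: correct the hint
    if runtime_writes == 0 and static_hint == 'SRAM':
        return 'STT-RAM'    # hinted hot but actually cold: reclaim SRAM capacity
    return static_hint

print(choose_partition('STT-RAM', runtime_writes=200))  # -> 'SRAM'
```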


international conference on computer design | 2013

Accelerator-rich CMPs: From concept to real hardware

Yu-Ting Chen; Jason Cong; Mohammad Ali Ghodrat; Muhuan Huang; Chunyue Liu; Bingjun Xiao; Yi Zou

Application-specific accelerators provide a 10-100× improvement in power efficiency over general-purpose processors, and accelerator-rich architectures are especially promising. This work discusses a prototype of accelerator-rich CMPs (PARC). During our development of PARC in real hardware, we encountered a set of technical challenges and proposed corresponding solutions. First, we provided system IPs that serve a sea of accelerators to transfer data between user space and accelerator memories without cache overhead. Second, we designed a dedicated interconnect between accelerators and memories to enable memory sharing. Third, we implemented an accelerator manager to virtualize accelerator resources for users. Finally, we developed an automated flow with a number of IP templates and customizable interfaces to a C-based synthesis flow to enable rapid design and update of PARC. We implemented PARC in a Virtex-6 FPGA chip with integration of platform-specific peripherals and booting of unmodified Linux. Experimental results show that PARC can fully exploit the energy benefits of accelerators with little system overhead.
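As a loose illustration of the accelerator-manager role (not the PARC implementation), the sketch below virtualizes a small pool of physical accelerator instances behind a request/release interface; the accelerator type and instance counts are made up for the example.

```python
# Minimal sketch: tasks request an accelerator type, and the manager hands out a
# free physical instance or queues the request until one is released.
from collections import defaultdict, deque

class AcceleratorManager:
    def __init__(self, instances):
        self.free = {kind: deque(ids) for kind, ids in instances.items()}
        self.waiting = defaultdict(deque)

    def request(self, task, kind):
        if self.free[kind]:
            return self.free[kind].popleft()   # grant a physical instance
        self.waiting[kind].append(task)        # otherwise queue the task
        return None

    def release(self, kind, instance_id):
        if self.waiting[kind]:
            return (self.waiting[kind].popleft(), instance_id)  # hand off directly
        self.free[kind].append(instance_id)
        return None

mgr = AcceleratorManager({'fft': [0, 1]})
print(mgr.request('taskA', 'fft'), mgr.request('taskB', 'fft'), mgr.request('taskC', 'fft'))
print(mgr.release('fft', 0))   # taskC is granted the freed instance
```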


design automation conference | 2010

ACES: application-specific cycle elimination and splitting for deadlock-free routing on irregular network-on-chip

Jason Cong; Chunyue Liu; Glenn Reinman

Application-specific Network-on-Chip (NoC) in MPSoC designs often requires irregular topologies to optimize power and performance. However, efficient deadlock-free routing that avoids restricting critical routes and does not significantly increase power on irregular NoCs has remained an open problem. In this paper an application-specific cycle elimination and splitting (ACES) method is presented for this problem. Based on application-specific communication patterns, we propose a scalable algorithm using global optimization to eliminate as many channel dependency cycles as possible while ensuring shortest paths between heavily communicating nodes, and to split only the remaining small set of cycles (if any). Experimental results show that, compared to prior work, ACES can either reduce NoC power by 11%~35% while maintaining approximately the same network performance, or improve network performance by 10%~36% with a slight NoC power overhead (-5%~7%) on a wide range of examples.
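To illustrate cycle elimination on a channel dependency graph (a simplified stand-in for the ACES algorithm, which additionally preserves shortest paths and can split cycles), the sketch below repeatedly finds a cycle and forbids its lowest-demand edge; the graph and demand figures are assumptions.

```python
# Minimal sketch: DFS-based cycle detection over a channel dependency graph,
# then remove the least-used dependency from each cycle until the graph is acyclic.
def find_cycle(graph):
    """graph: dict node -> set of successors. Returns the edges of one cycle, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    stack = []

    def dfs(u):
        color[u] = GRAY
        stack.append(u)
        for v in graph[u]:
            if color[v] == GRAY:                       # back edge: cycle found
                i = stack.index(v)
                cyc = stack[i:] + [v]
                return list(zip(cyc, cyc[1:]))
            if color[v] == WHITE:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        color[u] = BLACK
        return None

    for n in graph:
        if color[n] == WHITE:
            found = dfs(n)
            if found:
                return found
    return None

def break_cycles(graph, demand):
    """Forbid the lowest-demand edge of every remaining cycle."""
    removed = []
    cycle = find_cycle(graph)
    while cycle:
        victim = min(cycle, key=lambda e: demand.get(e, 0))
        graph[victim[0]].discard(victim[1])
        removed.append(victim)
        cycle = find_cycle(graph)
    return removed

g = {0: {1}, 1: {2}, 2: {0}}
print(break_cycles(g, {(0, 1): 9, (1, 2): 5, (2, 0): 1}))  # -> [(2, 0)]
```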


field-programmable custom computing machines | 2009

Evaluation of Static Analysis Techniques for Fixed-Point Precision Optimization

Jason Cong; Karthik Gururaj; Bin Liu; Chunyue Liu; Zhiru Zhang; Sheng Zhou; Yi Zou

Precision analysis and optimization is very important when transforming a floating-point algorithm into a fixed-point hardware implementation. The core analysis techniques are based on either dynamic analysis or static analysis. We believe in static error analysis, as it is the only technique that can guarantee the desired worst-case accuracy. In this paper we study various underlying arithmetic candidates that can be used in static error analysis and compare their computed sensitivities. The approaches studied include Affine Arithmetic (AA), General Interval Arithmetic (GIA), and Automatic Differentiation (Symbolic Arithmetic). Our study shows that the symbolic method is preferred for expressions with higher-order cancellation. For programs without strong cancellation, any method works fairly well, and GIA slightly outperforms the others. We also study the impact of program transformations on these arithmetics.
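A small worked example of interval-style static bounding (simpler than the AA and GIA models studied in the paper) shows both the conservative worst-case guarantee and why cancellation-heavy expressions favor the symbolic method:

```python
# Every value carries a [lo, hi] range and operations widen it conservatively.
# Plain intervals cannot see that x - x is exactly 0, which is the kind of
# cancellation where symbolic analysis gives tighter bounds.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))
    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

x = Interval(1.0, 2.0)
print(x * x + x)   # [2.0, 6.0]: a guaranteed worst-case bound
print(x - x)       # [-1.0, 1.0]: overestimated, the exact result is always 0
```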


international symposium on low power electronics and design | 2012

BiN: a buffer-in-NUCA scheme for accelerator-rich CMPs

Jason Cong; Mohammad Ali Ghodrat; Michael Gill; Chunyue Liu; Glenn Reinman

As the number of on-chip accelerators grows rapidly to improve power efficiency, the buffer size required by accelerators drastically increases. Existing solutions allow the accelerators to share a common pool of buffers and/or allocate buffers in cache. In this paper we propose a Buffer-in-NUCA (BiN) scheme with the following contributions: (1) a dynamic interval-based global buffer allocation method to assign shared buffer space to the accelerators that can best utilize the additional buffer space, and (2) a flexible and low-overhead paged buffer allocation method to limit the impact of buffer fragmentation in a shared buffer, especially when allocating buffers in a non-uniform cache architecture (NUCA) with distributed cache banks. Experimental results show that, compared to two representative schemes from prior work, BiN improves performance by 32% and 35% and reduces energy by 12% and 29%, respectively.
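The paged-allocation idea can be sketched as follows; this is not BiN's allocator, and the page size and per-bank capacities are illustrative assumptions.

```python
# Minimal sketch: a buffer request is satisfied with fixed-size pages drawn from
# whichever NUCA banks have free pages, so a large request needs no contiguous region.
import math

PAGE_BYTES = 4096   # assumed buffer page size

def allocate_paged_buffer(request_bytes, free_pages_per_bank):
    """free_pages_per_bank: dict bank_id -> free pages. Returns
       {bank_id: pages_taken} or None if the request cannot be satisfied."""
    needed = math.ceil(request_bytes / PAGE_BYTES)
    if needed > sum(free_pages_per_bank.values()):
        return None
    grant = {}
    # Prefer banks with the most free pages to spread the allocation.
    for bank in sorted(free_pages_per_bank, key=free_pages_per_bank.get, reverse=True):
        take = min(needed, free_pages_per_bank[bank])
        if take:
            grant[bank] = take
            free_pages_per_bank[bank] -= take
            needed -= take
        if needed == 0:
            break
    return grant

print(allocate_paged_buffer(24 * 1024, {0: 3, 1: 5, 2: 1}))  # -> {1: 5, 0: 1}
```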


high performance embedded architectures and compilers | 2013

Stream arbitration: Towards efficient bandwidth utilization for emerging on-chip interconnects

Chunhua Xiao; M-C. Frank Chang; Jason Cong; Michael Gill; Zhangqin Huang; Chunyue Liu; Glenn Reinman; Hao Wu

Alternative interconnects are attractive for scaling on-chip communication bandwidth in a power-efficient manner. However, efficient utilization of the bandwidth provided by these emerging interconnects still remains an open problem due to the spatial and temporal communication heterogeneity. In this article, a Stream Arbitration scheme is proposed, where at runtime any source can compete for any communication channel of the interconnect to talk to any destination. We apply stream arbitration to radio frequency interconnect (RF-I). Experimental results show that compared to the representative token arbitration scheme, stream arbitration can provide an average 20% performance improvement and 12% power reduction.
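A minimal sketch of the competition idea (not the paper's protocol, which targets RF-I channel assignment in hardware): pending streams are granted channels in arrival order each cycle, and unsatisfied streams retry the next cycle; the channel count is an assumption.

```python
# Each cycle, pending (src, dst) streams compete for the interconnect's channels.
from collections import deque

NUM_CHANNELS = 4   # assumed number of RF-I channels

def arbitrate(pending_streams):
    """pending_streams: deque of (src, dst) requests, oldest first.
       Returns {channel: (src, dst)} grants for this cycle."""
    grants = {}
    for ch in range(NUM_CHANNELS):
        if not pending_streams:
            break
        grants[ch] = pending_streams.popleft()
    return grants

queue = deque([(0, 5), (3, 2), (7, 1), (4, 4), (6, 0)])
print(arbitrate(queue))   # four oldest streams granted; (6, 0) waits a cycle
print(list(queue))        # -> [(6, 0)]
```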

Collaboration


Dive into Chunyue Liu's collaborations.

Top Co-Authors

Jason Cong, University of California
Glenn Reinman, University of California
Yi Zou, University of California
Hui Huang, University of California
Bin Liu, University of California
Yu-Ting Chen, University of California
Adam Kaplan, University of California
Mishali Naik, University of California