
Publication


Featured research published by Shih-Lien Lu.


International Symposium on Microarchitecture | 2007

RAMP: Research Accelerator for Multiple Processors

John Wawrzynek; David A. Patterson; Mark Oskin; Shih-Lien Lu; Christoforos E. Kozyrakis; James C. Hoe; Derek Chiou; Krste Asanovic

The RAMP project's goal is to enable the intensive, multidisciplinary innovation that the computing industry will need to tackle the problems of parallel processing. RAMP itself is an open-source, community-developed, FPGA-based emulator of parallel architectures. Its design framework lets a large, collaborative community develop and contribute reusable, composable design modules. Three complete designs, for transactional memory, distributed systems, and distributed shared memory, demonstrate the platform's potential.


International Symposium on Microarchitecture | 2009

Improving cache lifetime reliability at ultra-low voltages

Zeshan Chishti; Alaa R. Alameldeen; Chris Wilkerson; Wei Wu; Shih-Lien Lu

Voltage scaling is one of the most effective mechanisms to reduce microprocessor power consumption. However, the increased severity of manufacturing-induced parameter variations at lower voltages limits voltage scaling to a minimum voltage, Vccmin, below which a processor cannot operate reliably. Memory cell failures in large memory structures (e.g., caches) typically determine the Vccmin for the whole processor. Memory failures can be persistent (i.e., failures at time zero, which cause yield loss) or non-persistent (e.g., soft errors or erratic bit failures). Both types of failures increase as supply voltage decreases, and both need to be addressed to achieve reliable operation at low voltages. In this paper, we propose a novel adaptive technique to improve cache lifetime reliability and enable low-voltage operation. This technique, multi-bit segmented ECC (MS-ECC), addresses both persistent and non-persistent failures. Like previous work on mitigating persistent failures, MS-ECC trades off cache capacity for lower voltages. However, unlike previous schemes, MS-ECC does not rely on testing to identify and isolate defective bits, and therefore enables error tolerance for non-persistent failures like erratic bits and soft errors at low voltages. Furthermore, MS-ECC's design can allow the operating system to adaptively change the cache size and ECC capability to adjust to system operating conditions. Compared to current designs with single-bit correction, the most aggressive implementation of MS-ECC enables a 30% reduction in supply voltage, reducing power by 71% and energy per instruction by 42%.
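The capacity-for-reliability trade behind segmented ECC can be illustrated with a toy sketch. The code below is a simplification under assumed parameters (the paper's actual codes and segment sizes differ, and all names here are made up): each segment of a line gets its own single-error-correcting Hamming(7,4) code, so a line split into two segments tolerates one bit flip per segment, i.e., two flips per line, where a single line-wide SEC code would fail.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit single-error-correcting codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]  # codeword positions 1..7

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3      # syndrome = position of the bad bit
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def protect_line(line, seg=4):
    """Split a cache line (bit list) into segments; encode each independently."""
    return [hamming74_encode(line[i:i + seg]) for i in range(0, len(line), seg)]

def recover_line(codewords):
    """Decode every segment; one flipped bit *per segment* is tolerated."""
    return [b for cw in codewords for b in hamming74_decode(cw)]
```

For example, flipping one bit in each of two segments still yields the original line after decoding, which a single whole-line SEC code could not guarantee.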


International Symposium on Microarchitecture | 2000

Performance improvement with circuit-level speculation

Tong Liu; Shih-Lien Lu

A current superscalar microprocessor's performance depends on its frequency and on the number of useful instructions it can process per cycle (IPC). In this paper we propose a method called approximation to reduce the logic delay of a pipe-stage. The basic idea of approximation is to implement a logic function partially instead of fully. Most of the time the partial implementation gives the same result as the full implementation, but with fewer gate delays, allowing a higher pipeline frequency. We apply this method to three logic blocks. Simulation results show that this method provides some performance improvement for a wide-issue superscalar processor if these stages are finely pipelined.
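A minimal software model of this idea, under assumed block sizes (not the paper's actual circuits), is an adder whose carries are computed only within small bit groups, never across them, so the carry chain and hence the logic depth shrink. In hardware a slower checker stage would catch the rare mispredictions; the sketch below models that check by comparing against the exact sum.

```python
def approx_add(a, b, width=16, block=4):
    """Approximate adder in the spirit of circuit-level speculation.
    Carries propagate only inside each `block`-bit group, so the
    critical path is that of a short adder.  Returns the approximate
    sum and whether it matches the exact (checker-stage) result."""
    mask = (1 << block) - 1
    s = 0
    for i in range(0, width, block):
        blk = (((a >> i) & mask) + ((b >> i) & mask)) & mask  # carry-out dropped
        s |= blk << i
    exact = (a + b) & ((1 << width) - 1)
    return s, s == exact
```

The prediction is wrong exactly when a carry would have crossed a block boundary, e.g. `0x000F + 1`, which is the rare case the full checker must repair.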


IEEE Hot Chips Symposium | 2006

Research accelerator for multiple processors

David A. Patterson; Arvind; Krste Asanovic; Derek Chiou; James C. Hoe; Christos Kozyrakis; Shih-Lien Lu; Mark Oskin; Jan M. Rabaey; John Wawrzynek

This article consists of a collection of slides from the authors' conference presentation on RAMP, the Research Accelerator for Multiple Processors. Some of the specific topics discussed include: system specifications and architecture; uniprocessor performance capabilities; RAMP hardware and description language features; RAMP applications development; storage capabilities; and future areas of technological development.


Design, Automation, and Test in Europe | 2010

Automatic pipelining from transactional datapath specifications

Eriko Nurvitadhi; James C. Hoe; Timothy Kam; Shih-Lien Lu

We present a transactional datapath specification (T-spec) and a tool (T-piper) to automatically synthesize an in-order pipelined implementation from it. T-spec abstractly views a datapath as executing one transaction at a time, computing next system states based on current ones. From a T-spec, T-piper can synthesize a pipelined implementation that preserves the original transaction semantics while allowing simultaneous execution of multiple overlapped transactions across pipeline stages. T-piper not only ensures the correctness of pipelined executions, but can also employ forwarding and speculation to minimize performance loss due to data dependencies. Design case studies on RISC and CISC processor pipeline development are reported.


International Symposium on Microarchitecture | 2009

Trading Off Cache Capacity for Low-Voltage Operation

Chris Wilkerson; Hongliang Gao; Alaa R. Alameldeen; Zeshan Chishti; Muhammad M. Khellah; Shih-Lien Lu

Two proposed techniques let microprocessors operate at low voltages despite high memory-cell failure rates. They identify and disable defective portions of the cache at two granularities: individual words or pairs of bits. Both techniques use the entire cache during high-voltage operation while sacrificing cache capacity during low-voltage operation to reduce the minimum voltage below 500 mV.
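A toy model of the word-granularity idea (an illustrative simplification with made-up names, not the paper's exact scheme): at low voltage, two physical lines with known defective words are paired, and each word slot of the combined logical line is served by whichever physical copy is defect-free, halving capacity in exchange for tolerating the defects.

```python
def pair_lines(defects_a, defects_b):
    """Given per-word defect flags (True = defective) for two paired
    physical lines, return, for each word slot, which line supplies it
    ('A' or 'B'), or None if some slot is defective in both copies,
    making the pair unusable at low voltage."""
    mapping = []
    for da, db in zip(defects_a, defects_b):
        if not da:
            mapping.append("A")
        elif not db:
            mapping.append("B")
        else:
            return None  # both copies bad: cannot repair this slot
    return mapping
```

During high-voltage operation the defect maps are simply ignored and both lines hold independent data, matching the full-capacity mode described above.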


International Symposium on Microarchitecture | 2015

Improving DRAM latency with dynamic asymmetric subarray

Shih-Lien Lu; Ying-Chen Lin; Chia-Lin Yang

The evolution of DRAM technology has been driven by capacity and bandwidth during the last decade. In contrast, DRAM access latency has stayed relatively constant and is trending upward. Much effort has been devoted to tolerating memory access latency, but these techniques have reached the point of diminishing returns. Shortening the bitlines and wordlines in a DRAM device reduces access latency, but doing so hurts array efficiency, and in the mainstream market manufacturers are not willing to trade capacity for latency. Prior work proposed hybrid-bitline DRAM designs to overcome this problem, but those methods are either intrusive to the circuit and layout of the DRAM design or offer no direct way to migrate data between the fast and slow levels. In this paper, we propose a novel asymmetric DRAM with the capability to perform low-cost data migration between subarrays. On top of this design we build a simple management mechanism and explore many management-related policies. We show that with this new design and our simple management technique we achieve 7.25% and 11.77% performance improvement in single- and multi-programmed workloads, respectively, over a system with traditional homogeneous DRAM. This gain is above 80% of the potential gain of a system based on a hypothetical DRAM made entirely of short bitlines.
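The management side can be sketched as a simple counter-driven hot-row promotion policy. This is a hypothetical illustration, not the paper's actual mechanism or parameters: rows accumulate access counts, and a row is migrated into the small fast subarray once it crosses a threshold, evicting the coldest fast row.

```python
class AsymmetricDram:
    """Toy model: a small fast subarray fronting a large slow one,
    with threshold-based migration (illustrative policy only)."""

    def __init__(self, fast_rows=4, threshold=3):
        self.fast = {}       # row -> access count, rows in the fast subarray
        self.counts = {}     # row -> access count, rows in the slow subarray
        self.fast_rows = fast_rows
        self.threshold = threshold

    def access(self, row):
        """Serve one access; return which latency class handled it."""
        if row in self.fast:
            self.fast[row] += 1
            return "fast"
        self.counts[row] = self.counts.get(row, 0) + 1
        if self.counts[row] >= self.threshold:
            if len(self.fast) >= self.fast_rows:
                coldest = min(self.fast, key=self.fast.get)
                self.counts[coldest] = self.fast.pop(coldest)  # demote
            self.fast[row] = self.counts.pop(row)              # promote
        return "slow"
```

The low-cost inter-subarray migration described above is what makes such a reactive policy viable: the swap itself must be cheap enough to repay within a few fast accesses.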


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

DArT: A Component-Based DRAM Area, Power, and Timing Modeling Tool

Hsiu-Chuan Shih; Pei-Wen Luo; Jen-Chieh Yeh; Shu-Yen Lin; Ding-Ming Kwai; Shih-Lien Lu; Andre Schaefer; Cheng-Wen Wu

DRAM renovation calls for a holistic architecture exploration to cope with bandwidth growth and latency-reduction needs. In this paper, we present DArT, a DRAM area, power, and timing modeling tool, for array assembly and interface customization. Through proper design abstraction, our component-based modeling approach provides increased flexibility and higher accuracy, making DArT suitable for DRAM architecture exploration and performance estimation. We validate the accuracy of DArT with respect to the physical layout and circuit simulation of an industrial 68 nm commodity DRAM device as a reference. The experimental results show that the maximum deviations from the reference design, in terms of area, timing, and power, are 3.2%, 4.92%, and 1.73%, respectively. For an architectural projection by porting it to a 45 nm process, the maximum deviations are 3.4%, 3.42%, and 8.57%, respectively. The combination of modeling performance, flexibility, and accuracy of DArT allows us to easily explore new DRAM architectures in the future, including 3-D stacked DRAM.
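The component-based approach can be sketched as composing per-component contributions for a chosen array assembly. The component names and figures below are invented for illustration only; they are not calibrated DArT data.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """One modeled building block of the DRAM array or interface."""
    name: str
    area_um2: float    # area of a single instance
    energy_pj: float   # energy per activation of a single instance

def assemble(components, counts):
    """Compose an assembly estimate from component instance counts.
    Returns (total area in um^2, energy per access in pJ)."""
    area = sum(c.area_um2 * counts[c.name] for c in components)
    energy = sum(c.energy_pj * counts[c.name] for c in components)
    return area, energy
```

Because each architecture variant is just a different set of instance counts (more, smaller mats for shorter bitlines, say), exploration reduces to re-running the composition, which is the flexibility the abstraction buys.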


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2011

Automatic Pipelining From Transactional Datapath Specifications

Eriko Nurvitadhi; James C. Hoe; Timothy Kam; Shih-Lien Lu

This paper presents a transactional specification framework (T-spec) for describing a datapath and the tool T-piper to synthesize automatically an in-order pipelined implementation with arbitrary user-specified pipeline-stage boundaries. T-spec abstractly views a datapath as executing one transaction at a time, computing the next system states based on the current ones. The synthesized pipeline maintains this semantics, yet allows concurrent execution of multiple overlapped transactions in different pipeline stages, where each stage performs a part of the next-state computation of each transaction. T-spec makes the state reading and writing events in a datapath explicit to enable T-piper to perform exact read-after-write (RAW) hazard analysis between the overlapped transactions. T-piper can automatically generate the pipeline control not only to ensure the correctness of the pipelined executions but also to minimize (using forwarding and speculation) the performance loss due to pipeline stalls in the presence of RAW dependencies. This paper reports design case studies applying T-spec and T-piper to reduced instruction set computing and complex instruction set computing processor pipeline development. In the latter, we report the results from a rapid design space exploration of 60 generated x86-subset pipelines, varying in pipeline depth, forwarding, and speculative execution, all starting from a single T-spec.
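The hazard-resolution decision can be sketched abstractly. The stage arithmetic below is a hypothetical model, not T-piper's actual analysis: for a reader and an older in-flight writer of the same state element, compare the writer's current stage with the stage where it writes the state back and the stage where the new value first becomes available.

```python
def raw_control(reader_stage, write_stage, ready_stage, distance):
    """Minimal RAW-control sketch (hypothetical stage model).
    A younger transaction reads a state element at `reader_stage`;
    an older transaction `distance` stages ahead writes that element
    at `write_stage`, with the new value first computed at
    `ready_stage` (<= write_stage when produced early).
    Returns 'no-hazard', 'forward', or 'stall'."""
    writer_pos = reader_stage + distance   # older transaction's current stage
    if writer_pos > write_stage:
        return "no-hazard"                 # write already retired to the state
    if ready_stage <= writer_pos:
        return "forward"                   # value exists in flight: bypass it
    return "stall"                         # value not yet computed: wait
```

Forwarding paths correspond to the 'forward' outcome; speculation, in this simplified view, replaces some 'stall' outcomes with a predicted value plus a later check.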


International Symposium on Low Power Electronics and Design | 2012

Reducing L1 caches power by exploiting software semantics

Zhen Fang; Li Zhao; Xiaowei Jiang; Shih-Lien Lu; Ravi R. Iyer; Tong Li; Seung Eun Lee

To access a set-associative L1 cache in a high-performance processor, all ways of the selected set are searched and fetched in parallel using physical address bits. Such a cache is oblivious of the software semantics of memory references, such as the stack-heap bifurcation of the memory space and user-kernel ring levels. This constitutes a waste of energy, since, e.g., a user-mode instruction fetch will never hit a cache block that contains kernel code. Similarly, a stack access will not hit a cacheline that contains heap data.

We propose to exploit software semantics in cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we utilize virtual memory region properties to optimize the data cache and ring-level information to optimize the instruction cache. Our design does not impact performance and incurs very small hardware cost. Simulation results using SPEC CPU and SPECjapps indicate that the proposed designs help to reduce cache block fetches from DL1 and IL1 by 27% and 57% respectively, resulting in average savings of 15% of DL1 power and more than 30% of IL1 power compared to an aggressively clock-gated baseline.
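The way-filtering idea can be sketched by tagging each way with a region attribute and probing only matching ways. This is a simplified illustration with invented names; the paper's actual mechanisms, such as deriving stack/heap from virtual-memory region properties, are richer.

```python
def lookup(set_ways, addr_tag, region):
    """Probe one set of a set-associative cache, fetching only ways
    whose stored region bit ('stack' or 'heap' here) matches the
    access.  Each way is a (tag, region, data) tuple.
    Returns (data or None, number of ways actually fetched)."""
    fetched = 0
    for tag, way_region, data in set_ways:
        if way_region != region:
            continue              # skip: a stack access never hits heap data
        fetched += 1              # this way's tag+data arrays are activated
        if tag == addr_tag:
            return data, fetched
    return None, fetched
```

The `fetched` count is the quantity the power savings track: every skipped way is a tag and data array that was never activated for that access.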

