Gabriel H. Loh | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Gabriel H. Loh is active.

Explore More

Publication

Featured researches published by Gabriel H. Loh.

international symposium on microarchitecture | 2006

Die Stacking (3D) Microarchitecture

Bryan Black; Murali Annavaram; Ned Brekelbaum; John P. Devale; Lei Jiang; Gabriel H. Loh; Don McCaule; Patrick Morrow; Donald W. Nelson; Daniel Pantuso; Paul Reed; Jeff Rupley; Sadasivan Shankar; John Paul Shen; Clair Webb

3D die stacking is an exciting new technology that increases transistor density by vertically integrating two or more die with a dense, high-speed interface. The result of 3D die stacking is a significant reduction of interconnect both within a die and across dies in a system. For instance, blocks within a microprocessor can be placed vertically on multiple die to reduce block to block wire distance, latency, and power. Disparate Si technologies can also be combined in a 3D die stack, such as DRAM stacked on a CPU, resulting in lower power higher BW and lower latency interfaces, without concern for technology integration into a single process flow. 3D has the potential to change processor design constraints by providing substantial power and performance benefits. Despite the promising advantages of 3D, there is significant concern for thermal impact. In this research, we study the performance advantages and thermal challenges of two forms of die stacking: Stacking a large DRAM or SRAM cache on a microprocessor and dividing a traditional micro architecture between two die in a stack

ACM Journal on Emerging Technologies in Computing Systems | 2006

Design space exploration for 3D architectures

Yuan Xie; Gabriel H. Loh; Bryan Black; Kerry Bernstein

As technology scales, interconnects have become a major performance bottleneck and a major source of power consumption for microprocessors. Increasing interconnect costs make it necessary to consider alternate ways of building modern microprocessors. One promising option is 3D architectures where a stack of multiple device layers with direct vertical tunneling through them are put together on the same chip. As fabrication of 3D integrated circuits has become viable, developing CAD tools and architectural techniques is imperative to explore the design space to 3D microarchitectures. In this article, we give a brief introduction to 3D integration technology, discuss the EDA design tools that can enable the adoption of 3D ICs, and present the implementation of various microprocessor components using 3D technology. An industrial case study is presented as an initial attempt to design 3D microarchitectures.

international symposium on microarchitecture | 2007

Processor Design in 3D Die-Stacking Technologies

Gabriel H. Loh; Yuan Xie; Bryan Black

Three-dimensional integration is an emerging fabrication technology that vertically stacks multiple integrated chips. The benefits include an increase in device density; much greater flexibility in routing signals, power, and clock; the ability to integrate disparate technologies; and the potential for new 3D circuit and microarchitecture organizations. This article provides a technical introduction to the technology and its impact on processor design. Although our discussions here primarily focus on high-performance processor design, most of the observations and conclusions apply to other microprocessor market segments.

international symposium on computer architecture | 2010

Use ECP, not ECC, for hard failures in resistive memories

Stuart E. Schechter; Gabriel H. Loh; Karin Strauss; Doug Burger

As leakage and other charge storage limitations begin to impair the scalability of DRAM, non-volatile resistive memories are being developed as a potential replacement. Unfortunately, current error correction techniques are poorly suited to this emerging class of memory technologies. Unlike DRAM, PCM and other resistive memories have wear lifetimes, measured in writes, that are sufficiently short to make cell failures common during a systems lifetime. However, resistive memories are much less susceptible to transient faults than DRAM. The Hamming-based ECC codes used in DRAM are designed to handle transient faults with no effective lifetime limits, but ECC codes applied to resistive memories would wear out faster than the cells they are designed to repair. This paper evaluates Error-Correcting Pointers (ECP), a new approach to error correction optimized for memories in which errors are the result of permanent cell failures that occur, and are immediately detectable, at write time. ECP corrects errors by permanently encoding the locations of failed cells into a table and assigning cells to replace them. ECP provides longer lifetimes than previously proposed solutions with equivalent overhead. Whats more, as the level of variance in cell lifetimes increases -- a likely consequence of further scalaing -- ECPs margin of improvement over existing schemes increases.

international symposium on computer architecture | 2009

PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Yuejian Xie; Gabriel H. Loh

Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.

great lakes symposium on vlsi | 2006

Thermal analysis of a 3D die-stacked high-performance microprocessor

Kiran Puttaswamy; Gabriel H. Loh

3-dimensional integrated circuit (3D IC) technology places circuit blocks in the vertical dimension in addition to the conventional horizontal plane. Compared to conventional planar ICs, 3D ICs have shorter latencies as well as lower power consumption, due to shorter wires. The benefits of 3D ICs increase as we stack more die, due to successive reductions in wire lengths. However, as we stack more die, the power density increases due to increasing proximity of active (heat generating) devices, thus causing the temperatures to increase. Also, the topmost die on the 3D stack are located further from the heat sink and experience a longer heat dissipation path. Prior research has already identified thermal management as a critical issue in 3D technology. In this paper, we evaluate the thermal impact of building high-performance microprocessors in 3D. We estimate the temperatures of a planar IC based on the Alpha 21364 processor as well as 2-die and 4-die 3D implementations of the same. We show that, compared to the planar IC, the 2-die implementation and 4-die implementation increase the maximum temperature by 17 Kelvin and 33 Kelvin, respectively.

high-performance computer architecture | 2007

Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors

Kiran Puttaswamy; Gabriel H. Loh

3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of thermal herding techniques that (1) reduces 3D power density and (2) locates a majority of the power on the top die closest to the heat sink. Our 3D/thermal-aware microarchitecture contributions include a significance-partitioned datapath that places the frequently switching 16-bits on the top die, a 3D-aware instruction scheduler allocation scheme, an address memorization approach for the load and store queues, a partial value encoding for the L1 data cache, and a branch target buffer that exploits a form of frequent partial value locality in target addresses. Compared to a conventional planar processor, our 3D processor achieves a 47.9% frequency increase which results in a 47.0% performance improvement (min 7%, max 77% on individual benchmarks), while simultaneously reducing total power by 20% (min 15%, max 30%). Without our thermal herding techniques, the worst-case 3D temperature increases by 17 degrees. With our thermal herding techniques, the temperature increase is only 12 degrees (29% reduction in the 3D worst-case temperature increase)

international solid-state circuits conference | 2012

3D-MAPS: 3D Massively parallel processor with stacked memory

Dae Hyun Kim; Krit Athikulwongse; Michael B. Healy; Mohammad M. Hossain; Moongon Jung; Ilya Khorosh; Gokul Kumar; Young-Joon Lee; Dean L. Lewis; Tzu-Wei Lin; Chang Liu; Shreepad Panth; Mohit Pathak; Minzhen Ren; Guanhao Shen; Taigon Song; Dong Hyuk Woo; Xin Zhao; Joungho Kim; Ho Choi; Gabriel H. Loh; Hsien-Hsin S. Lee; Sung Kyu Lim

Several recent works have demonstrated the benefits of through-silicon-via (TSV) based 3D integration, but none of them involves a fully functioning multicore processor and memory stacking. 3D-MAPS (3D Massively Parallel Processor with Stacked Memory) is a two-tier 3D IC, where the logic die consists of 64 general-purpose processor cores running at 277MHz, and the memory die contains 256KB SRAM. Fabrication is done using 130nm GlobalFoundries device technology and Tezzaron TSV and bonding technology. Packaging is done by Amkor. This processor contains 33M transistors, 50K TSVs, and 50K face-to-face connections in 5x5mm2 footprint. The chip runs at 1.5V and consumes up to 4W, resulting in 16W/cm2 power density. The core architecture is developed from scratch to benefit from single-cycle access to SRAM.

international conference on computer design | 2005

Implementing caches in a 3D technology for high performance processors

Kiran Puttaswamy; Gabriel H. Loh

3D integration is an emergent technology that has the potential to greatly increase device density while simultaneously providing faster on-chip communication. 3D fabrication involves stacking two or more die connected with a very high-density and low-latency interface. The die-to-die vias that comprise this interface can be treated like regular on-chip metal due to their small size (on the order of l/spl mu/m) and high speed (sub-F04 die-to-die communication delay). The increased device density and the ability to place and route in the third dimension provide new opportunities for microarchitecture design. In this paper, we first present a brief overview of 3D integration technology. We then focus on the design of on-chip caches using 3D integration. In particular, we show that the dense die-to-die vias enable caches that are 3D-partitioned at the level of individual wordlines or bitlines. This results in a wire length reduction within SRAM arrays, and a reduction in the footprint of individual SRAM banks, which reduces the global routing from the edge of the cache to the banks and back. The wire length reduction provides both power and performance benefits, e.g., 21.5% latency reduction and 30.9% energy reduction for a 512KB cache. We also report that implementing only the caches in 3D, without accounting for possible benefits from implementing other components of the processor in 3D, results in a 12% IPC gain. These results demonstrate some of the potential of this new technology, and motivate further research in 3D microarchitectures.

international symposium on performance analysis of systems and software | 2009

Zesto: A cycle-level simulator for highly detailed microarchitecture exploration

Gabriel H. Loh; Samantika Subramaniam; Yuejian Xie

For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as those that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, microcode lookup overhead for long/complex x86 instructions).

Explore More