Kiran Puttaswamy | Researchain

Publication

Featured research published by Kiran Puttaswamy.

Great Lakes Symposium on VLSI | 2006

Thermal analysis of a 3D die-stacked high-performance microprocessor

Kiran Puttaswamy; Gabriel H. Loh

3-dimensional integrated circuit (3D IC) technology places circuit blocks in the vertical dimension in addition to the conventional horizontal plane. Compared to conventional planar ICs, 3D ICs have shorter latencies as well as lower power consumption, due to shorter wires. The benefits of 3D ICs increase as we stack more die, due to successive reductions in wire lengths. However, as we stack more die, the power density increases due to increasing proximity of active (heat generating) devices, thus causing the temperatures to increase. Also, the topmost die on the 3D stack are located further from the heat sink and experience a longer heat dissipation path. Prior research has already identified thermal management as a critical issue in 3D technology. In this paper, we evaluate the thermal impact of building high-performance microprocessors in 3D. We estimate the temperatures of a planar IC based on the Alpha 21364 processor as well as 2-die and 4-die 3D implementations of the same. We show that, compared to the planar IC, the 2-die implementation and 4-die implementation increase the maximum temperature by 17 Kelvin and 33 Kelvin, respectively.

High-Performance Computer Architecture | 2007

Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors

Kiran Puttaswamy; Gabriel H. Loh

3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of thermal herding techniques that (1) reduces 3D power density and (2) locates a majority of the power on the top die closest to the heat sink. Our 3D/thermal-aware microarchitecture contributions include a significance-partitioned datapath that places the frequently switching 16 bits on the top die, a 3D-aware instruction scheduler allocation scheme, an address memoization approach for the load and store queues, a partial value encoding for the L1 data cache, and a branch target buffer that exploits a form of frequent partial value locality in target addresses. Compared to a conventional planar processor, our 3D processor achieves a 47.9% frequency increase which results in a 47.0% performance improvement (min 7%, max 77% on individual benchmarks), while simultaneously reducing total power by 20% (min 15%, max 30%). Without our thermal herding techniques, the worst-case 3D temperature increases by 17 degrees. With our thermal herding techniques, the temperature increase is only 12 degrees (29% reduction in the 3D worst-case temperature increase).
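The 29% figure quoted at the end of this abstract follows directly from the two temperature increases it reports. A minimal Python sketch of that arithmetic is below; the function name `relative_reduction` is an illustrative assumption and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Worst-case temperature increases over the planar design, as quoted in the abstract.
increase_without_herding = 17.0  # degrees
increase_with_herding = 12.0     # degrees

reduction = relative_reduction(increase_without_herding, increase_with_herding)
print(f"{reduction:.1%}")  # 29.4%
```

Note that the percentage applies to the temperature increase over the planar baseline, not to the absolute chip temperature.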

International Conference on Computer Design | 2005

Implementing caches in a 3D technology for high performance processors

Kiran Puttaswamy; Gabriel H. Loh

3D integration is an emergent technology that has the potential to greatly increase device density while simultaneously providing faster on-chip communication. 3D fabrication involves stacking two or more die connected with a very high-density and low-latency interface. The die-to-die vias that comprise this interface can be treated like regular on-chip metal due to their small size (on the order of 1 μm) and high speed (sub-FO4 die-to-die communication delay). The increased device density and the ability to place and route in the third dimension provide new opportunities for microarchitecture design. In this paper, we first present a brief overview of 3D integration technology. We then focus on the design of on-chip caches using 3D integration. In particular, we show that the dense die-to-die vias enable caches that are 3D-partitioned at the level of individual wordlines or bitlines. This results in a wire length reduction within SRAM arrays, and a reduction in the footprint of individual SRAM banks, which reduces the global routing from the edge of the cache to the banks and back. The wire length reduction provides both power and performance benefits, e.g., 21.5% latency reduction and 30.9% energy reduction for a 512KB cache. We also report that implementing only the caches in 3D, without accounting for possible benefits from implementing other components of the processor in 3D, results in a 12% IPC gain. These results demonstrate some of the potential of this new technology, and motivate further research in 3D microarchitectures.

IEEE Transactions on Computers | 2009

3D-Integrated SRAM Components for High-Performance Microprocessors

Kiran Puttaswamy; Gabriel H. Loh

3D integration is an emergent technology that has the potential to greatly increase device density while simultaneously providing faster on-chip communication. 3D fabrication involves stacking two or more die connected with a very high-density and low-latency interface. The die-to-die vias that comprise this interface can be treated as regular on-chip metal due to their small size (on the order of 1 μm) and high speed (sub-FO4 die-to-die communication delay). The increased device density and the ability to place and route in the third dimension provide new opportunities for microarchitecture design. In this paper, we focus on the 3D-integrated designs of SRAM structures. We show that the dense die-to-die vias enable 3D-integrated SRAM components that are partitioned at the level of individual wordlines or bitlines. This results in a wire length reduction within SRAM arrays, and a reduction in the area footprint, which reduces the wires required for global routing. The wire length reduction provides simultaneous latency and energy reduction benefits, e.g., 47 percent latency reduction and 18 percent energy reduction for a 4 MB 4-die stacked 3D SRAM array. A 3D implementation of a 128-entry multiported SRAM array achieves a 36 percent latency improvement with a simultaneous energy reduction of 55 percent. As planar designs adopt high-performance techniques such as hierarchical wordlines to improve performance, 3D integration provides even larger benefits, making it a desirable technology for high-performance designs. For the 4 MB SRAM array, the 3D-integrated designs provide additional latency reduction benefit over the planar designs when hierarchical wordlines are implemented in both planar and 3D designs.

International Symposium on Circuits and Systems | 2006

The impact of 3-dimensional integration on the design of arithmetic units

Kiran Puttaswamy; Gabriel H. Loh

3-dimensional integration technology stacks multiple die on top of each other with a dense die-to-die interface. This enables a circuit designer to replace long wires with short vertical interconnects, thus reducing wire-related delay and power consumption. In this research, we evaluate the impact of a 3D fabrication technology on the latency and power of arithmetic functional units. Specifically, we study integer adders and shifters as they have very different delay characteristics. An adder's critical path latency is dominated by logic/gate delays, while a shifter's latency is more greatly affected by wire delay. We demonstrate that the potential benefits of a 3D technology are the greatest when applied to wire-bound circuits. In particular, a barrel shifter implemented in 3D exhibits a 9% reduction in latency with a simultaneous 8% reduction in energy.

Great Lakes Symposium on VLSI | 2006

Dynamic instruction schedulers in a 3-dimensional integration technology

Kiran Puttaswamy; Gabriel H. Loh

We present the design of high-performance and energy-efficient dynamic instruction schedulers in a 3-dimensional integration technology. Based on a previous observation that the critical path latency of a conventional dynamic scheduler is greatly affected by wire delay, we propose 3D-integrated scheduler designs by partitioning a conventional scheduler across multiple vertically-stacked die. The die-stacked organization reduces the lengths of critical wires, thus reducing both latency and energy. Our simulation results show that a 20-entry (120-entry) instruction scheduler implemented in a 2-die stack achieves a 9% (19%) reduction in latency with simultaneous energy reduction as compared to a conventional planar design. The benefits are even larger when the instruction scheduler is implemented on a 4-die stack, with the corresponding latency reductions being 12% (32%).

Compilers, Architecture, and Synthesis for Embedded Systems | 2001

The emerging power crisis in embedded processors: what can a poor compiler do?

Lakshmi N. Chakrapani; Vincent John Mooney; Krishna V. Palem; Kiran Puttaswamy; Weng-Fai Wong

It is widely acknowledged that even as VLSI technology advances, there is a looming crisis that is an important obstacle to the widespread deployment of mobile embedded devices, namely that of power. This problem can be tackled at many levels, such as devices, logic, operating systems, micro-architecture, and compiler. While there have been various proposals for specific compiler optimizations for power, there has not been any attempt to systematically map out the space for possible improvements. In this paper, we quantitatively characterize the limits of what a compiler can do in optimizing for power using precise modeling of a state-of-the-art embedded processor in conjunction with a robust compiler. We provide insights into how compiler optimizations interact with the internal workings of a processor from the perspective of power consumption. The goal is to point out the promising and not so promising directions of work in this area, to guide the future compiler designer.

IEEE Computer Society Annual Symposium on VLSI | 2006

Implementing register files for high-performance microprocessors in a die-stacked (3D) technology

Kiran Puttaswamy; Gabriel H. Loh

3D integration is a new technology that greatly increases transistor density while providing faster on-chip communication. 3D integration stacks multiple die connected with a very high-density and low-latency interface which provides increased device density and the ability to place and route in the third dimension. While past studies have explored 3D integrated on-chip caches, this research explores the implementation of register files, which have very different capacity and bandwidth requirements. Partitioning the register file across multiple die reduces the lengths of many critical wires, which provides both latency and energy benefits. In particular, a 3D implementation of a 256-entry physical register file in a two-die stack achieves a 24.1% latency improvement with a simultaneous energy reduction of 58.5%, while a four-die version achieves a 36.0% latency improvement with a 58.2% energy reduction. Our results demonstrate that 3D integration is a promising approach for improving both the performance and power of wire-dominated circuits.

Languages, Compilers, and Tools for Embedded Systems | 2002

Design space optimization of embedded memory systems via data remapping

Krishna V. Palem; Rodric M. Rabbah; Vincent John Mooney; Kiran Puttaswamy

In this paper, we provide a novel compile-time data remapping algorithm that runs in linear time. This remapping algorithm is the first fully automatic approach applicable to pointer-intensive dynamic applications. We show that data remapping can be used to significantly reduce the energy consumed as well as the memory size needed to meet a user-specified performance goal (i.e., execution time) -- relative to the same application executing without being remapped. These twin advantages afforded by a remapped program -- reduced cache size and energy needs -- constitute a key step in a framework for design space exploration: for any given performance goal, remapping allows the user to reduce the primary and secondary cache size by 50%, yielding a concomitant energy savings of 57%. Additionally, viewed as a compiler optimization for a fixed processor, we show that remapping improves the energy consumed by the cache subsystem by 25%. All of the above savings are in the context of the cache subsystem in isolation. We also show that remapping yields an average 20% energy saving for an ARM-like processor and cache subsystem. All of our improvements are achieved in the context of DIS, Olden and SPEC2000 pointer-centric benchmarks.

International Symposium on Systems Synthesis | 2002

System level power-performance trade-offs in embedded systems using voltage and frequency scaling of off-chip buses and memory

Kiran Puttaswamy; Kyu-won Choi; Jun Cheol Park; Vincent John Mooney; Abhijit Chatterjee; Peeter Ellervee

In embedded systems, off-chip buses and memory (i.e., L2 memory as opposed to the L1 memory, which is usually on-chip cache) consume significant power, often more than the processor itself. In this paper, for the case of an embedded system with one processor chip and one memory chip, we propose frequency and voltage scaling of the off-chip buses and the memory chip and use a known micro-architectural enhancement called a store buffer to reduce the resulting impact on execution time. Our benchmarks show a system (processor + off-chip bus + off-chip memory) power savings of 28% to 36%, an energy savings of 13% to 35%, all while increasing the execution time in the range of 1% to 29%. Previous work in power-aware computing has focused on frequency and voltage scaling of the processors or selective power-down of sub-sets of off-chip memory chips. This paper quantitatively explores voltage/frequency scaling of off-chip buses and memory as a means of trading off performance for power/energy at the system level in embedded systems.
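The abstract reports power, energy, and execution-time changes separately. The three are linked because energy is average power multiplied by execution time. The Python sketch below illustrates that relationship under the assumption that the power figure is an average over the run. The function name and the input values are hypothetical; they are not paired per-benchmark results from the paper.

```python
def energy_savings(power_savings: float, time_increase: float) -> float:
    """Fractional energy savings implied by a fractional average-power
    savings and a fractional execution-time increase (energy = power * time)."""
    energy_ratio = (1.0 - power_savings) * (1.0 + time_increase)
    return 1.0 - energy_ratio


# Hypothetical operating point, chosen only for illustration.
example = energy_savings(power_savings=0.30, time_increase=0.10)
print(f"{example:.0%}")  # 23%
```

Under this model, a slowdown eats into the power savings: energy is saved only while the execution-time increase stays below `power_savings / (1 - power_savings)`.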
