
Publications


Featured research published by Ramakrishnan Rajamony.


IEEE Computer | 2004

Power and energy management for server systems

Ricardo Bianchini; Ramakrishnan Rajamony

This survey shows that heterogeneous server clusters can be made more efficient by conserving power and energy while exploiting information from the service level, such as request priorities established by service-level agreements.
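To make the survey's central idea concrete, here is a minimal sketch of SLA-aware cluster sizing: keep just enough servers powered on to serve high-priority demand with headroom, and let low-priority traffic absorb queueing on the remaining capacity. All names, thresholds, and the 25% headroom factor are illustrative assumptions, not anything prescribed by the paper.

```python
import math

# Hypothetical sketch of SLA-aware cluster power management: concentrate
# load on as few servers as the service level allows, power down the rest.
def servers_to_keep_on(num_servers: int, demand_rps: float,
                       priority_share: float, capacity_rps: float) -> int:
    """Return how many servers should stay powered on.

    demand_rps     -- total observed request rate
    priority_share -- fraction of requests with strict SLAs (0..1)
    capacity_rps   -- request rate one server can sustain
    """
    # High-priority requests must meet their SLA, so provision them with
    # headroom; best-effort traffic can tolerate queueing on what remains.
    priority_load = demand_rps * priority_share * 1.25   # 25% headroom
    best_effort_load = demand_rps * (1.0 - priority_share)
    needed = math.ceil((priority_load + best_effort_load) / capacity_rps)
    return min(max(needed, 1), num_servers)
```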


international symposium on computer architecture | 2009

Hybrid cache architecture with disparate memory technologies

Xiaoxia Wu; Jian Li; Lixin Zhang; Evan Speight; Ramakrishnan Rajamony; Yuan Xie

Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level or cache-region-based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.
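The RHCA idea lends itself to a small behavioral model. The sketch below is an assumption-laden simplification of a fast/slow region cache with hot-line promotion; the region sizes, promotion threshold, and replacement policy are invented here and do not reflect the paper's hardware design.

```python
from collections import OrderedDict

class RhcaLevel:
    """Toy model of one cache level split into a small fast region
    (e.g., SRAM) and a larger dense slow region (e.g., MRAM/PRAM)."""

    def __init__(self, fast_lines=4, slow_lines=16, promote_after=2):
        self.fast = OrderedDict()            # low-latency region, LRU order
        self.slow = OrderedDict()            # high-density region, LRU order
        self.fast_lines, self.slow_lines = fast_lines, slow_lines
        self.promote_after = promote_after   # hits before intra-cache move
        self.hits = {}                       # per-line hit counters

    def access(self, addr):
        if addr in self.fast:
            self.fast.move_to_end(addr)
            return "fast-hit"
        if addr in self.slow:
            self.slow.move_to_end(addr)
            self.hits[addr] = self.hits.get(addr, 0) + 1
            if self.hits[addr] >= self.promote_after:
                self._promote(addr)          # hot line moves to fast region
            return "slow-hit"
        self._fill(addr)                     # miss: fill into slow region
        return "miss"

    def _promote(self, addr):
        self.slow.pop(addr)
        if len(self.fast) >= self.fast_lines:
            victim, _ = self.fast.popitem(last=False)   # evict LRU fast line
            if len(self.slow) >= self.slow_lines:
                self.slow.popitem(last=False)
            self.slow[victim] = True                    # demote into slow region
        self.fast[addr] = True
        self.hits.pop(addr, None)

    def _fill(self, addr):
        if len(self.slow) >= self.slow_lines:
            self.slow.popitem(last=False)
        self.slow[addr] = True
```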


international conference on supercomputing | 2002

Critical power slope: understanding the runtime effects of frequency scaling

Akihiko Miyoshi; Charles R. Lefurgy; Eric Van Hensbergen; Ramakrishnan Rajamony; Raj Rajkumar

Energy efficiency is becoming an increasingly important feature for both mobile and high-performance server systems. Most processors designed today include power management features that provide processor operating points which can be used in power management algorithms. However, existing power management algorithms implicitly assume that lower performance points are more energy efficient than higher performance points. Our empirical observations indicate that for many systems, this assumption is not valid. We introduce a new concept called critical power slope to explain and capture the power-performance characteristics of systems with power management features. We evaluate three systems: a clock-throttled Pentium laptop, a frequency-scaled PowerPC platform, and a voltage-scaled system, to demonstrate the benefits of our approach. Our evaluation is based on empirical measurements of the first two systems, and publicly available data for the third. Using critical power slope, we explain why on the Pentium-based system it is energy efficient to run only at the highest frequency, while on the PowerPC-based system it is energy efficient to run at the lowest frequency point. We confirm our results by measuring the behavior of a web-serving benchmark. Furthermore, we extend the critical power slope concept to understand the benefits of voltage scaling when combined with frequency scaling. We show that in some cases, it may be energy efficient not to reduce voltage below a certain point.
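The reasoning behind the critical power slope can be reconstructed from the abstract as a short energy comparison; the notation below is chosen for this sketch and may differ from the paper's symbols.

```latex
% A fixed workload of C cycles must complete within an interval T;
% time not spent computing is spent idling at power P_idle.
% Energy at operating frequency f with active power P(f):
\[
  E(f) \;=\; P(f)\,\frac{C}{f} \;+\; P_{\text{idle}}\Bigl(T - \frac{C}{f}\Bigr)
\]
% A lower operating point f_2 < f_1 saves energy iff E(f_2) < E(f_1),
% which simplifies to
\[
  \frac{P(f_2) - P_{\text{idle}}}{f_2} \;<\; \frac{P(f_1) - P_{\text{idle}}}{f_1}
\]
% The right-hand side is the slope of the line from (0, P_idle) to
% (f_1, P(f_1)): the critical power slope. Scaling frequency down wins
% only when the system's measured power-versus-frequency slope is steeper
% than this value, which is consistent with the Pentium system favoring
% the highest frequency and the PowerPC system the lowest.
```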


conference on high performance computing (supercomputing) | 2005

On the Feasibility of Optical Circuit Switching for High Performance Computing Systems

Kevin J. Barker; Alan F. Benner; Raymond R. Hoare; Adolfy Hoisie; Darren J. Kerbyson; Dan Li; Rami G. Melhem; Ramakrishnan Rajamony; Eugen Schenfeld; Shuyi Shao; Craig B. Stunkel; Peter A. Walker

The interconnect plays a key role in both the cost and performance of large-scale HPC systems. The cost of future high-bandwidth electronic interconnects mushrooms due to the expensive optical transceivers needed between electronic switches. We describe a potentially cheaper and more power-efficient approach to building high-performance interconnects. Through empirical analysis of HPC applications, we find that the bulk of inter-processor communication (barring collectives) is bounded in degree and changes very slowly or never. Thus we propose a two-network interconnect: an Optical Circuit Switching (OCS) network built from optical switches that handles long-lived bulk data transfers, and a secondary lower-bandwidth Electronic Packet Switching (EPS) network. An OCS could be significantly cheaper, as it uses fewer optical transceivers than an electronic network. Collectives and transient communication packets traverse the electronic network. We present compiler techniques and dynamic run-time policies that exploit this two-network interconnect. Simulation results show that our approach provides high performance at low cost.
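The run-time side of such a policy reduces to classifying communication by volume and stability. The following sketch is a hedged illustration of that classification; the thresholds and the FlowStats record are assumptions for exposition, not the paper's mechanism.

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    bytes_sent: int      # cumulative volume between this src/dst pair
    epochs_seen: int     # consecutive monitoring epochs with traffic
    is_collective: bool  # collectives always use the electronic network

def choose_network(flow: FlowStats,
                   volume_threshold: int = 64 << 20,   # 64 MiB, illustrative
                   stability_epochs: int = 3) -> str:
    """Map a flow to the optical circuit network or the packet network."""
    if flow.is_collective:
        return "EPS"                   # electronic packet switching
    if (flow.bytes_sent >= volume_threshold
            and flow.epochs_seen >= stability_epochs):
        return "OCS"                   # long-lived bulk transfer: hold a circuit
    return "EPS"                       # transient traffic stays electronic
```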


high performance interconnects | 2010

The PERCS High-Performance Interconnect

L. Baba Arimilli; Ravi Kumar Arimilli; Vicente Enrique Chung; Scott Douglas Clark; Wolfgang E. Denzel; Ben C. Drerup; Torsten Hoefler; Jody B. Joyner; Jerry Don Lewis; Jian Li; Nan Ni; Ramakrishnan Rajamony

The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity high-performance computing system. A major innovation in the PERCS design is the network, which is built using Hub chips that are integrated into the compute nodes. Each Hub chip is about 580 mm² in size, uses 45 nm IBM CMOS 12S0 SOI technology with 13 levels of metal, has over 3700 signal I/Os, and is packaged in a module that also contains LGA-attached optical electronic devices. The Hub module implements five types of high-bandwidth interconnects with multiple links that are fully connected by a high-performance internal crossbar switch. These links provide over 9 Tbit/s of raw bandwidth and are used to construct a two-level direct-connect topology spanning up to tens of thousands of POWER7 chips with high bisection bandwidth and low latency. The Blue Waters system, being constructed at NCSA, is an exemplar large-scale PERCS installation and is expected to deliver sustained petascale performance over a wide range of applications. The Hub chip supports several high-performance computing protocols (e.g., MPI, RDMA, IP) and also provides a non-coherent system-wide global address space. Collective communication operations such as barriers, reductions, and multicast are supported directly in hardware. Multiple routing modes, including deterministic as well as hardware-directed random routing, are also supported. Finally, the Hub module is capable of operating in the presence of many types of hardware faults and gracefully degrades performance in the presence of lane failures.
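The two routing modes mentioned above can be illustrated with a toy model of a direct-connect topology. Everything in this sketch, including the group-based route representation, is an assumption for exposition; it is not the Hub chip's actual routing logic.

```python
import random

def direct_route(src_group: int, dst_group: int) -> list:
    # Deterministic mode: one hop on the inter-group link that directly
    # connects the two groups in a fully connected direct topology.
    return [src_group, dst_group]

def random_indirect_route(src_group: int, dst_group: int,
                          num_groups: int) -> list:
    # Randomized mode: bounce through a random intermediate group,
    # trading an extra hop for load balance on adversarial patterns.
    candidates = [g for g in range(num_groups)
                  if g not in (src_group, dst_group)]
    if not candidates:
        return direct_route(src_group, dst_group)
    return [src_group, random.choice(candidates), dst_group]
```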


international symposium on computer architecture | 2005

Adaptive Mechanisms and Policies for Managing Cache Hierarchies in Chip Multiprocessors

Evan Speight; Hazim Shafi; Lixin Zhang; Ramakrishnan Rajamony

With the ability to place large numbers of transistors on a single silicon chip, manufacturers have begun developing chip multiprocessors (CMPs) containing multiple processor cores, varying amounts of level 1 and level 2 caching, and on-chip directory structures for level 3 caches and memory. The level 3 cache may be used as a victim cache for both modified and clean lines evicted from on-chip level 2 caches. Efficient area and performance management of this cache hierarchy is paramount given the projected increase in access latency to off-chip memory. This paper proposes simple architectural extensions and adaptive policies for managing the L2 and L3 cache hierarchy in a CMP system. In particular, we evaluate two mechanisms that improve cache effectiveness. First, we propose the use of a small history table to provide hints to the L2 caches as to which lines are resident in the L3 cache. We employ this table to eliminate some unnecessary clean write backs to the L3 cache, reducing pressure on the L3 cache and utilization of the on-chip bus. Second, we examine the performance benefits of allowing write backs from L2 caches to be placed in neighboring, on-chip L2 caches rather than forcing them to be absorbed by the L3 cache. This not only reduces the capacity pressure on the L3 cache but also makes subsequent accesses faster, since L2-to-L2 cache transfers typically have lower latencies than accesses to a large L3 cache array. We evaluate the performance improvement of these two designs, and their combined effect, on four commercial workloads and observe a reduction in overall execution time of up to 13%.
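The first mechanism, the history table that filters redundant clean write-backs, can be sketched as follows. The table organization here (a bounded set with coarse eviction) is a deliberate simplification chosen for this example; real hardware would use an indexed, tagged structure.

```python
class L3HintTable:
    """Toy model of a hint table tracking lines believed resident in L3."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.resident = set()        # line addresses believed resident in L3

    def note_l3_fill(self, line_addr: int) -> None:
        # Record that a line was placed in the L3 victim cache.
        if len(self.resident) >= self.capacity:
            self.resident.pop()      # coarse eviction; real HW indexes by tag
        self.resident.add(line_addr)

    def should_write_back_clean(self, line_addr: int) -> bool:
        # On a clean L2 eviction: if the hint says the line is already in
        # L3, the clean write-back is redundant and can be skipped, saving
        # L3 capacity pressure and on-chip bus bandwidth.
        return line_addr not in self.resident
```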


ACM Transactions on Architecture and Code Optimization | 2010

Design exploration of hybrid caches with disparate memory technologies

Xiaoxia Wu; Jian Li; Lixin Zhang; Evan Speight; Ramakrishnan Rajamony; Yuan Xie

Traditional multilevel SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power-performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this article, we propose to take advantage of the best characteristics that each technology has to offer through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter-cache-level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra-cache-level or cache-region-based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 6% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 30 workloads. A more aggressive RHCA-based design provides 10% IPC improvement over the baseline. A 2-layer 3D cache stack (3DHCA) of high-density memory technology within the same chip footprint gives 16% IPC improvement over the baseline. We also achieve up to a 72% reduction in power consumption over a baseline SRAM-only design. Energy-delay and thermal evaluation for 3DHCA are also presented. In addition to the fast-slow region-based RHCA, we further evaluate read-write region-based RHCA designs.


IBM Journal of Research and Development | 2011

PERCS: The IBM POWER7-IH high-performance computing system

Ramakrishnan Rajamony; L. B. Arimilli; Kevin J. Gildea

In 2001, the Defense Advanced Research Projects Agency called for the creation of commercially viable computing systems that would not only perform at very high levels but also be highly productive. The forthcoming POWER7®-IH system, known as the Productive, Easy-to-use, Reliable Computing System (PERCS), was IBM's response to this challenge. Compared with state-of-the-art high-performance computing systems in existence today, PERCS has very high performance and productivity goals and achieves them through tight integration of computing, networking, storage, and software. This paper describes the PERCS hardware and software, along with many of the design decisions that went into its creation.


IBM Journal of Research and Development | 2001

Experience with building a commodity intel-based ccNUMA system

Bishop Brock; Gary D. Carpenter; E. Chiprout; Mark Edward Dean; P. L. De Backer; Elmootazbellah Nabil Elnozahy; Hubertus Franke; Mark E. Giampapa; David Brian Glasco; James L. Peterson; Ramakrishnan Rajamony; R. Ravindran; Freeman L. Rawson; Ronald Lynn Rockhold; Juan C. Rubio



IBM Journal of Research and Development | 2016

IBM Bluemix Mobile Cloud Services

Ahmed Gheith; Ramakrishnan Rajamony; Patrick J. Bohrer; Kanak B. Agarwal; Michael Kistler; B. L. White Eagle; C. A. Hambridge; John B. Carter; Todd E. Kaplinger

