Publications


Featured research published by Srihari Makineni.


Measurement and Modeling of Computer Systems | 2007

QoS policies and architecture for cache/memory in CMP platforms

Ravi R. Iyer; Li Zhao; Fei Guo; Ramesh Illikkal; Srihari Makineni; Donald Newell; Yan Solihin; Lisa R. Hsu; Steven K. Reinhardt

As we enter the era of CMP platforms with multiple threads/cores on the die, the diversity of the simultaneous workloads running on them is expected to increase. The rapid deployment of virtualization as a means to consolidate workloads onto a single platform is a prime example of this trend. In such scenarios, the quality of service (QoS) that each individual workload receives from the platform can vary widely depending on the behavior of the simultaneously running workloads. While the number of cores assigned to each workload can be controlled, today's platforms offer no hardware or software support for controlling the allocation of platform resources such as cache space and memory bandwidth to individual workloads. In this paper, we propose a QoS-enabled memory architecture for CMP platforms that addresses this problem. The QoS-enabled memory architecture provides more cache resources (i.e., space) and memory resources (i.e., bandwidth) to high-priority applications based on guidance from the operating environment. The architecture also allows dynamic resource reassignment at run-time to further optimize the performance of the high-priority application with minimal degradation to low-priority applications. To achieve these goals, we describe the hardware/software support required in the platform as well as in the operating environment (OS and virtual machine monitor). Our evaluation framework consists of detailed platform simulation models and a QoS-enabled version of Linux. Based on evaluation experiments, we show the effectiveness of a QoS-enabled architecture and summarize the key findings and trade-offs.
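The central mechanism here, biasing shared cache space toward high-priority applications, can be illustrated with a toy way-partitioning model. This is a minimal sketch under assumed parameters (the QoSCache class, the two priority classes, and their 6/2 way split are illustrative, not the paper's implementation):

```python
from collections import OrderedDict

class QoSCache:
    """Toy way-partitioned shared cache: each priority class owns a fixed
    quota of ways per set and evicts only within its own quota."""
    def __init__(self, num_sets=64, line=64, quota=None):
        self.num_sets, self.line = num_sets, line
        self.quota = quota or {"high": 6, "low": 2}   # ways per priority class
        self.sets = [{p: OrderedDict() for p in self.quota}
                     for _ in range(num_sets)]

    def access(self, addr, prio):
        s = self.sets[(addr // self.line) % self.num_sets][prio]
        tag = addr // (self.line * self.num_sets)
        if tag in s:                      # hit: refresh LRU position
            s.move_to_end(tag)
            return True
        if len(s) >= self.quota[prio]:    # miss: evict an LRU line, but only
            s.popitem(last=False)         # from this class's own ways
        s[tag] = True
        return False

cache = QoSCache()
cache.access(0x1000, "high")              # high-priority fill
cache.access(0x2000, "low")               # low-priority fill, capped at 2 ways
```

A real design would also throttle memory bandwidth per class and let the operating environment adjust the quotas at run-time, as the abstract describes.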


IEEE Computer | 2004

TCP onloading for data center servers

G. Regnier; Srihari Makineni; R. Illikkal; Ravishankar Iyer; David B. Minturn; R. Huggahalli; Donald Newell; L. Cline; A. Foong

To meet the increasing networking needs of server workloads, servers are starting to offload packet processing to peripheral devices to achieve TCP/IP acceleration. Researchers at Intel Labs have experimented with alternative solutions that improve the server's ability to process TCP/IP packets efficiently and at very high rates.


High-Performance Computer Architecture | 2010

CHOP: Adaptive filter-based DRAM caching for CMP server platforms

Xiaowei Jiang; Niti Madan; Li Zhao; Mike Upton; Ravishankar R. Iyer; Srihari Makineni; Donald Newell; Yan Solihin; Rajeev Balasubramonian

As manycore architectures enable a large number of cores on the die, a key challenge that emerges is the availability of memory bandwidth with conventional DRAM solutions. To address this challenge, the integration of large DRAM caches that provide as much as 5× higher bandwidth and as little as one-third the latency of conventional DRAM is very promising. However, organizing and implementing a large DRAM cache is challenging because of two primary trade-offs: (a) DRAM caches at cache-line granularity require an on-chip tag area so large as to be undesirable, and (b) DRAM caches at larger page granularity require too much bandwidth because the miss rate does not drop enough to overcome the bandwidth increase. In this paper, we propose CHOP (Caching HOt Pages) in DRAM caches to address these challenges. We study several filter-based DRAM caching techniques: (a) a filter cache (CHOP-FC) that profiles pages and determines the hot subset of pages to allocate into the DRAM cache, (b) a memory-based filter cache (CHOP-MFC) that spills and fills filter state to improve accuracy and reduce the size of the filter cache, and (c) an adaptive DRAM caching technique (CHOP-AFC) that determines when the filter cache should be enabled and disabled for DRAM caching. We conduct detailed simulations with server workloads to show that our filter-based DRAM caching techniques achieve the following: (a) on average over 30% performance improvement over previous solutions, (b) several orders of magnitude lower area overhead in tag space than cache-line-based DRAM caches, and (c) significantly lower memory bandwidth consumption than page-granular DRAM caches.
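A rough sketch of the filter-cache idea, in the spirit of CHOP-FC: count accesses per page and allocate a page into the DRAM cache only once it crosses a hotness threshold, so cold pages never pay the page-fill bandwidth cost. The HotPageFilter class, the threshold, and the capacity below are illustrative assumptions, not the paper's actual structures:

```python
# Pages are promoted into the (page-granular) DRAM cache only after
# proving themselves hot in a small, finite filter cache.
PAGE = 4096

class HotPageFilter:
    def __init__(self, threshold=32, capacity=1024):
        self.threshold, self.capacity = threshold, capacity
        self.counts = {}          # page -> access count (the filter cache)
        self.dram_cache = set()   # pages resident in the DRAM cache

    def access(self, addr):
        page = addr // PAGE
        if page in self.dram_cache:
            return "dram-cache hit"
        self.counts[page] = self.counts.get(page, 0) + 1
        if self.counts[page] >= self.threshold:
            self.dram_cache.add(page)          # promote the hot page
            del self.counts[page]
            return "allocated to DRAM cache"
        if len(self.counts) > self.capacity:   # filter cache is finite:
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]           # drop the coldest entry
        return "filtered (cold page)"
```

CHOP-MFC's spill/fill of filter state and CHOP-AFC's enable/disable adaptivity would layer on top of this basic promote-on-threshold loop.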


International Conference on Parallel Architectures and Compilation Techniques | 2007

CacheScouts: Fine-Grain Monitoring of Shared Caches in CMP Platforms

Li Zhao; Ravi R. Iyer; Ramesh Illikkal; Jaideep Moses; Srihari Makineni; Donald Newell

As multi-core architectures flourish in the marketplace, multi-application workload scenarios (such as server consolidation) are growing rapidly. When running multiple applications simultaneously on a platform, it has been shown that contention for shared platform resources such as the last-level cache can severely degrade performance and quality of service (QoS). But today's platforms lack the capability to monitor shared cache usage accurately and disambiguate its effects on the performance behavior of each individual application. In this paper, we investigate low-overhead mechanisms for fine-grain monitoring of shared cache usage along three vectors: (a) occupancy - how much space is being used and by whom, (b) interference - how much contention is present and who is being affected, and (c) sharing - how threads are cooperating. We propose the CacheScouts monitoring architecture, consisting of novel tagging (software-guided monitoring IDs) and sampling mechanisms (set sampling), to achieve shared cache monitoring on a per-application basis at low overhead (<0.1%) and with very little loss of accuracy (<5%). We also present case studies to show how CacheScouts can be used by operating systems (OS) and virtual machine monitors (VMMs) for (a) characterizing execution profiles, (b) optimizing scheduling for performance management, (c) providing QoS and (d) metering for chargeback.
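The set-sampling idea can be sketched as follows; the monitoring-ID names, the sampling ratio, and the cache geometry are assumptions for illustration, not the CacheScouts hardware:

```python
# Lines carry a software-assigned monitoring ID; only 1-in-N sets are
# tracked, and per-ID counts are scaled up to estimate occupancy across
# the whole cache.
import random

NUM_SETS, WAYS, SAMPLE_EVERY = 4096, 16, 32

sampled = {s: [None] * WAYS for s in range(0, NUM_SETS, SAMPLE_EVERY)}

def on_fill(set_idx, way, mon_id):
    """Called on a cache fill; records the owner ID in sampled sets only."""
    if set_idx in sampled:
        sampled[set_idx][way] = mon_id

def estimated_occupancy(mon_id, line_bytes=64):
    hits = sum(ways.count(mon_id) for ways in sampled.values())
    return hits * SAMPLE_EVERY * line_bytes   # scale sample to whole cache

# Simulate fills from two applications, app_a filling twice as often.
for _ in range(100_000):
    on_fill(random.randrange(NUM_SETS), random.randrange(WAYS),
            random.choice(["app_a", "app_a", "app_b"]))
print(estimated_occupancy("app_a"), estimated_occupancy("app_b"))
```

Tracking a sampled subset of sets is what keeps the overhead low: only a small fraction of fills touch monitoring state, yet the scaled estimate stays close to true occupancy.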


ACM SIGARCH Computer Architecture News | 2005

Exploring the cache design space for large scale CMPs

Lisa R. Hsu; Ravishankar R. Iyer; Srihari Makineni; Steven K. Reinhardt; Donald Newell

With the advent of dual-core chips in the marketplace, small-scale CMP (chip multiprocessor) architectures are becoming commonplace. We expect a continuing trend of increasing the number of cores on a die to maximize the performance/power efficiency of a single chip. We believe an era of large-scale CMPs (LCMPs) with several tens to hundreds of cores is on the way, but as of now architects have little understanding of how best to build a cache hierarchy given such a large number of cores/threads to support. With this in mind, our initial goals are to prune the cache design space for LCMPs by characterizing basic server workload behavior in such an environment. In this paper, we describe the range of methodologies that we are developing to overcome the challenges of exploring the cache design space for LCMP platforms. We then focus on employing a trace-driven approach to characterizing one key server workload (OLTP) in both a homogeneous and a heterogeneous workload environment. We study the effect of increasing thread count (from 1 to 128) on a three-level cache hierarchy, with emphasis on the second- and third-level caches. We vary the sizes of these cache levels and show the effects of threads contending for cache space, of prefetching instruction addresses, and of inclusion. We make initial observations and conclusions about the factors on which LCMP cache hierarchy design decisions should be based and discuss future work.
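A toy version of such a trace-driven sweep, using a synthetic trace and fully associative LRU models in place of the paper's detailed simulators (all sizes are illustrative):

```python
# Replay one address trace through a two-level (L2/L3) LRU model at
# several capacities and report miss rates, a miniature form of the
# design-space pruning described above.
import random
from collections import OrderedDict

def run(trace, l2_lines, l3_lines):
    l2, l3 = OrderedDict(), OrderedDict()
    l2_miss = l3_miss = 0
    for addr in trace:
        line = addr // 64
        if line in l2:
            l2.move_to_end(line); continue
        l2_miss += 1
        if line in l3:
            l3.move_to_end(line)
        else:
            l3_miss += 1
            l3[line] = True
            if len(l3) > l3_lines: l3.popitem(last=False)
        l2[line] = True                      # L3 hit or miss fills L2
        if len(l2) > l2_lines: l2.popitem(last=False)
    return l2_miss / len(trace), l3_miss / len(trace)

trace = [random.randrange(1 << 24) for _ in range(200_000)]
for l3_size in (1 << 12, 1 << 14, 1 << 16):  # sweep L3 from 256 KiB to 4 MiB
    print(l3_size, run(trace, 1 << 9, l3_size))
```

Replacing the synthetic trace with per-thread OLTP traces and adding set-associativity, inclusion policies, and prefetching would recover the shape of the study the abstract describes.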


First International Workshop on Virtualization Technology in Distributed Computing (VTDC 2006) | 2006

Characterization of network processing overheads in Xen

Padma Apparao; Srihari Makineni; Don Newell

I/O virtualization techniques developed recently have led to significant changes in network processing. These techniques require network packets to go through additional layers of processing, and these additional layers introduce significant overheads, so it is important to understand the performance implications of this additional processing on network (TCP/IP) processing. Our goals in this paper are to measure network I/O performance in a Xen virtualized environment and to provide a detailed architectural characterization of network processing, highlighting the major sources of overhead and their impact. We study two modes of I/O virtualization: 1) running the I/O service VM along with the guest on the same CPU, and 2) running the I/O service VM on a separate CPU. We measure TCP/IP processing performance in these two modes and compare it to that of a native Linux machine. Our measurements show that both Rx and Tx performance suffer by more than 50% in the virtualized environment, and that path length increases to 3 to 4 times that of native processing. Most of this overhead comes from the Xen VMM layer and Dom0 VM processing. Our data also show that running the Dom0 VM on a separate CPU is more expensive than running both Dom0 and the guest VM on the same CPU. We provide a detailed characterization of this additional processing, which we hope will help the Xen community focus on the right areas for optimization.


High-Performance Computer Architecture | 2009

Optimizing communication and capacity in a 3D stacked reconfigurable cache hierarchy

Niti Madan; Li Zhao; Naveen Muralimanohar; Aniruddha N. Udipi; Rajeev Balasubramonian; Ravishankar R. Iyer; Srihari Makineni; Donald Newell

Cache hierarchies in future many-core processors are expected to grow in size and contribute a large fraction of overall processor power and performance. In this paper, we postulate a 3D chip design that stacks SRAM and DRAM upon processing cores and employs OS-based page coloring to minimize horizontal communication of cache data. We then propose a heterogeneous reconfigurable cache design that takes advantage of the high density of DRAM and the superior power/delay characteristics of SRAM to efficiently meet the working set demands of each individual core. Finally, we analyze the communication patterns for such a processor and show that a tree topology is an ideal fit that significantly reduces the power and latency requirements of the on-chip network. The above proposals are synergistic: each proposal is made more compelling because of its combination with the other innovations described in this paper. The proposed reconfigurable cache model improves performance by up to 19% along with 48% savings in network power.
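The page-coloring component can be sketched in a few lines: the cache bank a page maps to is derived from low bits of its physical page number, so an allocator can prefer pages that land in the requesting core's locally stacked bank and thereby minimize horizontal traffic. The bit width and the alloc_page helper below are illustrative assumptions, not the paper's OS implementation:

```python
# OS page coloring for bank locality: a page's "color" picks the cache
# bank it maps to; the allocator prefers pages colored for the local bank.
BANK_BITS = 4                              # 16 banks (illustrative)

def color(page_number):
    return page_number & ((1 << BANK_BITS) - 1)

def alloc_page(free_pages, core_bank):
    """Prefer a free page whose color matches the core's local bank."""
    for i, page in enumerate(free_pages):
        if color(page) == core_bank:
            return free_pages.pop(i)
    return free_pages.pop(0)               # fall back to any free page

pool = list(range(64))
print([alloc_page(pool, 3) for _ in range(4)])   # -> [3, 19, 35, 51]
```

Keeping each core's working set in its own stacked bank is what lets the tree-topology network in the paper carry so little horizontal traffic.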


International Symposium on Performance Analysis of Systems and Software | 2005

Anatomy and Performance of SSL Processing

Li Zhao; Ravi R. Iyer; Srihari Makineni; Laxmi N. Bhuyan

A wide spectrum of e-commerce (B2B/B2C), banking, financial trading and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communications - privacy, integrity and authentication. Though it is well understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms used. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric and hashing) during session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5 and SHA-1). We identify the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc.) and determine where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper will be useful to performance analysts and processor architects seeking to accelerate SSL performance in future processors.
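A miniature version of this timing methodology can be reproduced with the standard library for the two hash algorithms in the paper's list; the symmetric and asymmetric ciphers would need a third-party crypto library, so they are omitted from this sketch:

```python
# Measure bulk hashing throughput for MD5 and SHA-1, two of the
# algorithms the paper profiles during SSL data transfer.
import hashlib
import time

payload = b"x" * (1 << 20)            # 1 MiB of data

for name in ("md5", "sha1"):
    h = hashlib.new(name)
    start = time.perf_counter()
    for _ in range(100):              # hash 100 MiB in total
        h.update(payload)
    h.digest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {100 / elapsed:.1f} MiB/s")
```

Extending the same loop over block ciphers and RSA handshake operations, then attributing each phase of a captured SSL session to its algorithm, is the essence of the per-operation breakdown the abstract describes.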


High-Performance Computer Architecture | 2004

Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor

Srihari Makineni; Ravi R. Iyer

A majority of current and next-generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications is heavily dependent on efficient TCP/IP packet processing within the termination nodes. This dependency becomes even greater as the bandwidth needs of these applications grow from 100 Mbps to 1 Gbps to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) understanding the performance behavior of the various modes of TCP/IP processing, (b) analyzing the underlying architectural characteristics of TCP/IP packet processing and (c) quantifying the computational requirements of the TCP/IP packet processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet processing performance on Intel's state-of-the-art low-power Pentium® M microprocessor running the Microsoft Windows Server 2003 operating system. Some of our key observations are: (i) the mode of TCP/IP operation can significantly affect the performance requirements, (ii) transmit-side processing is largely compute-intensive, whereas receive-side processing is more memory-bound, and (iii) the computational requirements for sending/receiving packets can form a substantial component (28% to 40%) of commercial server workloads. From our analysis, we also discuss architectural as well as stack-related improvements that can help achieve higher server network throughput and improved application performance.
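For a sense of how such stack-processing measurements can be approached, here is a minimal loopback microbenchmark. It is a rough stand-in only, not the paper's methodology: loopback bypasses the NIC and driver, so it approximates only the software stack-processing component.

```python
# Push a fixed volume through a local TCP connection and report the
# achieved throughput; pairing this with CPU counters would separate
# transmit-side from receive-side cost.
import socket
import threading
import time

VOLUME, CHUNK = 1 << 28, 1 << 16      # 256 MiB in 64 KiB sends

def sink(conn):
    received = 0
    while received < VOLUME:
        received += len(conn.recv(CHUNK))

srv = socket.create_server(("127.0.0.1", 0))
t = threading.Thread(target=lambda: sink(srv.accept()[0]))
t.start()

cli = socket.create_connection(srv.getsockname())
buf = b"x" * CHUNK
start = time.perf_counter()
sent = 0
while sent < VOLUME:
    sent += cli.send(buf)
t.join()
print(f"{VOLUME / (time.perf_counter() - start) / 1e6:.0f} MB/s over loopback")
```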


International Symposium on Microarchitecture | 2006

Molecular Caches: A caching structure for dynamic creation of application-specific heterogeneous cache regions

Keshavan Varadarajan; S. K. Nandy; Vishal Sharda; Amrutur Bharadwaj; Ravi R. Iyer; Srihari Makineni; Donald Newell

CMPs enable simultaneous execution of multiple applications on the same platform, where those applications share cache resources. Diversity in the cache access patterns of these simultaneously executing applications can trigger inter-application interference, leading to cache pollution. Whereas a large cache can ameliorate this problem, the larger power consumption that comes with increasing cache size, amplified at sub-100 nm technologies, makes this solution prohibitive. In this paper, in order to address power-aware cache performance, we propose a caching structure that provides the following: 1) definition of application-specific cache partitions as aggregations of caching units (molecules), where the parameters of each molecule, namely size, associativity and line size, are chosen so that its power consumption and access time are optimal for the given technology; 2) application-specific resizing of cache partitions with variable and adaptive associativity per cache line, way size and variable line size; and 3) a replacement policy that is transparent to the partition in terms of size, heterogeneity in associativity and line size. Through simulation studies we establish the superiority of molecular caches (caches built as aggregations of molecules), which offer a 29% power advantage over an equivalently performing traditional cache.
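The aggregation idea can be sketched as below; the Molecule and Partition classes, their parameters, and the page-granularity interleaving are all illustrative assumptions rather than the paper's design:

```python
# A partition is an aggregation of small caching units ("molecules"),
# each with its own line size and associativity, so partitions can be
# resized per application by adding or removing molecules.
from collections import OrderedDict

class Molecule:
    def __init__(self, sets=16, ways=2, line=64):
        self.sets, self.ways, self.line = sets, ways, line
        self.data = [OrderedDict() for _ in range(sets)]

    def access(self, addr):
        s = self.data[(addr // self.line) % self.sets]
        tag = addr // (self.line * self.sets)
        if tag in s:
            s.move_to_end(tag); return True      # hit: refresh LRU
        if len(s) >= self.ways:
            s.popitem(last=False)                # evict LRU within molecule
        s[tag] = True
        return False

class Partition:
    """Application-specific partition: addresses are interleaved across
    its molecules at page granularity (an assumed mapping)."""
    def __init__(self, molecules):
        self.molecules = molecules
    def access(self, addr):
        return self.molecules[(addr >> 12) % len(self.molecules)].access(addr)

# e.g. a streaming app gets a few wide-line molecules, while a
# pointer-chasing app gets many small-line, higher-associativity ones.
streaming = Partition([Molecule(line=128) for _ in range(2)])
latency   = Partition([Molecule(line=32, ways=4) for _ in range(8)])
```

Because each molecule is sized so that its power and access time are optimal for the technology, resizing a partition is just a matter of changing how many molecules it aggregates.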
