Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Shekhar Srikantaiah is active.

Publication


Featured research published by Shekhar Srikantaiah.


Architectural Support for Programming Languages and Operating Systems | 2008

Adaptive set pinning: managing shared caches in chip multiprocessors

Shekhar Srikantaiah; Mahmut T. Kandemir; Mary Jane Irwin

As part of the trend towards Chip Multiprocessors (CMPs) for the next leap in computing performance, many architectures have explored sharing the last level of cache among different processors for better performance-cost ratio and improved resource allocation. Shared cache management is a crucial CMP design aspect for the performance of the system. This paper first presents a new classification of cache misses - CII: Compulsory, Inter-processor and Intra-processor misses - for CMPs with shared caches to provide a better understanding of the interactions between memory transactions of different processors at the level of shared cache in a CMP. We then propose a novel approach, called set pinning, for eliminating inter-processor misses and reducing intra-processor misses in a shared cache. Furthermore, we show that an adaptive set pinning scheme improves over the benefits obtained by the set pinning scheme by significantly reducing the number of off-chip accesses. Extensive analysis of these approaches with SPEComp 2001 benchmarks is performed using a full system simulator. Our experiments indicate that the set pinning scheme achieves an average improvement of 22.18% in the L2 miss rate while the adaptive set pinning scheme reduces the miss rates by an average of 47.94% as compared to the traditional shared cache scheme. They also improve the performance by 7.24% and 17.88% respectively.
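
The set-pinning idea can be made concrete with a toy model. Below is a minimal, hypothetical sketch (not the paper's implementation): each shared-cache set is pinned to the first core that misses on it, the owner alone may evict from that set, and other cores' conflicting allocations go to a small processor-owned private (POP) region. The class name, POP capacity, and direct-mapped layout are illustrative assumptions.

    # Hypothetical sketch; assumes a direct-mapped shared cache for brevity.
    class SetPinnedCache:
        def __init__(self, num_sets, pop_capacity):
            self.owner = [None] * num_sets                 # core that pinned each set
            self.sets = [dict() for _ in range(num_sets)]  # per-set tag store
            self.pop = {}                                  # (core, tag) -> data
            self.pop_capacity = pop_capacity

        def access(self, core, addr):
            s, tag = addr % len(self.sets), addr // len(self.sets)
            if tag in self.sets[s]:
                return "hit"
            if (core, tag) in self.pop:
                return "hit (POP)"
            if self.owner[s] is None:
                self.owner[s] = core                       # first miss pins the set
            if self.owner[s] == core:
                self.sets[s][tag] = addr                   # only the owner evicts here
                return "intra-processor miss"
            if len(self.pop) >= self.pop_capacity:
                self.pop.pop(next(iter(self.pop)))         # naive FIFO-style eviction
            self.pop[(core, tag)] = addr                   # inter-processor miss absorbed
            return "miss served via POP"

    cache = SetPinnedCache(num_sets=4, pop_capacity=8)
    print(cache.access(0, 12), cache.access(1, 12))        # owner allocates; core 1 then hits

Because non-owners never evict from a pinned set, the inter-processor misses of the CII classification disappear by construction.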


International Symposium on Microarchitecture | 2009

SHARP control: controlled shared cache management in chip multiprocessors

Shekhar Srikantaiah; Mahmut T. Kandemir; Qian Wang

Shared resources in chip multiprocessors (CMPs) pose unique challenges to the seamless adoption of CMPs in virtualization environments and high performance computing systems. While sharing resources like the on-chip last level cache is generally beneficial due to increased resource utilization, lack of control over the management of these resources can lead to loss of determinism, weakened performance isolation, and an overall lack of any notion of quality of service (QoS) provided to individual applications. This has direct ramifications on adhering to service level agreements in environments involving consolidation of multiple heterogeneous workloads. Although providing QoS in the presence of shared resources has been addressed in the literature, it has been commonly observed that reservation of resources for QoS leads to under-utilization of resources. This paper proposes the use of formal control theory for dynamically partitioning the shared last level cache in CMPs by optimizing last level cache space utilization among multiple concurrently executing applications with well-defined service level objectives. The advantage of using formal feedback control lies in the theoretical guarantee we can provide about maximizing the utilization of the cache space in a fair manner. Using feedback control, we demonstrate that our fair speedup improvement scheme regulates cache allocation to applications dynamically such that we achieve a high fair speedup (a global performance fairness metric). We also propose an adaptive, feedback-control-based cache partitioning scheme that achieves service differentiation among various applications with minimal impact on the fair speedup. Extensive simulations using a full system simulator with accurate timing models and a set of diverse multiprogrammed workloads show that our fair speedup improvement scheme achieves a 21.9% improvement on the fair speedup metric across various benchmarks, and our service differentiation scheme achieves well regulated service differentiation.
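
A minimal sketch of one control step, under stated assumptions: a simple integral-style controller nudges each application's (fractional) way allocation toward its service-level objective, then rescales so allocations fit the cache. The function name, gain, and IPC targets are hypothetical; the paper's formal controller is more elaborate.

    # Hypothetical sketch of one feedback step for shared-cache partitioning.
    def repartition(total_ways, alloc, measured, targets, gain=0.5):
        # Integral-style update: move each allocation by gain * (target - measured).
        desired = [a + gain * (t - m) for a, m, t in zip(alloc, measured, targets)]
        desired = [max(1.0, d) for d in desired]     # every app keeps at least one way
        scale = total_ways / sum(desired)            # enforce the capacity constraint
        return [d * scale for d in desired]          # fractional ways, for simplicity

    alloc = [8.0, 8.0]                               # 16-way cache shared by two apps
    alloc = repartition(16, alloc, measured=[0.6, 1.1], targets=[0.9, 1.0])
    print(alloc)                                     # the app missing its SLO gains ways

Invoked once per sampling interval, a loop of this shape moves toward allocations that meet the stated objectives while keeping the cache fully utilized, the kind of property the paper argues formal control can guarantee.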


Measurement and Modeling of Computer Systems | 2011

METE: meeting end-to-end QoS in multicores through system-wide resource management

Akbar Sharifi; Shekhar Srikantaiah; Asit K. Mishra; Mahmut T. Kandemir; Chita R. Das

Management of shared resources in emerging multicores for achieving predictable performance has received considerable attention in recent times. In general, almost all these approaches attempt to guarantee a certain level of performance QoS (weighted IPC, harmonic speedup, etc.) by managing a single shared resource or at most a couple of interacting resources. A fundamental shortcoming of these approaches is the lack of coordination between these shared resources to satisfy a system-level QoS. This is undesirable because providing end-to-end QoS in future multicores is essential for supporting widespread adoption of these architectures in virtualized servers and cloud computing systems. An initial step towards such end-to-end QoS support in multicores is to ensure that at least the major computational and memory resources on-chip are managed efficiently in a coordinated fashion. In this paper, we propose METE, a platform for end-to-end on-chip resource management in multicore processors. Assuming that each application specifies a performance target/SLA, the main objective of METE is to dynamically provision sufficient on-chip resources to applications for achieving the specified targets. METE employs a feedback-based system, designed as a Single-Input, Multiple-Output (SIMO) controller with an Auto-Regressive-Moving-Average (ARMA) model, to capture the behaviors of different applications. We evaluate a specific implementation of METE that manages cores, shared caches, and off-chip bandwidth in an integrated manner on 8 and 16 core systems, using a detailed full system simulator and workloads derived from the SPECOMP and SPECJBB multithreaded benchmarks. The collected results indicate that our proposed scheme is able to provision shared resources among co-runner applications dynamically over the course of execution, providing end-to-end QoS and satisfying the specified performance targets. Furthermore, the control-theoretic, multi-layer resource provisioning provides analytical assurance of these QoS guarantees.
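
As a rough illustration of the SIMO/ARMA idea, the sketch below predicts an application's performance from its current performance plus three resource inputs, then scales the whole resource vector toward the SLA target. All coefficients, bounds, and units here are invented for illustration; METE's actual model and actuation are richer.

    # Hypothetical first-order ARMA-style model: y(k+1) = a*y(k) + b . u(k),
    # where u = (cores, cache ways, bandwidth units). Coefficients are made up.
    a, b = 0.5, (0.04, 0.02, 0.01)

    def provision(y, u, target):
        predicted = a * y + sum(bi * ui for bi, ui in zip(b, u))
        factor = target / predicted if predicted > 0 else 1.0
        # Crude SIMO actuation: scale all three resources by the shortfall.
        return tuple(min(16.0, max(1.0, ui * factor)) for ui in u)

    u = (4, 8, 4)                          # current cores, ways, bandwidth units
    print(provision(0.6, u, target=0.9))   # resources grow until the SLA is met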


IEEE International Conference on High Performance Computing, Data, and Analytics | 2010

CPM in CMPs: Coordinated Power Management in Chip-Multiprocessors

Asit K. Mishra; Shekhar Srikantaiah; Mahmut T. Kandemir; Chita R. Das

Multiple clock domain architectures have recently been proposed to alleviate the power problem in CMPs by having different frequency/voltage values assigned to each domain based on workload requirements. However, accurate allocation of power to these voltage/frequency islands based on time varying workload characteristics as well as controlling the power consumption at the provisioned power level is quite non-trivial. Toward this end, we propose a two-tier feedback-based control theoretic solution. Our first-tier consists of a global power manager that allocates power targets to individual islands based on the workload dynamics. The power consumptions of these islands are in turn controlled by a second-tier, consisting of local controllers that regulate island power using dynamic voltage and frequency scaling in response to workload requirements.
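
The two tiers compose naturally, as in this hypothetical sketch: the global manager splits the chip budget across islands in proportion to demand, and each local controller picks the fastest DVFS state that fits its share. The state table and demand numbers are assumptions, not values from the paper.

    # Hypothetical (frequency GHz, power W) DVFS states for one island.
    DVFS_STATES = [(0.8, 5.0), (1.0, 8.0), (1.2, 12.0), (1.4, 17.0)]

    def global_allocate(chip_budget, demands):
        total = sum(demands)
        return [chip_budget * d / total for d in demands]   # demand-proportional split

    def local_dvfs(island_budget):
        # Fastest state whose power fits the island's allocated budget.
        feasible = [s for s in DVFS_STATES if s[1] <= island_budget]
        return feasible[-1] if feasible else DVFS_STATES[0]

    budgets = global_allocate(40.0, demands=[3.0, 1.0, 2.0])  # three islands
    print([local_dvfs(b) for b in budgets])

In the paper both tiers are feedback controllers; the one-shot proportional split above only conveys the structure.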


International Conference on Parallel Architectures and Compilation Techniques | 2012

PEPON: performance-aware hierarchical power budgeting for NoC based multicores

Akbar Sharifi; Asit K. Mishra; Shekhar Srikantaiah; Mahmut T. Kandemir; Chita R. Das

Targeting NoC based multicores, we propose a two-level power budget distribution mechanism, called PEPON, where the first level distributes the overall power budget of the multicore system among various types of on-chip resources like the cores, caches, and NoC, and the second level determines the allocation of power to individual instances of each type of resource. Both these distributions are oriented towards maximizing workload performance without exceeding the specified power budget. Extensive experimental evaluations of the proposed power distribution scheme using a full system simulation and detailed power models emphasize the importance of power budget partitioning at both levels. Specifically, our results show that the proposed scheme can provide up to 29% performance improvement as compared to no power budgeting, and performs 13% better than a competing scheme, under the same chip-wide power cap.
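
A hypothetical sketch of the two-level split: level one divides the chip budget across resource classes (cores, caches, NoC), and level two divides each class budget across its instances. Both levels use the same proportional rule here, weighted by assumed performance sensitivities; the real scheme's performance-aware distribution is more sophisticated.

    # Hypothetical proportional budget split; weights stand in for measured
    # performance sensitivity to power, which the real scheme would estimate.
    def split(budget, weights):
        total = sum(weights)
        return [budget * w / total for w in weights]

    chip_budget = 60.0
    core_b, cache_b, noc_b = split(chip_budget, [5.0, 3.0, 2.0])  # level 1
    per_core = split(core_b, [2.0, 1.0, 1.0, 1.0])                # level 2: four cores
    print(core_b, cache_b, noc_b, per_core)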


Programming Language Design and Implementation | 2010

Cache topology aware computation mapping for multicores

Mahmut T. Kandemir; Taylan Yemliha; SaiPrashanth Muralidhara; Shekhar Srikantaiah; Mary Jane Irwin; Yuanrui Zhang

The main contribution of this paper is a compiler based, cache topology aware code optimization scheme for emerging multicore systems. This scheme distributes the iterations of a loop to be executed in parallel across the cores of a target multicore machine and schedules the iterations assigned to each core. Our goal is to improve the utilization of the on-chip multi-layer cache hierarchy and to maximize overall application performance. We evaluate our cache topology aware approach using a set of twelve applications and three different commercial multicore machines. In addition, to study some of our experimental parameters in detail and to explore future multicore machines (with higher core counts and deeper on-chip cache hierarchies), we also conduct a simulation based study. The results collected from our experiments with three Intel multicore machines show that the proposed compiler-based approach is very effective in enhancing performance. In addition, our simulation results indicate that optimizing for the on-chip cache hierarchy will be even more important in future multicores with increasing numbers of cores and cache levels.
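
The core idea can be illustrated with a toy mapper: iterations that touch the same data block are co-located on cores that share a cache slice, so reuse is captured on-chip. The two-pair topology and blocking function below are assumptions for illustration, not the compiler's actual algorithm.

    # Hypothetical topology: two pairs of cores, each pair sharing an L2 slice.
    SHARED_L2 = [(0, 1), (2, 3)]

    def map_iterations(iters, block_of):
        groups = {}
        for i in iters:                                   # group iterations by the
            groups.setdefault(block_of(i), []).append(i)  # data block they touch
        placement = {}
        for idx, group in enumerate(groups.values()):
            pair = SHARED_L2[idx % len(SHARED_L2)]        # one slice per block
            for k, i in enumerate(group):
                placement[i] = pair[k % 2]                # alternate the pair's cores
        return placement

    print(map_iterations(range(8), block_of=lambda i: i // 4))
    # {0: 0, 1: 1, 2: 0, 3: 1, 4: 2, 5: 3, 6: 2, 7: 3}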


High-Performance Computer Architecture | 2011

MorphCache: A Reconfigurable Adaptive Multi-level Cache hierarchy

Shekhar Srikantaiah; Emre Kultursay; Tao Zhang; Mahmut T. Kandemir; Mary Jane Irwin; Yuan Xie

Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. MorphCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist on the same architecture. Starting from per-core L2 and L3 cache slices as the basic design point, MorphCache alters the cache topology dynamically by merging or splitting cache slices and modifying the accessibility of different cache slice groups to different cores in a CMP. We evaluated MorphCache on a 16 core CMP on a full system simulator and found that it significantly improves both average throughput and harmonic mean of speedups of diverse multithreaded and multiprogrammed workloads. Specifically, our results show that MorphCache improves throughput of the multiprogrammed mixes by 29.9% over a topology with all-shared L2 and L3 caches and 27.9% over a topology with per core private L2 cache and shared L3 cache. In addition, we also compared MorphCache to partitioning a single shared cache at each level using promotion/insertion pseudo-partitioning (PIPP) [28] and managing per-core private cache at each level using dynamic spill receive caches (DSR) [18]. We found that MorphCache improves average throughput by 6.6% over PIPP and by 5.7% over DSR when applied to both L2 and L3 caches.
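
A hypothetical sketch of the merge decision: per-core slices start private and are merged into shared groups when two cores exhibit heavy data sharing. The threshold and sharing metric are invented; MorphCache also supports splitting groups back apart, omitted here for brevity.

    # Hypothetical slice-group bookkeeping for MorphCache-style merging.
    class SliceGroups:
        def __init__(self, cores):
            self.groups = [{c} for c in cores]     # start: one private slice per core

        def merge(self, a, b):
            ga = next(g for g in self.groups if a in g)
            gb = next(g for g in self.groups if b in g)
            if ga is not gb:
                self.groups.remove(gb)
                ga |= gb                           # cores in ga now share one slice group

        def adapt(self, sharing):                  # sharing[(a, b)]: fraction of accesses
            for (a, b), f in sharing.items():      # touching data common to cores a and b
                if f > 0.3:                        # assumed merge threshold
                    self.merge(a, b)

    g = SliceGroups(range(4))
    g.adapt({(0, 1): 0.5, (2, 3): 0.1})
    print(g.groups)                                # [{0, 1}, {2}, {3}]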


IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

A case for integrated processor-cache partitioning in chip multiprocessors

Shekhar Srikantaiah; Reetuparna Das; Asit K. Mishra; Chita R. Das; Mahmut T. Kandemir

Existing cache partitioning schemes are designed in a manner oblivious to the implicit processor partitioning enforced by the operating system. This paper examines an operating system directed integrated processor-cache partitioning scheme that partitions both the available processors and the shared cache in a chip multiprocessor among different multi-threaded applications. Extensive simulations using a set of multiprogrammed workloads show that our integrated processor-cache partitioning scheme achieves better performance isolation than state-of-the-art hardware/software based solutions. Specifically, our integrated processor-cache partitioning approach performs, on average, 20.83% and 14.14% better than equal partitioning and the implicit partitioning enforced by the underlying operating system, respectively, on the fair speedup metric on an 8 core system. We also compare our approach to processor partitioning alone and to a state-of-the-art cache partitioning scheme; our scheme fares 8.21% and 9.19% better than these schemes, respectively, on a 16 core system.
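
For two applications the joint search space is small enough to enumerate, as the sketch below does with an invented performance model: every (cores, ways) split is scored by the fair speedup metric (harmonic mean of per-application speedups over a run-alone baseline), and the best split wins. Everything except the metric's definition is an assumption here.

    from itertools import product

    def perf(cores, ways):
        # Invented model with diminishing returns in both resources.
        return (cores ** 0.7) * (1 + 0.05 * ways)

    def fair_speedup(allocs, baselines):
        # Harmonic mean of per-app speedups relative to running alone.
        sp = [perf(c, w) / base for (c, w), base in zip(allocs, baselines)]
        return len(sp) / sum(1.0 / s for s in sp)

    best = max(
        (((c, w), (8 - c, 16 - w)) for c, w in product(range(1, 8), range(1, 16))),
        key=lambda split: fair_speedup(split, baselines=[perf(8, 16)] * 2),
    )
    print(best)   # the cores/ways split that maximizes fair speedup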


International Symposium on Computer Architecture | 2014

HIOS: a host interface I/O scheduler for solid state disks

Myoungsoo Jung; Wonil Choi; Shekhar Srikantaiah; Joonhyuk Yoo; Mahmut T. Kandemir

Garbage collection (GC) and resource contention on I/O buses (channels) are among the critical bottlenecks in Solid State Disks (SSDs) that cannot be easily hidden. Most existing I/O scheduling algorithms in the host interface logic (HIL) of state-of-the-art SSDs are oblivious to such low-level performance bottlenecks in SSDs. As a result, SSDs may violate quality of service (QoS) requirements by not being able to meet the deadlines of I/O requests. In this paper, we propose a novel host interface I/O scheduler that is both GC-aware and QoS-aware. The proposed scheduler redistributes the GC overheads across non-critical I/O requests and reduces channel resource contention. Our experiments with workloads from various application domains reveal that the proposed scheduler reduces the standard deviation for latency over state-of-the-art I/O schedulers used in the HIL by 52.5%, and the worst-case latency by 86.6%. In addition, for I/O requests with sizes smaller than a superpage, our proposed scheduler avoids channel resource conflicts and reduces latency by 29.2% compared to the state-of-the-art.
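
The redistribution idea fits in a few lines, sketched below under invented costs: requests are served earliest-deadline-first, and a pending garbage collection is charged to the first request with enough slack to absorb it, so deadline-critical requests never pay for GC. HIOS itself also handles channel contention, which this sketch omits.

    import heapq

    GC_COST, SERVICE = 3.0, 1.0        # assumed time units per GC and per request

    def schedule(requests, gc_pending):
        heap = list(requests)          # (deadline, request id) pairs
        heapq.heapify(heap)            # earliest deadline first
        now, order = 0.0, []
        while heap:
            deadline, req = heapq.heappop(heap)
            slack = deadline - (now + SERVICE)
            if gc_pending and slack >= GC_COST:
                now += GC_COST         # non-critical request absorbs the GC pause
                gc_pending = False
            now += SERVICE
            order.append((req, now, now <= deadline))   # (id, finish, met deadline?)
        return order

    print(schedule([(10.0, "a"), (2.0, "b"), (9.0, "c")], gc_pending=True))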


Design Automation Conference | 2012

Courteous cache sharing: being nice to others in capacity management

Akbar Sharifi; Shekhar Srikantaiah; Mahmut T. Kandemir; Mary Jane Irwin

This paper proposes a cache management scheme for multiprogrammed, multithreaded applications, with the objective of obtaining maximum performance for both individual applications and the multithreaded workload mix. In this scheme, each individual application's performance is improved by increasing the priority of its slowest thread, while the overall system performance is improved by ensuring that each individual application's performance benefit does not come at the cost of a significant degradation to other applications' threads that are sharing the same cache. Averaged over six workloads, our shared cache management scheme improves the performance of the combination of applications by 18%. These improvements across applications in each mix are also fair, as indicated by average fair speedup improvements of 10% across the threads of each application (averaged over all the workloads).
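
The policy reduces to a small rule, sketched here with invented metrics: boost the slowest thread of each application, but only while the measured slowdown imposed on co-running applications stays under a courtesy threshold; otherwise back the boost off. The 5% threshold and the data shapes are assumptions.

    def adjust_priorities(apps, prio, degradation, threshold=0.05):
        # apps: {app: {thread: progress}}; degradation[app]: fractional slowdown
        # that app's boost currently imposes on other applications' threads.
        for app, threads in apps.items():
            slowest = min(threads, key=threads.get)
            if degradation[app] <= threshold:
                prio[slowest] += 1                 # be nice only while it is harmless
            elif prio[slowest] > 0:
                prio[slowest] -= 1                 # back off: a neighbor is suffering
        return prio

    prio = {"a0": 0, "a1": 0, "b0": 0}
    apps = {"A": {"a0": 0.7, "a1": 0.4}, "B": {"b0": 0.9}}
    print(adjust_priorities(apps, prio, degradation={"A": 0.02, "B": 0.08}))
    # {'a0': 0, 'a1': 1, 'b0': 0}: A's laggard is boosted; B's boost is withheld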

Collaboration


Dive into Shekhar Srikantaiah's collaboration.

Top Co-Authors:

- Mahmut T. Kandemir, Pennsylvania State University
- Chita R. Das, Pennsylvania State University
- Mary Jane Irwin, Pennsylvania State University
- Akbar Sharifi, Pennsylvania State University
- Ramya Prabhakar, Pennsylvania State University
- Christina M. Patrick, Pennsylvania State University
- Rajat Garg, Pennsylvania State University