Publication


Featured research published by Sri Hari Krishna Narayanan.


design, automation, and test in europe | 2009

Process variation aware thread mapping for chip multiprocessors

Shengyan Hong; Sri Hari Krishna Narayanan; Mahmut T. Kandemir; Ozcan Ozturk

With the increasing scaling of manufacturing technology, process variation has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs), for example, it is possible that identically designed processor cores on the same chip have non-identical peak frequencies and power consumptions. To cope with such a design, each processor can be assumed to run at the frequency of the slowest processor, resulting in wasted computational capability. This paper considers an alternative approach and proposes an algorithm that intelligently maps (and remaps) computations onto available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than simply using the chip-wide lowest frequency among all cores and the highest cache latency. Experimental evidence shows that, compared to a process-variation-agnostic thread mapping strategy, our proposed scheme achieves as much as a 29% improvement in overall execution latency, with an average improvement of 13% over the benchmarks tested. We also demonstrate that our savings are consistent across different processor counts, latency maps, and latency distributions.
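The core intuition of such a variation-aware mapping can be sketched greedily: hand the heaviest threads to the fastest cores. This is a minimal illustration, not the paper's actual algorithm; the workload and frequency values are hypothetical.

```python
def map_threads(thread_work, core_freq):
    """Greedily map the heaviest threads to the fastest cores.

    thread_work: per-thread workloads (e.g., cycles)
    core_freq:   per-core peak frequencies (e.g., GHz), which differ
                 under process variation
    Returns a dict: thread index -> core index.
    """
    threads = sorted(range(len(thread_work)),
                     key=lambda t: thread_work[t], reverse=True)
    cores = sorted(range(len(core_freq)),
                   key=lambda c: core_freq[c], reverse=True)
    # pair them up in rank order: heaviest work on the fastest core
    return {t: c for t, c in zip(threads, cores)}
```

A real scheduler would re-run this mapping at runtime as thread workloads change, which is the "remapping" aspect the abstract describes.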


international symposium on microarchitecture | 2009

Optimizing shared cache behavior of chip multiprocessors

Mahmut T. Kandemir; Sai Prashanth Muralidhara; Sri Hari Krishna Narayanan; Yuanrui Zhang; Ozcan Ozturk

One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single-processor-centric data locality optimization schemes may not work well in the CMP case, as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler-directed code restructuring scheme for enhancing locality of shared data in CMPs. The proposed scheme targets the last-level shared cache that exists in many commercial CMPs and has two components, namely, allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps to reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme reduces inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme.
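The allocation/scheduling split can be sketched as follows, assuming each iteration touches one known data block (a simplification of what the paper's compiler analysis derives): spread each block's iterations across cores, then order every core's list by block so cores tend to touch the same block concurrently.

```python
def allocate_and_schedule(iter_blocks, num_cores):
    """iter_blocks: iter_blocks[i] is the data block touched by iteration i.

    Allocation: the iterations touching each shared block are spread
    across cores. Scheduling: each core's iterations are ordered by block
    id, so all cores tend to work on the same block at the same time,
    shortening reuse distances in the shared cache.
    Returns one ordered iteration list per core.
    """
    per_core = [[] for _ in range(num_cores)]
    # group iterations by the block they touch
    by_block = {}
    for it, blk in enumerate(iter_blocks):
        by_block.setdefault(blk, []).append(it)
    # visit blocks in order, dealing each block's iterations round-robin
    for blk in sorted(by_block):
        for i, it in enumerate(by_block[blk]):
            per_core[i % num_cores].append(it)
    return per_core
```

In the real scheme, data dependencies constrain how far this reordering can go; the sketch ignores dependencies entirely.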


international symposium on low power electronics and design | 2006

Minimizing energy consumption of banked memories using data recomputation

Hakduran Koc; Ozcan Ozturk; Mahmut T. Kandemir; Sri Hari Krishna Narayanan; Ehat Ercanli

Banking has been identified as an effective method for reducing memory energy. We propose a novel approach that improves the energy effectiveness of a banked memory architecture by performing extra computations when doing so makes it unnecessary to reactivate a bank that is in the low-power operating mode. More specifically, when an access is to be made to a bank that is in the low-power mode, our approach first checks whether the data required from that bank can be recomputed from the data currently stored in already-active banks. If so, we do not turn on the bank in question and instead recalculate the value of the requested data using the values stored in the active banks. Given that the contribution of leakage to the overall energy budget keeps increasing, the proposed approach has the potential to become even more attractive in the future. Our experimental results collected so far clearly show that this recomputation-based approach can reduce energy consumption significantly.
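The decision logic can be sketched as below. This is a hypothetical model, not the paper's implementation: `recompute_rules` stands in for whatever recomputation opportunities the compiler or hardware identifies.

```python
def read(addr, banks, recompute_rules):
    """banks: bank_id -> {'active': bool, 'data': {addr: value}}
    recompute_rules: addr -> (source addrs, function) allowing the value
    to be rebuilt from other locations.
    Returns (value, woke_bank)."""
    bank = next(b for b in banks.values() if addr in b['data'])
    if bank['active']:
        return bank['data'][addr], False
    # bank is in low-power mode: try to recompute from active banks first
    if addr in recompute_rules:
        srcs, fn = recompute_rules[addr]
        vals = []
        for s in srcs:
            sb = next((b for b in banks.values()
                       if s in b['data'] and b['active']), None)
            if sb is None:
                break  # a source is not in an active bank; give up
            vals.append(sb['data'][s])
        else:
            return fn(*vals), False  # recomputed; bank stays asleep
    bank['active'] = True  # fall back: pay the reactivation cost
    return bank['data'][addr], True
```

Recomputation only pays off when the extra arithmetic costs less energy than waking the bank, a tradeoff the sketch does not model.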


international symposium on quality electronic design | 2006

Compiler-Directed Power Density Reduction in NoC-Based Multi-Core Designs

Sri Hari Krishna Narayanan; Mahmut T. Kandemir; Ozcan Ozturk

As transistor counts keep increasing and clock frequencies rise, high power consumption is becoming one of the most important obstacles preventing further scaling and performance improvements. While high power consumption brings many problems with it, high power density and thermal hotspots are perhaps two of the most important ones. Current architectures provide several circuit-based solutions to cope with thermal emergencies when they occur, but exercising them frequently can lead to significant performance losses. This paper proposes a compiler-based approach that balances the computational workload across the processors of a NoC-based chip multiprocessor such that the chances of experiencing a thermal emergency at runtime are reduced. Our results show that the proposed approach cuts the number of runtime thermal emergencies by 42% on average over the benchmarks tested.


international conference on computer design | 2005

Temperature-sensitive loop parallelization for chip multiprocessors

Sri Hari Krishna Narayanan; Guilin Chen; Mahmut T. Kandemir; Yuan Xie

In this paper, we present and evaluate three temperature-sensitive loop parallelization strategies for array-intensive applications executed on chip multiprocessors in order to reduce the peak temperature. Our experimental results show that the peak (average) temperature can be reduced by 20.9°C (4.3°C) when averaged over all the applications tested, incurring small performance/power penalties.


international symposium on quality electronic design | 2008

A Scratch-Pad Memory Aware Dynamic Loop Scheduling Algorithm

Ozcan Ozturk; Mahmut T. Kandemir; Sri Hari Krishna Narayanan

Executing array-based applications on a chip multiprocessor requires effective loop parallelization techniques. One of the critical issues that needs to be tackled by an optimizing compiler in this context is loop scheduling, which distributes the iterations of a loop to be executed in parallel across the available processors. Most of the existing work in this area targets cache-based execution platforms. In comparison, this paper proposes the first dynamic loop scheduler, to our knowledge, that targets scratch-pad memory (SPM) based chip multiprocessors, and presents an experimental evaluation of it. The main idea behind our approach is to identify the set of loop iterations that access the SPM and those that do not. This information is exploited at runtime to balance the loads of the processors involved in executing the loop nest at hand. Therefore, the proposed dynamic scheduler takes advantage of the SPM in performing the loop iteration-to-processor mapping. Our experimental evaluation with eight array/loop-intensive applications reveals that the proposed scheduler is very effective in practice and brings between 13.7% and 41.7% performance savings over a static loop scheduling scheme, which is also tested in our experiments.
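The load-balancing idea can be sketched as a cost-weighted greedy assignment. The cost values and the per-iteration SPM classification are hypothetical inputs; the paper's scheduler derives this information dynamically.

```python
def spm_aware_schedule(iters_in_spm, num_cores, spm_cost=1, mem_cost=10):
    """iters_in_spm: iters_in_spm[i] is True if iteration i's data
    resides in the scratch-pad memory.

    Hand each iteration to the currently least-loaded core, weighting
    SPM hits and off-chip accesses differently so that cores stuck with
    slow off-chip iterations receive fewer of them.
    Returns (per-core iteration lists, per-core total costs).
    """
    loads = [0] * num_cores
    assign = [[] for _ in range(num_cores)]
    for it, in_spm in enumerate(iters_in_spm):
        cost = spm_cost if in_spm else mem_cost
        c = loads.index(min(loads))  # least-loaded core so far
        assign[c].append(it)
        loads[c] += cost
    return assign, loads
```

A static scheduler would split iterations evenly by count; weighting by access cost is what makes the schedule SPM-aware.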


Fourth International IEEE Security in Storage Workshop | 2007

Securing Disk-Resident Data through Application Level Encryption

Ramya Prabhakar; Seung Woo Son; Christina M. Patrick; Sri Hari Krishna Narayanan; Mahmut T. Kandemir

Confidentiality of disk-resident data is critical for end-to-end security of storage systems. While there are several widely used mechanisms for ensuring confidentiality of data in transit, techniques for providing confidentiality when data is stored in a disk subsystem are relatively new. As opposed to prior file-system-based approaches to this problem, this paper proposes an application-level solution, which allows encryption of select data blocks. We make three major contributions: 1) quantifying the tradeoffs between confidentiality and performance; 2) evaluating a reuse-distance-oriented approach for selective encryption of disk-resident data; and 3) proposing a profile-guided approach that approximates the behavior of the reuse-distance-oriented approach. The experiments with five applications that manipulate disk-resident data sets clearly show that our approach enables us to study the confidentiality/performance tradeoffs. Using our approach, it is possible to reduce the performance degradation due to encryption/decryption overheads by 46.5% on average when DES is used as the encryption mechanism, and by 30.63% when AES is used.
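One plausible reading of the reuse-distance heuristic can be sketched as follows (an illustration under assumed semantics, not the paper's exact selection rule): blocks that are reused rarely, or only at long distances, incur encryption overhead infrequently, so they are cheap candidates for protection.

```python
def blocks_to_encrypt(trace, threshold):
    """trace: sequence of accessed block ids, in access order.

    Select blocks whose average reuse distance exceeds `threshold`
    (or that are never reused): encrypting them adds little runtime
    overhead while still protecting the data at rest.
    """
    last_seen, dists = {}, {}
    for pos, blk in enumerate(trace):
        if blk in last_seen:
            dists.setdefault(blk, []).append(pos - last_seen[blk])
        last_seen[blk] = pos
    selected = set()
    for blk in last_seen:
        ds = dists.get(blk)
        if ds is None or sum(ds) / len(ds) > threshold:
            selected.add(blk)  # never reused, or reused only rarely
    return selected
```

The profile-guided variant in the paper would compute a selection like this offline from a training run instead of tracking distances online.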


languages, compilers, and tools for embedded systems | 2010

Compiler directed network-on-chip reliability enhancement for chip multiprocessors

Ozcan Ozturk; Mahmut T. Kandemir; Mary Jane Irwin; Sri Hari Krishna Narayanan

Chip multiprocessors (CMPs) are expected to be the building blocks for future computer systems. While architecting these emerging CMPs is a challenging problem on its own, programming them is even more challenging. As the number of cores accommodated in chip multiprocessors increases, network-on-chip (NoC) type communication fabrics are expected to replace traditional point-to-point buses. Most prior software-related work targeting CMPs focuses on performance and power aspects. However, as technology scales, components of a CMP are being increasingly exposed to both transient and permanent hardware failures. This paper presents and evaluates a compiler-directed, power- and performance-aware reliability enhancement scheme for network-on-chip (NoC) based chip multiprocessors (CMPs). The proposed scheme improves on-chip communication reliability by duplicating messages traveling across CMP nodes such that, for each original message, its duplicate uses a different set of communication links as much as possible (to satisfy the performance constraint). In addition, our approach tries to reuse communication links across the different phases of the program to maximize link shutdown opportunities for the NoC (to satisfy the power constraint). Our results show that the proposed approach is very effective in improving on-chip network reliability, without causing excessive power or performance degradation. In our experiments, we also evaluate the performance-oriented and energy-oriented versions of our compiler-directed reliability enhancement scheme, and compare them to two pure hardware-based fault-tolerant routing schemes.
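One standard way to give a duplicate message a disjoint set of links in a 2D mesh is to route the original X-first and the copy Y-first; this sketch illustrates that idea and is not the paper's specific routing algorithm.

```python
def xy_route(src, dst):
    """Dimension-ordered route in a 2D mesh: X first, then Y.
    Returns the list of directed links ((x1, y1), (x2, y2)) traversed."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dx > sx else -1
    for x in range(sx, dx, step):
        hops.append(((x, sy), (x + step, sy)))
    step = 1 if dy > sy else -1
    for y in range(sy, dy, step):
        hops.append(((dx, y), (dx, y + step)))
    return hops

def yx_route(src, dst):
    """Y first, then X: link-disjoint from the XY route whenever src and
    dst differ in both dimensions, so the duplicate message survives the
    failure of any single link on the original path."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dy > sy else -1
    for y in range(sy, dy, step):
        hops.append(((sx, y), (sx, y + step)))
    step = 1 if dx > sx else -1
    for x in range(sx, dx, step):
        hops.append(((x, dy), (x + step, dy)))
    return hops
```

When source and destination share a row or column, full disjointness is impossible with minimal routes, which is why the abstract says "as much as possible".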


IEEE International SOC Conference | 2005

Workload Clustering for Increasing Energy Savings on Embedded MPSoCs

Sri Hari Krishna Narayanan; Ozcan Ozturk; Mahmut T. Kandemir; Mustafa Karaköy

Voltage/frequency scaling and processor low-power modes (i.e., processor shutdown) are two important mechanisms used for reducing energy consumption in embedded MPSoCs. While a unified scheme that combines these two mechanisms can achieve significant savings in some cases, such an approach is limited by the code parallelization strategy employed. In this paper, we propose an integer linear programming (ILP) based workload clustering strategy across parallel processors, oriented towards maximizing the number of idle processors without impacting original execution times. These idle processors can then be switched to a low-power mode to maximize energy savings, whereas the remaining ones can make use of voltage/frequency scaling. In order to check whether this approach brings any energy benefits over a pure voltage-scaling-based scheme, a pure processor-shutdown-based scheme, or a simple unified scheme, we implemented four different approaches and tested them using a set of eight array/loop-intensive embedded applications. Our simulation-based analysis reveals that the proposed ILP-based approach: (1) is very effective in reducing the energy consumption of the applications tested; and (2) generates much better energy savings than all the alternative schemes tested (including a unified scheme that combines voltage/frequency scaling and processor shutdown).
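The clustering objective can be approximated with a simple first-fit-decreasing heuristic (the paper solves the problem exactly with ILP; this greedy sketch only illustrates the goal of freeing processors for shutdown):

```python
def cluster_workloads(loads, deadline):
    """loads: per-processor workloads under the original parallelization.
    Pack the workloads onto as few processors as possible without any
    processor's total exceeding `deadline` (the original execution time);
    every processor left empty can be switched to a low-power mode.
    Returns (per-bin workload lists, number of idle processors).
    """
    bins = []
    for w in sorted(loads, reverse=True):  # first-fit decreasing
        for b in bins:
            if sum(b) + w <= deadline:
                b.append(w)
                break
        else:
            bins.append([w])  # no existing processor fits; open a new one
    idle = len(loads) - len(bins)
    return bins, idle
```

The ILP formulation additionally decides voltage/frequency levels for the processors that remain active, which this sketch omits.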


high performance embedded architectures and compilers | 2008

In-Network Caching for Chip Multiprocessors

Aditya Yanamandra; Mary Jane Irwin; Vijaykrishnan Narayanan; Mahmut T. Kandemir; Sri Hari Krishna Narayanan

Effective management of data is critical to the performance of emerging multi-core architectures. Our analysis of applications from SpecOMP reveals that a small fraction of shared addresses accounts for a large portion of accesses. Utilizing this observation, we propose a technique that augments a router in an on-chip network with a small data store to reduce the memory access latency of shared data. In the proposed technique, shared data from read-response packets that pass through the router are cached in its data store to reduce the number of hops required to service future read requests. Our limit study reveals that such caching has the potential to reduce memory access latency by 27% on average. Further, two practical caching strategies are shown to reduce memory access latency by 14% and 17%, respectively, with a data store of just four entries at 2.5% area overhead.
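The router-side data store can be sketched as a tiny LRU cache. The eviction policy and entry count here are assumptions for illustration; the paper evaluates specific caching strategies with a four-entry store.

```python
from collections import OrderedDict

class RouterDataStore:
    """A tiny LRU store attached to a NoC router: read responses passing
    through are cached, and later read requests for the same shared
    address can be answered here, saving the remaining hops to memory."""

    def __init__(self, entries=4):
        self.entries = entries
        self.store = OrderedDict()

    def on_response(self, addr, data):
        """A read-response packet for `addr` is passing through."""
        self.store[addr] = data
        self.store.move_to_end(addr)
        if len(self.store) > self.entries:
            self.store.popitem(last=False)  # evict least recently used

    def on_request(self, addr):
        """A read-request packet for `addr` arrived at this router."""
        if addr in self.store:
            self.store.move_to_end(addr)
            return self.store[addr]  # hit: reply from the router
        return None  # miss: forward the request toward memory
```

Because a few shared addresses dominate the access stream, even a four-entry store can intercept a meaningful fraction of read requests.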

Collaboration


Dive into Sri Hari Krishna Narayanan's collaborations.

Top Co-Authors

Mahmut T. Kandemir (Pennsylvania State University)
Mary Jane Irwin (Pennsylvania State University)
Yuanrui Zhang (Pennsylvania State University)
Aditya Yanamandra (Pennsylvania State University)
Christina M. Patrick (Pennsylvania State University)
Guilin Chen (Pennsylvania State University)