Publications


Featured research published by Sai Prashanth Muralidhara.


International Symposium on Microarchitecture | 2009

Optimizing shared cache behavior of chip multiprocessors

Mahmut T. Kandemir; Sai Prashanth Muralidhara; Sri Hari Krishna Narayanan; Yuanrui Zhang; Ozcan Ozturk

One of the critical problems associated with emerging chip multiprocessors (CMPs) is the management of on-chip shared cache space. Unfortunately, single-processor-centric data locality optimization schemes may not work well in the CMP case, as data accesses from multiple cores can create conflicts in the shared cache space. The main contribution of this paper is a compiler-directed code restructuring scheme for enhancing the locality of shared data in CMPs. The proposed scheme targets the last-level shared cache that exists in many commercial CMPs and has two components: allocation, which determines the set of loop iterations assigned to each core, and scheduling, which determines the order in which the iterations assigned to a core are executed. Our scheme restructures the application code such that the different cores operate on shared data blocks at the same time, to the extent allowed by data dependencies. This helps reduce reuse distances for the shared data and improves on-chip cache performance. We evaluated our approach using the Splash-2 and Parsec applications through both simulations and experiments on two commercial multi-core machines. Our experimental evaluation indicates that the proposed data locality optimization scheme reduces inter-core conflict misses in the shared cache by 67% on average when both allocation and scheduling are used. Also, the execution time improvements we achieve (29% on average) are very close to the optimal savings that could be achieved using a hypothetical scheme.
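
As a rough illustration of the allocation/scheduling split described above, here is a minimal Python sketch (all structures and policies are hypothetical, not the paper's implementation): allocation assigns loop iterations to cores, and scheduling orders each core's iterations by the shared block they touch, so all cores sweep the shared data in the same global order and reuse distances shrink.

```python
def allocate(iterations, num_cores):
    """Cyclic assignment of loop iterations to cores (illustrative policy)."""
    return {c: [it for it in iterations if it["id"] % num_cores == c]
            for c in range(num_cores)}

def schedule(per_core):
    """Order each core's iterations by the shared block they touch, so the
    cores operate on the same blocks at roughly the same time."""
    return {c: sorted(its, key=lambda it: it["block"])
            for c, its in per_core.items()}

# Toy iteration space: each iteration reads one shared data block.
iterations = [{"id": i, "block": (i * 3) % 4} for i in range(16)]
print(schedule(allocate(iterations, num_cores=2)))
```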


International Parallel and Distributed Processing Symposium | 2010

Intra-application cache partitioning

Sai Prashanth Muralidhara; Mahmut T. Kandemir; Padma Raghavan

Efficient management of shared on-chip resources such as the shared level 2 (L2) cache has become an important problem with the emergence of chip multiprocessors (CMPs). Partitioning the shared cache in CMPs among concurrently executing applications can provide important benefits such as throughput improvement, fairness guarantees, and quality of service (QoS) enhancements. We pose a related question: can partitioning the shared cache space among concurrently executing threads of the same application enhance the application's performance? We address this problem by identifying and speeding up the slowest thread, also termed the critical-path thread, during each execution interval, since the overall performance of a multithreaded application is determined by its critical-path thread. To do so, we propose a dynamic, runtime-system-based cache partitioning scheme that partitions the shared cache space dynamically among the individual threads of a given application. In a nutshell, we take some cache space away from the faster threads and give it to the critical-path thread at each execution interval. We show that speeding up the critical-path thread in this way results in overall performance enhancement of the application execution in the long term. Our experimental evaluation indicates that the proposed dynamic cache partitioning scheme yields benefits of up to 15% over a shared cache with no partitions, up to 23% over a statically partitioned cache (private cache), and up to 20% over a throughput-oriented scheme.
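
The interval-based idea lends itself to a small sketch. The fragment below (the names, way counts, and progress metric are assumptions for illustration, not the paper's exact mechanism) moves one cache way per interval from the fastest thread to the critical-path thread:

```python
def repartition(ways, progress):
    """ways: thread -> allocated L2 ways; progress: thread -> work completed
    in the last interval (lowest progress = critical-path thread)."""
    slowest = min(progress, key=progress.get)
    fastest = max(progress, key=progress.get)
    if fastest != slowest and ways[fastest] > 1:
        ways[fastest] -= 1   # take a way from the fastest thread...
        ways[slowest] += 1   # ...and give it to the critical path
    return ways

ways = {"t0": 4, "t1": 4, "t2": 4, "t3": 4}
progress = {"t0": 120, "t1": 95, "t2": 60, "t3": 110}  # t2 is critical
print(repartition(ways, progress))  # one way moves from t0 to t2
```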


Design Automation Conference | 2009

Dynamic thread and data mapping for NoC based CMPs

Mahmut T. Kandemir; Ozcan Ozturk; Sai Prashanth Muralidhara

Thread mapping and data mapping are two important problems in the context of NoC (network-on-chip) based CMPs (chip multiprocessors). While a compiler can determine suitable mappings for data and threads, such static mappings may not work well for multithreaded applications that go through different phases during their execution, each with potentially different data access patterns. Instead, a dynamic mapping strategy, if its overheads can be kept low, may be a more promising option. In this work, we present dynamic (runtime) thread and data mappings for NoC-based CMPs. The goal of these mappings is to reduce the distance between the core that requests data and the core whose local memory contains the requested data. In our experiments, we evaluate our proposed thread mapping and data mapping both in isolation and in an integrated manner.
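
To make the distance objective concrete, here is a hedged sketch (not the paper's algorithm): on a 2D mesh, a reasonable home tile for a data block is the one that minimizes the hop-weighted request traffic from the cores that access it.

```python
def hops(a, b):
    """Manhattan distance between two tiles on a 2D mesh NoC."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_home(access_counts, tiles):
    """Pick the tile minimizing total (accesses x hops) for one data block.
    access_counts: tile -> number of requests from the core on that tile."""
    return min(tiles, key=lambda t: sum(n * hops(t, src)
                                        for src, n in access_counts.items()))

tiles = [(x, y) for x in range(4) for y in range(4)]
accesses = {(0, 0): 50, (3, 3): 5}  # mostly requested by the corner core
print(best_home(accesses, tiles))   # home lands at (0, 0)
```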


Design, Automation and Test in Europe | 2010

A special-purpose compiler for look-up table and code generation for function evaluation

Yuanrui Zhang; Lanping Deng; Praveen Yedlapalli; Sai Prashanth Muralidhara; Hui Zhao; Mahmut T. Kandemir; Chaitali Chakrabarti; Nikos P. Pitsianis; Xiaobai Sun

Elementary functions are extensively used in computer graphics, signal and image processing, and communication systems. This paper presents a special-purpose compiler that automatically generates customized look-up tables and implementations for elementary functions under user-specified constraints. The generated implementations include C/C++ code that can be used directly by applications running on multicores, as well as MATLAB-like code that can be translated directly into a hardware module on FPGA platforms. The experimental results show that our solutions for function evaluation bring significant performance improvements to applications on multicores as well as significant resource savings to designs on FPGAs.
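
As a toy illustration of table-based function evaluation under an error constraint (the uniform sampling and doubling loop are assumptions; the actual compiler explores much richer table and precision trade-offs), the sketch below grows a look-up table until piecewise-linear interpolation meets a user-given error bound:

```python
import math

def _interp(tbl, lo, hi, x):
    """Piecewise-linear interpolation over a uniform table on [lo, hi]."""
    pos = (x - lo) / (hi - lo) * (len(tbl) - 1)
    i = min(int(pos), len(tbl) - 2)
    frac = pos - i
    return tbl[i] * (1 - frac) + tbl[i + 1] * frac

def build_lut(f, lo, hi, max_err):
    """Double the table size until interpolation meets the error bound."""
    n = 8
    while True:
        tbl = [f(lo + (hi - lo) * i / n) for i in range(n + 1)]
        probes = [lo + (hi - lo) * k / (8 * n) for k in range(8 * n + 1)]
        if max(abs(f(x) - _interp(tbl, lo, hi, x)) for x in probes) <= max_err:
            return tbl
        n *= 2

tbl = build_lut(math.sin, 0.0, math.pi / 2, max_err=1e-4)
print(len(tbl), _interp(tbl, 0.0, math.pi / 2, 0.7), math.sin(0.7))
```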


International Conference on Parallel Architectures and Compilation Techniques | 2008

Profiler and compiler assisted adaptive I/O prefetching for shared storage caches

Seung Woo Son; Sai Prashanth Muralidhara; Ozcan Ozturk; Mahmut T. Kandemir; Ibrahim Kolcu; Mustafa Karaköy

I/O prefetching has been employed in the past as one of the mechanisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks, because prefetches from different CPUs can interact in the shared memory caches of the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching, developed originally in the context of sequential execution, on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness drops significantly as the number of CPUs increases; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources of this performance loss at higher CPU counts; and (iii) propose and experimentally evaluate a profiler- and compiler-assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread to each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underlines the importance of adapting I/O prefetching to program phases. Specifically, when 8 CPUs are used, our proposed scheme improves performance, on average, by 19.9%, 11.9%, and 10.3% over the cases with no I/O prefetching, with independent I/O prefetching (each CPU performs compiler-directed I/O prefetching independently), and with one-CPU prefetching (one CPU is reserved for prefetching on behalf of the others), respectively.
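
The profiling-driven clustering step can be sketched as follows (the Jaccard-similarity threshold and greedy merging are illustrative choices, not the paper's exact method): threads whose profiled block sets overlap are grouped, so that each group can be served by one customized prefetcher thread instead of every CPU prefetching independently.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_threads(block_sets, threshold=0.3):
    """block_sets: thread -> set of disk blocks touched during profiling.
    Greedily merge each thread into the first cluster it resembles."""
    clusters = []  # list of (member_ids, union_of_blocks)
    for tid, blocks in block_sets.items():
        for members, union in clusters:
            if jaccard(blocks, union) >= threshold:
                members.append(tid)
                union |= blocks
                break
        else:
            clusters.append(([tid], set(blocks)))
    return [members for members, _ in clusters]

profile = {"t0": {1, 2, 3}, "t1": {2, 3, 4}, "t2": {9, 10}, "t3": {10, 11}}
print(cluster_threads(profile))  # -> [['t0', 't1'], ['t2', 't3']]
```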


High Performance Distributed Computing | 2010

Computation mapping for multi-level storage cache hierarchies

Mahmut T. Kandemir; Sai Prashanth Muralidhara; Mustafa Karaköy; Seung Woo Son

Improving I/O performance is an important issue for many data-intensive, large-scale parallel applications. Although storage caches are used to improve the I/O latencies of parallel applications, most prior work has focused on the management and partitioning of cache space. In particular, the compiler's role in taking advantage of multilevel storage caches has been largely unexplored. The main contribution of this paper is a shared-storage-cache-aware loop iteration distribution (iteration-to-processor mapping) scheme for I/O-intensive applications that manipulate disk-resident data sets. The proposed scheme is compiler directed and can be tuned to target any multilevel storage cache hierarchy. At the core of our scheme lies an iterative strategy that clusters loop iterations based on the underlying storage cache hierarchy and on the way the different storage caches in the hierarchy are shared by different processors. We tested this mapping scheme using a set of eight I/O-intensive application programs, and the results collected so far are promising. Our proposed scheme improves the I/O performance of the tested applications by 26.3% on average, and this improvement leads to an average 18.9% reduction in their overall execution latencies. Moreover, our scheme performs significantly better than a state-of-the-art (but storage-cache-hierarchy-agnostic) data locality optimization scheme. We also present an enhancement to our baseline implementation that performs local scheduling once the loop iteration distribution is done; applying this enhancement further improves I/O latency and total execution time, by 30.7% and 21.9%, respectively.
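
A minimal sketch of the hierarchy-aware distribution idea (the two-level topology and round-robin policies are assumptions for illustration, not the paper's iterative strategy): iterations that touch the same disk-resident stripe are steered to processors sitting under the same shared storage cache, so each stripe needs to be brought into that cache only once.

```python
# Two-level hierarchy: processors 0,1 share cache A; processors 2,3 share B.
CACHE_GROUPS = [[0, 1], [2, 3]]

def distribute(iterations):
    """iterations: list of (iter_id, stripe_id). Stripes go to cache groups
    round-robin; within a group, iterations round-robin over its processors."""
    mapping, counters = {}, {}
    for it, stripe in iterations:
        group = CACHE_GROUPS[stripe % len(CACHE_GROUPS)]
        k = counters.get(stripe, 0)
        mapping[it] = group[k % len(group)]
        counters[stripe] = k + 1
    return mapping

iters = [(i, i // 4) for i in range(16)]  # 4 iterations per stripe
print(distribute(iters))
```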


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Intra-application shared cache partitioning for multithreaded applications

Sai Prashanth Muralidhara; Mahmut T. Kandemir; Padma Raghavan

In this paper, we address the problem of partitioning a shared cache when the executing threads belong to the same application.


International Conference on Parallel Processing | 2011

Bandwidth constrained coordinated HW/SW prefetching for multicores

Sai Prashanth Muralidhara; Mahmut T. Kandemir; Yuanrui Zhang

Prefetching is a highly effective latency-hiding technique that can greatly improve application performance. However, aggressive prefetching can stress the off-chip bandwidth, and the resulting bandwidth stalls can negate the performance gains due to prefetching. In this paper, focusing on a multicore environment, we first study the comparative benefits of hardware and software prefetching and analyze whether the two are complementary or redundant; this analysis also evaluates different aggressiveness levels of hardware prefetching. Second, we weigh the positive performance benefits of prefetching against the negative performance effects of bandwidth stalls. Third, we propose a hierarchical prefetch management scheme for multicores that controls the prefetch levels so as to improve overall performance. Finally, we show that our proposed off-chip-bandwidth-aware prefetch management scheme is very effective in practice, yielding gains of up to about 10% in system throughput over a bandwidth-agnostic prefetching scheme.
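
The bandwidth-aware control loop can be sketched in a few lines (the thresholds, levels, and accuracy metric are illustrative assumptions, not the paper's controller): when the off-chip bus saturates, throttle the least accurate prefetcher first; when there is slack, let the most accurate one become more aggressive.

```python
LEVELS = ["off", "conservative", "moderate", "aggressive"]

def adjust(levels, accuracy, bw_util, high=0.9, low=0.6):
    """levels: core -> index into LEVELS; accuracy: core -> fraction of
    prefetched lines actually used; bw_util: observed bus utilization."""
    if bw_util > high:
        victim = min(accuracy, key=accuracy.get)
        levels[victim] = max(levels[victim] - 1, 0)   # throttle
    elif bw_util < low:
        winner = max(accuracy, key=accuracy.get)
        levels[winner] = min(levels[winner] + 1, len(LEVELS) - 1)  # boost
    return levels

levels = {"c0": 3, "c1": 3}
print(adjust(levels, {"c0": 0.85, "c1": 0.30}, bw_util=0.95))
# -> the inaccurate prefetcher on c1 is throttled first
```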


Computing Frontiers | 2012

Reuse distance based performance modeling and workload mapping

Sai Prashanth Muralidhara; Mahmut T. Kandemir; Orhan Kislal

Modern multicore architectures have multiple cores connected to a hierarchical cache structure, resulting in heterogeneity in cache sharing across different subsets of cores. In these systems, overall throughput and efficiency depend heavily on a careful mapping of applications to available cores. In this paper, we study the problem of application-to-core mapping with the goal of improving overall cache performance in the presence of a hierarchical, multi-level cache structure. We propose to sample the memory access patterns of individual applications, build their reuse distance distributions, and use these distributions to compute an application-to-core mapping that improves overall cache performance and, consequently, overall throughput. We show that our proposed mapping scheme is very effective in practice, yielding throughput benefits of about 39% over the worst-case mapping and about 30% over the default operating-system-based mapping. As larger chip multiprocessors with deeper cache hierarchies are projected to become the norm, we believe that efficient mapping of applications to cores will be a vital requirement for extracting the maximum possible performance from these systems.
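
For concreteness, this hedged sketch computes reuse distances from a sampled address trace and estimates misses for a given capacity as the references whose reuse distance exceeds it, a standard fully-associative stack-distance approximation (a simplification of the paper's modeling):

```python
def reuse_distances(trace):
    """Reuse distance = number of distinct addresses touched since the
    previous access to the same address (inf on first touch)."""
    stack, dists = [], []
    for addr in trace:
        if addr in stack:
            dists.append(len(stack) - 1 - stack.index(addr))
            stack.remove(addr)
        else:
            dists.append(float("inf"))
        stack.append(addr)  # most recently used at the end
    return dists

def est_misses(dists, cache_lines):
    """References whose reuse distance does not fit in the cache miss."""
    return sum(1 for d in dists if d >= cache_lines)

trace = ["a", "b", "c", "a", "b", "a", "d", "c"]
d = reuse_distances(trace)
print(d, est_misses(d, cache_lines=2))
```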


European Conference on Parallel Processing | 2010

Code scheduling for optimizing parallelism and data locality

Taylan Yemliha; Mahmut T. Kandemir; Ozcan Ozturk; Emre Kultursay; Sai Prashanth Muralidhara

As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses on only one of these problems, which can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme that considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach reduces overall execution latency significantly. We also introduce an ILP (integer linear programming) based formulation of the problem and implement the schedule obtained by the ILP solver; the results indicate that our approach gets within 4% of the ILP solution.
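
The flavor of LPG-style scheduling can be conveyed with a small greedy sketch (a simplification under assumed structures, not the paper's partitioning algorithm or its ILP formulation): each code block is placed on the processor that maximizes data reuse with already-placed neighbors, subject to a load cap.

```python
def greedy_place(nodes, edges, num_procs, cap):
    """nodes: id -> cost; edges: (u, v) -> data-reuse weight; cap: max load."""
    place, load = {}, [0] * num_procs
    for n in sorted(nodes, key=nodes.get, reverse=True):  # big blocks first
        def gain(p):
            # Reuse captured if n lands next to already-placed neighbors on p.
            return sum(w for (u, v), w in edges.items()
                       if (u == n and place.get(v) == p) or
                          (v == n and place.get(u) == p))
        candidates = [p for p in range(num_procs) if load[p] + nodes[n] <= cap]
        best = max(candidates, key=lambda p: (gain(p), -load[p]))
        place[n] = best
        load[best] += nodes[n]
    return place

nodes = {"b0": 4, "b1": 3, "b2": 3, "b3": 2}
edges = {("b0", "b1"): 5, ("b2", "b3"): 4, ("b0", "b2"): 1}
print(greedy_place(nodes, edges, num_procs=2, cap=7))
# -> the reuse pairs (b0, b1) and (b2, b3) land on the same processors
```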

Collaboration


Dive into Sai Prashanth Muralidhara's collaborations.

Top Co-Authors

Mahmut T. Kandemir, Pennsylvania State University
Yuanrui Zhang, Pennsylvania State University
Padma Raghavan, Pennsylvania State University
Seung Woo Son, Argonne National Laboratory