Kaushik Rajan
Microsoft
Publication
Featured research published by Kaushik Rajan.
International Symposium on Microarchitecture | 2007
Kaushik Rajan; R. Govindarajan
The inherent temporal locality in memory accesses is filtered out by the L1 cache. As a consequence, an L2 cache with LRU replacement incurs significantly more misses than the optimal replacement policy (OPT). We propose to narrow this gap through a novel replacement strategy that mimics the replacement decisions of OPT. The L2 cache is logically divided into two components, a Shepherd Cache (SC) with simple FIFO replacement and a Main Cache (MC) with an emulation of optimal replacement. The SC plays the dual role of caching lines and guiding the replacement decisions in the MC. Our proposed organization covers 40% of the gap between OPT and LRU for a 2MB cache, resulting in a 7% overall speedup. Comparison with the dynamic insertion policy, a victim buffer, a V-Way cache and an LRU-based fully associative cache demonstrates that our scheme outperforms all of these strategies.
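The mechanics of using the SC as a lookahead window are easiest to see in a toy simulator. The sketch below models a single set; the per-SC-entry reuse log standing in for the paper's bookkeeping, and all sizes, are illustrative assumptions rather than the actual design.

```python
from collections import deque

class ShepherdSet:
    """Toy model of one cache set: a FIFO Shepherd Cache (SC) in front of
    a Main Cache (MC) that approximates OPT using the reuse order observed
    while a line waits in the SC."""

    def __init__(self, sc_ways=2, mc_ways=4):
        self.sc = deque()          # FIFO of (tag, reuse_log)
        self.mc = set()            # tags resident in the main cache
        self.sc_ways, self.mc_ways = sc_ways, mc_ways
        self.time = 0

    def access(self, tag):
        self.time += 1
        hit = tag in self.mc or any(t == tag for t, _ in self.sc)
        if hit:
            # Record the first reuse seen since each SC entry was filled;
            # this stands in for OPT's knowledge of "next use".
            for _, log in self.sc:
                log.setdefault(tag, self.time)
        else:
            self._fill(tag)
        return hit

    def _fill(self, tag):
        self.sc.append((tag, {}))
        if len(self.sc) <= self.sc_ways:
            return
        victim, log = self.sc.popleft()        # oldest SC line graduates
        if len(self.mc) < self.mc_ways:
            self.mc.add(victim)
            return
        far = float("inf")
        # Mimic OPT: evict whichever resident line's observed next use is
        # farthest away (never observed counts as farthest of all).
        candidate = max(self.mc, key=lambda t: log.get(t, far))
        if log.get(candidate, far) >= log.get(victim, far):
            self.mc.discard(candidate)
            self.mc.add(victim)
        # else: the graduating line itself is the worst choice; drop it.
```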
International Symposium on Computer Architecture | 2012
R. Manikantan; Kaushik Rajan; R. Govindarajan
Effective sharing of the last-level cache has a significant influence on the overall performance of a multicore system. We observe that existing solutions control cache occupancy at a coarse granularity, do not scale well to large core counts and, in some cases, lack the flexibility to support a variety of performance goals. In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework that manages the cache occupancy of different cores at cache-block granularity by controlling their eviction probabilities. The proposed framework requires only simple hardware changes, scales to larger core counts and is flexible enough to support a variety of performance goals. We demonstrate the flexibility of PriSM by computing the eviction probabilities needed to achieve goals like hit maximization, fairness and QoS. PriSM-HitMax improves performance by 18.7% over LRU and 11.8% over previously proposed schemes on a sixteen-core machine. PriSM-Fairness improves fairness over existing solutions by 23.3%, along with a performance improvement of 19.0%. PriSM-QoS successfully achieves the desired QoS targets.
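The core arithmetic admits a compact sketch. Assuming, per the abstract, that each miss inserts one block for the missing core and evicts one block chosen by a per-core probability, solving for the probabilities that move occupancy toward a target over a window of misses gives the allocation below. Variable names and the example numbers are mine, not the paper's.

```python
import random

def eviction_probabilities(occupancy, target, miss_frac, window):
    """occupancy[i] / target[i]: current and desired blocks of core i;
    miss_frac[i]: expected share of the next `window` misses from core i.
    Each miss inserts a block for the missing core and evicts one chosen
    from core i with probability E[i]; setting
        occupancy[i] + miss_frac[i]*window - E[i]*window = target[i]
    and solving for E[i] gives the formula below (clamped, normalized)."""
    E = [max(0.0, miss_frac[i] + (occupancy[i] - target[i]) / window)
         for i in range(len(occupancy))]
    total = sum(E)
    return [e / total for e in E] if total else [1 / len(E)] * len(E)

def victim_core(E):
    # On a miss, pick which core's block gets evicted.
    return random.choices(range(len(E)), weights=E)[0]

# Example: core 0 is well over its target share, so it absorbs evictions.
print(eviction_probabilities([600, 212, 212], [341, 341, 342],
                             [0.4, 0.3, 0.3], window=128))
```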
High-Performance Computer Architecture | 2011
R. Manikantan; Kaushik Rajan; R. Govindarajan
The effectiveness of the last-level shared cache is crucial to the performance of a multi-core system. In this paper, we observe and make use of the DelinquentPC Next-Use characteristic to improve shared cache performance. We propose a new PC-centric cache organization, NUcache, for the shared last-level cache of multi-cores. NUcache logically partitions the associative ways of a cache set into MainWays and DeliWays. While all lines have access to the MainWays, only lines brought in by a subset of delinquent PCs, selected by a PC selection mechanism, are allowed to enter the DeliWays. The PC selection mechanism is an intelligent cost-benefit algorithm that uses Next-Use information to select the set of PCs that maximizes the hits experienced in the DeliWays. Performance evaluation reveals that NUcache improves performance over a baseline design by 9.6%, 30% and 33% for dual-, quad- and eight-core workloads composed of SPEC benchmarks, respectively. We also show that NUcache is more effective than other well-known cache-partitioning algorithms.
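A greedy stand-in for the PC selection step might look like the following. The cost-benefit rule here (credit a PC one hit per line whose next use falls within the DeliWays' effective capacity) is a simplification of the paper's algorithm, and the interface is invented.

```python
def select_delinquent_pcs(next_use_dists, capacity, k):
    """next_use_dists: {pc: [distance, in intervening fills, to each of its
    lines' next use]}. A line parked in the DeliWays becomes a hit only if
    it is reused before roughly `capacity` later fills push it out, so each
    PC is scored by the hits it would contribute, and the top-k PCs win
    access to the DeliWays."""
    def benefit(pc):
        return sum(1 for d in next_use_dists[pc] if d <= capacity)
    return sorted(next_use_dists, key=benefit, reverse=True)[:k]

# Example: PC 0x40a re-touches its lines quickly, so it earns DeliWay space.
print(select_delinquent_pcs({0x40a: [3, 5, 2], 0x7f1: [40, 52]},
                            capacity=8, k=1))
```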
Programming Language Design and Implementation | 2011
Abhishek Udupa; Kaushik Rajan; William Thies
For decades, compilers have relied on dependence analysis to determine the legality of their transformations. While this conservative approach has enabled many robust optimizations, when it comes to parallelization there are many opportunities that can only be exploited by changing or reordering the dependences in the program. This paper presents Alter: a system for identifying and enforcing parallelism that violates certain dependences while preserving overall program functionality. Based on programmer annotations, Alter exploits new parallelism in loops by reordering iterations or allowing stale reads. Alter can also infer which annotations are likely to benefit the program by using a test-driven framework. Our evaluation of Alter demonstrates that it uncovers parallelism beyond the reach of existing static and dynamic tools. Across a selection of 12 performance-intensive loops, 9 of which have loop-carried dependences, Alter obtains an average speedup of 2.0x on 4 cores.
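To make the test-driven idea concrete, here is a hypothetical mimic in Python rather than Alter's actual implementation: an "annotation" permits out-of-order iterations, and it is accepted only if the observable result on test inputs matches the sequential run.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sequential(body, items):
    state = {}
    for x in items:
        body(state, x)
    return state

def run_reordered(body, items, workers=4):
    # "Annotation" in effect: iterations may interleave and run out of
    # order, deliberately violating any loop-carried dependences.
    state = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda x: body(state, x), items))
    return state

def annotation_profitable(body, test_inputs):
    # Test-driven inference (sketch): keep the annotation only if the
    # result matches the sequential semantics on every test input.
    return all(run_sequential(body, xs) == run_reordered(body, xs)
               for xs in test_inputs)

# Example: an order-insensitive loop body passes the check.
print(annotation_profitable(lambda s, x: s.__setitem__(x, x * x),
                            [range(100)]))
```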
Asian Symposium on Programming Languages and Systems | 2009
Rupesh Nasre; Kaushik Rajan; R. Govindarajan; Uday P. Khedker
Context-sensitive points-to analysis is critical for several program optimizations. However, as the number of contexts grows exponentially, the storage requirements of the analysis increase tremendously for large programs, making the analysis non-scalable. We propose a scalable flow-insensitive, context-sensitive, inclusion-based points-to analysis that uses a specially designed multi-dimensional Bloom filter to store the points-to information. Two key observations motivate our proposal: (i) points-to information (between pointer and object, and between pointer and pointer) is sparse, and (ii) moving from an exact to an approximate representation of points-to information only leads to reduced precision without affecting the correctness of the (may-points-to) analysis. By using an approximate representation, a multi-dimensional Bloom filter can significantly reduce the memory requirements with a probabilistic bound on the loss in precision. Experimental evaluation on SPEC 2000 benchmarks and two large open-source programs reveals that, with an average storage requirement of 4MB, our approach achieves almost the same precision (98.6%) as the exact implementation. By increasing the average memory to 27MB, it achieves precision of up to 99.7% on these benchmarks. Using Mod/Ref analysis as the client, we find that the client analysis is often unaffected even when there is some loss of precision in the points-to representation: the NoModRef percentage is within 2% of the exact analysis while requiring 4MB (maximum 15MB) of memory and less than 4 minutes on average for the points-to analysis. Another major advantage of our technique is that it allows trading off precision for the memory usage of the analysis.
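A minimal sketch of the idea, with the paper's multiple dimensions collapsed into a single bit vector for brevity; the hash choice and sizes are assumptions.

```python
import hashlib

class BloomPointsTo:
    """Single bit-vector stand-in for the paper's multi-dimensional Bloom
    filter. A false positive only adds a spurious may-points-to fact, so
    the analysis stays sound; shrinking `bits` trades precision for memory,
    which is exactly the knob the paper exposes."""

    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, pointer, ctx, obj):
        for i in range(self.hashes):
            key = f"{i}|{pointer}|{ctx}|{obj}".encode()
            h = hashlib.blake2b(key, digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.bits

    def add(self, pointer, ctx, obj):
        for p in self._positions(pointer, ctx, obj):
            self.bitmap[p // 8] |= 1 << (p % 8)

    def may_point_to(self, pointer, ctx, obj):
        return all(self.bitmap[p // 8] & (1 << (p % 8))
                   for p in self._positions(pointer, ctx, obj))

pts = BloomPointsTo()
pts.add("p", "main->foo", "heap_obj_3")
print(pts.may_point_to("p", "main->foo", "heap_obj_3"))   # True
print(pts.may_point_to("q", "main->foo", "heap_obj_3"))   # almost surely False
```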
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010
Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti; Kaushik Rajan; Sujoy Saraswati
Atomic sections have recently been introduced as a language construct to improve the programmability of concurrent software. They simplify programming by not requiring the explicit specification of locks for shared data. Typically, atomic sections are supported in software either through optimistic concurrency, using transactional memory, or through pessimistic concurrency, using compiler-assigned locks. As a software transactional memory (STM) system does not take advantage of the specific memory access patterns of an application, it often suffers from false conflicts and high validation overheads. On the other hand, the compiler usually ends up assigning coarse-grain locks, as it relies on whole-program points-to analysis, which is conservative by nature. This adversely affects performance by limiting concurrency. To mitigate the disadvantages of both approaches, we propose a hybrid scheme which combines STM with a compiler-aided selective lock assignment scheme (referred to as SCLA-STM). SCLA-STM overcomes the inefficiencies associated with a purely compile-time lock assignment approach by (i) using the underlying STM for shared variables where only a conservative analysis is possible by the compiler (e.g., in the presence of may-alias points-to information) and (ii) being selective about the shared data chosen for the compiler-aided lock assignment. We describe our prototype SCLA-STM scheme, implemented in the HP-UX IA-64 C/C++ compiler using TL2 as our STM implementation. We show that SCLA-STM improves application performance for certain STAMP benchmarks by 1.68% to 37.13%.
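The division of labor can be sketched as a dispatch: variables the compiler analyzed precisely get their own locks, and anything with only may-alias information falls back to the STM. The stand-in below models the STM with a single lock purely for illustration (a real TL2 runtime is well beyond a sketch), and all names are invented.

```python
import threading

# Variables the compiler could analyze precisely get dedicated locks;
# everything else goes through the "STM", modeled here by one fallback lock.
precise_locks = {"counter": threading.Lock(), "head": threading.Lock()}
stm_fallback = threading.Lock()

def atomic_section(shared_vars, body):
    precise = sorted(v for v in shared_vars if v in precise_locks)
    imprecise = [v for v in shared_vars if v not in precise_locks]
    # Acquire compiler-assigned locks in a global (sorted) order so that
    # concurrent atomic sections cannot deadlock.
    for v in precise:
        precise_locks[v].acquire()
    try:
        if imprecise:
            with stm_fallback:        # may-alias data: defer to the STM
                return body()
        return body()
    finally:
        for v in reversed(precise):
            precise_locks[v].release()

state = {"counter": 0}
atomic_section(["counter"],
               lambda: state.__setitem__("counter", state["counter"] + 1))
print(state["counter"])               # 1
```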
Principles of Distributed Computing | 2012
Jose M. Faleiro; Sriram K. Rajamani; Kaushik Rajan; G. Ramalingam; Kapil Vaswani
Lattice agreement is a key decision problem in distributed systems. In this problem, processes start with input values from a lattice, and must learn (non-trivial) values that form a chain. Unlike consensus, which is impossible in the presence of even a single process failure, lattice agreement has been shown to be decidable in the presence of failures. In this paper, we consider lattice agreement problems in asynchronous, message passing systems. We present an algorithm for the lattice agreement problem that guarantees liveness as long as a majority of the processes are non-faulty. The algorithm has a time complexity of O(N) message delays, where N is the number of processes. We then introduce the generalized lattice agreement problem, where each process receives a (potentially unbounded) sequence of values from an infinite lattice and must learn a sequence of increasing values such that the union of all learnt sequences is a chain and every proposed value is eventually learnt. We present a wait-free algorithm for solving generalized lattice agreement. The algorithm guarantees that every value received by a correct process is learnt in O(N) message delays. We show that this algorithm can be used to implement a class of replicated state machines where (a) commands can be classified as reads and updates, and (b) all update commands commute. This algorithm can be used to realize serializable and linearizable replicated versions of commonly used data types.
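A toy, failure-free rendition of the proposer/acceptor exchange over the powerset lattice (join = union) shows the shape of the algorithm; the real protocol is asynchronous and message-passing, and the accept/decide rules below are my paraphrase of it, not the paper's pseudocode.

```python
def acceptor_receive(acceptor, proposal):
    """Acceptor rule (sketch): accept the join of what it holds and the
    proposal; ACK when the proposal already subsumed the accepted value."""
    ack = acceptor["value"] <= proposal          # subset test = lattice order
    acceptor["value"] |= proposal
    return ("ack", proposal) if ack else ("nack", set(acceptor["value"]))

def propose(value, acceptors, majority):
    """Retry with ever-larger joins until a majority ACKs; values learned
    by successive proposals then come out comparable (a chain)."""
    while True:
        replies = [acceptor_receive(a, value) for a in acceptors]
        if sum(tag == "ack" for tag, _ in replies) >= majority:
            return value
        value = value.union(*(v for _, v in replies))

acceptors = [{"value": set()} for _ in range(5)]
print(propose({1}, acceptors, majority=3))   # -> {1}
print(propose({2}, acceptors, majority=3))   # -> {1, 2}, comparable with {1}
```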
International Conference on Supercomputing | 2005
Kaushik Rajan; R. Govindarajan
As network traffic continues to increase and with the requirement to process packets at line rates, high-performance routers need to forward millions of packets every second. Even with an efficient lookup algorithm like the LC-trie, each packet needs up to 5 memory accesses. Earlier work shows that a single cache for the nodes of an LC-trie can reduce the number of external memory accesses. We observe that the locality characteristics of the level-one nodes of an LC-trie are significantly different from those of lower-level nodes. Hence, we propose a heterogeneously segmented cache architecture (HSCA) which uses separate caches for level-one and lower-level nodes, each with carefully chosen sizes. We further improve the hit rate of the level-one nodes cache by introducing a weight-based replacement policy and an intelligent index bit selection scheme. To evaluate our cache scheme with realistic traces, we propose a synthetic trace generation method which emulates real traces and can generate traces with varying locality characteristics. The base HSCA scheme gives us up to a 16% reduction in misses over the unified scheme. The optimizations further enhance this improvement to up to 25% for core router traces.
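The weight-based replacement for the level-one cache can be sketched as follows. The weight function (how "popular" a level-one node is) is assumed to be given, e.g. precomputed from the routing table, and the structure is simplified to a single set.

```python
class WeightedL1Set:
    """One set of the level-one node cache with weight-based replacement:
    on a conflict, the resident node with the smallest weight is evicted,
    unless the incoming node weighs even less (then it is bypassed)."""

    def __init__(self, ways, weight_of):
        self.ways, self.weight_of = ways, weight_of
        self.lines = set()

    def access(self, node):
        if node in self.lines:
            return True                       # hit
        if len(self.lines) < self.ways:
            self.lines.add(node)
        else:
            victim = min(self.lines, key=self.weight_of)
            if self.weight_of(victim) < self.weight_of(node):
                self.lines.remove(victim)
                self.lines.add(node)
        return False                          # miss either way

weights = {"n1": 10, "n2": 3, "n3": 7}        # assumed per-node popularity
s = WeightedL1Set(2, weights.get)
for n in ["n2", "n3", "n1"]:                  # n1 displaces the lighter n2
    s.access(n)
print(s.lines)                                # {'n3', 'n1'}
```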
Symposium on Cloud Computing | 2016
Kaushik Rajan; Dharmesh Kakadia; Carlo Curino; Subru Krishnan
Query Optimization focuses on finding the best query execution plan, given fixed hardware resources. In BigData settings, both pay-as-you-go clouds and on-prem shared clusters, a complementary challenge emerges, Resource Optimization: find the best hardware resources, given an execution plan. In this world, provisioning is almost instantaneous and time-varying resources can be acquired on a per-query basis. This allows us to optimize allocations for completion time, resource usage, dollar cost, etc. These optimizations have a huge impact on performance and cost, and pivot around a core challenge: faithful resource-to-performance models for arbitrary BigData queries. This task is challenging for users and tools alike due to the lack of good statistics (high-velocity, unstructured data), the frequent use of UDFs, the performance impact of different hardware types, and a lack of understanding of parallel execution at such a scale. We address this with PerfOrator, a novel approach to resource-to-performance modeling. PerfOrator employs nonlinear regression on profile runs to model arbitrary UDFs, calibration queries to generalize across hardware platforms, and analytical framework models to account for parallelism. The resulting estimates are orders of magnitude more accurate than existing approaches (e.g., Hive's optimizer), and have been successfully employed in two resource optimization scenarios: 1) optimizing the provisioning of clusters in cloud settings, with decisions within 1% of optimal, and 2) reserving a skyline of resources for SLA jobs, with accuracies over 10x better than human experts.
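The regression step can be illustrated with SciPy: fit an assumed cost-model shape to a handful of small profile runs, then extrapolate to the full input. The model form and all numbers below are illustrative, not from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def stage_time(n, a, b, c):
    # Assumed cost-model shape: fixed overhead + linear scan + sort-like term.
    return a + b * n + c * n * np.log2(np.maximum(n, 2))

# Profile runs on small input samples (sizes in MB, times in seconds).
sizes = np.array([64.0, 128.0, 256.0, 512.0])
times = np.array([3.1, 6.0, 12.4, 26.1])

params, _ = curve_fit(stage_time, sizes, times)
full_size = 64_000.0                          # extrapolate to a 64 GB input
print(f"predicted full-scale time: {stage_time(full_size, *params):.0f} s")
```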
International Conference on Parallel Architectures and Compilation Techniques | 2006
Kaushik Rajan; R. Govindarajan
Packet forwarding is a memory-intensive application requiring multiple accesses through a trie structure. The efficiency of a cache for this application critically depends on the placement function to reduce conflict misses. Traditional placement functions use a one-level mapping that naively partitions trie nodes into cache sets. However, as a significant percentage of trie nodes are not useful, these schemes suffer from a non-uniform distribution of useful nodes to sets. This in turn results in increased conflict misses. Newer organizations such as variable-associativity caches achieve flexibility in placement at the expense of increased hit latency. This makes them unsuitable for L1 caches. We propose a novel two-level mapping framework that retains the hit latency of one-level mapping yet incurs fewer conflict misses. This is achieved by introducing a second-level mapping which reorganizes the nodes in the naive initial partitions into refined partitions with a near-uniform distribution of nodes. Further, as this remapping is accomplished by simply adapting the index bits to a given routing table, the hit latency is not affected. We propose three new schemes which result in up to a 16% reduction in the number of misses and a 13% speedup in memory access time. In comparison, an XOR-based placement scheme, known to perform extremely well for general-purpose architectures, obtains up to a 2% speedup in memory access time.
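The adapt-the-index-bits idea admits a brute-force sketch: among candidate address bits, pick the combination that spreads the useful trie nodes most evenly across sets. The paper's scheme is more refined; this version simply minimizes the fullest set.

```python
import random
from itertools import combinations

def pick_index_bits(addresses, num_bits, candidates):
    """Choose the combination of address bits whose induced set index
    spreads the nodes most evenly (here: smallest maximum set occupancy).
    Exhaustive over `candidates`, so keep that range small."""
    def max_occupancy(bits):
        counts = {}
        for a in addresses:
            s = 0
            for i, b in enumerate(bits):
                s |= ((a >> b) & 1) << i
            counts[s] = counts.get(s, 0) + 1
        return max(counts.values())
    return min(combinations(candidates, num_bits), key=max_occupancy)

# Example: 12-bit node addresses, choose 4 index bits -> 16 sets.
addrs = [random.getrandbits(12) for _ in range(2000)]
print(pick_index_bits(addrs, num_bits=4, candidates=range(12)))
```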