Publication


Featured research published by Dhruva R. Chakrabarti.


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2014

Atlas: leveraging locks for non-volatile memory consistency

Dhruva R. Chakrabarti; Hans-Juergen Boehm; Kumud Bhandari

Non-volatile main memory, such as memristors or phase change memory, can revolutionize the way programs persist data. In-memory objects can themselves be persistent without the need for a separate persistent data storage format. However, the challenge is to ensure that such data remains consistent if a failure occurs during execution. In this paper, we present our system, called Atlas, which adds durability semantics to lock-based code, typically allowing us to automatically maintain a globally consistent state even in the presence of failures. We identify failure-atomic sections of code based on existing critical sections and describe a log-based implementation that can be used to recover a consistent state after a failure. We discuss several subtle semantic issues and implementation tradeoffs. We identify the need to rapidly flush CPU caches as a core implementation bottleneck and suggest partial solutions. Experimental results confirm the practicality of our approach and provide insight into the overheads of such a system.
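
The core mechanism can be pictured as follows. This is a minimal sketch of lock-delimited failure-atomicity with undo logging, not Atlas's actual API: the UndoLog type is invented for illustration, and the persistence barriers are only indicated in comments because the exact flush instructions are platform-specific.

    #include <mutex>
    #include <vector>

    struct UndoLog {
        struct Entry { void *addr; long old_val; };
        std::vector<Entry> entries;

        void record(long *addr) {
            entries.push_back({addr, *addr});   // save the old value first
            // flush the entry to NVM here (cache-line writeback + fence);
            // this per-store flushing is the bottleneck discussed above
        }
        void commit() { entries.clear(); }      // section ended consistently
    };

    std::mutex m;
    long balance_a = 100, balance_b = 0;        // imagine these live in NVM
    UndoLog ulog;

    // The existing critical section doubles as the failure-atomic section:
    // if a crash hits mid-way, recovery replays the log backwards so that
    // either both balances or neither reflect the transfer.
    void transfer(long amount) {
        std::lock_guard<std::mutex> g(m);
        ulog.record(&balance_a);
        balance_a -= amount;
        ulog.record(&balance_b);
        balance_b += amount;
        ulog.commit();
    }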


Symposium on Code Generation and Optimization | 2006

Practical Structure Layout Optimization and Advice

Robert Hundt; Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti

With the gap between processor clock frequency and memory latency ever increasing, and with the standard locality-improving transformations maturing, compilers increasingly seek to modify an application's data layout to improve spatial and temporal locality and to reduce cache-miss and page-fault penalties. In this paper, we describe a practical implementation of the data layout optimizations structure splitting, structure peeling, structure field reordering, and dead field removal, for both profile-based and non-profile-based compilations. We demonstrate significant performance gains, but find that automatic transformations fail for a relatively high number of record types because of legality violations or profitability constraints. Additionally, we find a class of desirable transformations for which the framework cannot provide satisfying results. To address this issue, we complement the automatic transformations with an advisory tool. We reuse the compiler analysis done for automatic transformation and correlate its results with performance data collected at runtime for structure fields, such as data cache misses and latencies. We then use the compiler as a performance analysis and reporting tool that provides insight into how to lay out structure types more efficiently.
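
To make the transformations concrete, here is a hand-written before/after illustrating structure splitting of hot and cold fields; the compiler performs this rewrite (and the corresponding access rewriting) automatically when legality and profitability checks pass. The types and field sizes are invented for illustration.

    #include <vector>

    // Before: hot and cold fields share cache lines, so a loop that touches
    // only `key` still drags `payload` through the cache.
    struct Record {
        int   key;            // hot: touched every iteration
        char  payload[56];    // cold: touched rarely
    };

    // After (what structure splitting/peeling produces): hot fields are
    // packed densely, fitting many more keys per cache line.
    struct RecordHot  { int  key; };
    struct RecordCold { char payload[56]; };

    long sum_keys(const std::vector<RecordHot> &hot) {
        long s = 0;
        for (const auto &r : hot) s += r.key;   // dense, cache-friendly scan
        return s;
    }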


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009

A compiler-directed data prefetching scheme for chip multiprocessors

Seung Woo Son; Mahmut T. Kandemir; Mustafa Karaköy; Dhruva R. Chakrabarti

Data prefetching has been widely used in the past as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs. The experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance with a small number of cores, its benefits diminish significantly as the number of cores is increased; that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors to degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for shared on-chip cache based CMPs. The proposed scheme first identifies program phases using static compiler analysis, and then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, prefetch overheads, and negative interactions on the shared cache space due to data prefetches, and, more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with applications from the SPEC OMP benchmark suite indicate that the proposed scheme reduces overall parallel execution latency by 18.3% over the no-prefetch case and 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those achievable using an optimal scheme.
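
A minimal sketch of the helper-thread idea, with invented names and without the phase detection or throttling a real compiler-inserted prefetcher would have; __builtin_prefetch is the GCC/Clang prefetch builtin.

    #include <atomic>
    #include <thread>
    #include <vector>

    std::vector<double> data(1 << 22, 1.0);   // array the group streams over
    std::atomic<bool> done{false};

    // Helper thread: runs ahead of the workers and touches cache lines early
    // so the shared cache is warm when the workers arrive. One helper serves
    // a whole group, so far fewer prefetches are issued overall than with
    // every core prefetching independently.
    void prefetcher(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin;
             i < end && !done.load(std::memory_order_relaxed);
             i += 8)                           // one touch per 64-byte line
            __builtin_prefetch(&data[i]);
    }

    double worker(std::size_t begin, std::size_t end) {
        double s = 0;
        for (std::size_t i = begin; i < end; ++i) s += data[i];
        return s;
    }

    int main() {
        std::thread h(prefetcher, std::size_t{0}, data.size());
        double s = worker(0, data.size());     // workers issue no prefetches
        done = true;
        h.join();
        return s < 0;                          // keep the result live
    }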


IEEE International Symposium on Workload Characterization | 2010

Analysis on semantic transactional memory footprint for hardware transactional memory

Jaewoong Chung; Dhruva R. Chakrabarti; Chi Cao Minh

We analyze various characteristics of the semantic transactional memory footprint (STMF), which consists of only the memory accesses the underlying hardware transactional memory (HTM) system has to manage for the correct execution of transactional programs. Our analysis shows that STMF can be significantly smaller than the declarative transactional memory footprint (DTMF), which contains all memory accesses within transaction boundaries (only 8.3% of DTMF in the applications examined). This result encourages processor designers and software toolchain developers to explore new design points for low-cost HTM systems and intelligent software toolchains that find and leverage STMF efficiently. We identify seven code patterns that belong to DTMF, but not to STMF, and show that they take up 91.7% of all memory accesses within transaction boundaries, on average, for the transactional programs examined. A new instruction prefix is proposed to express STMF efficiently, and existing compiler techniques are examined for their applicability to deducing STMF from DTMF. Our trace analysis shows that using STMF significantly reduces the ratio of transactions overflowing a 32KB L1 cache, from 12.80% to 2.00%, and substantially lowers the false positive probability of the Bloom filters used for transaction signature management, from 23.60% to less than 0.001%. Simulation results show that the STAMP applications with the STMF expression run 40% faster on average than those with the DTMF expression.
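
To illustrate the distinction, a hypothetical example (not taken from the paper) in which only some accesses inside a transaction are conflict-relevant; the transaction markers are placeholders, not a real HTM interface.

    // Inside a transaction, only accesses to data that other transactions
    // can conflict on need HTM tracking (STMF). Accesses to thread-private
    // storage are in DTMF but not STMF, and could be marked with the
    // proposed instruction prefix so the HTM ignores them.

    int shared_counter;                    // shared: must be tracked

    void txn_body(int *scratch /* thread-private buffer */) {
        // begin_transaction();            // hypothetical HTM marker
        int tmp = shared_counter;          // STMF: conflict-relevant read
        for (int i = 0; i < 4; ++i)
            scratch[i] = tmp + i;          // DTMF only: private, untracked
        shared_counter = tmp + 1;          // STMF: conflict-relevant write
        // end_transaction();
    }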


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2016

Makalu: fast recoverable allocation of non-volatile memory

Kumud Bhandari; Dhruva R. Chakrabarti; Hans-Juergen Boehm

Byte-addressable non-volatile memory (NVRAM) is likely to supplement, and perhaps eventually replace, DRAM. Applications can then persist data structures directly in memory instead of serializing them and storing them onto a durable block device. However, failures during execution can leave data structures in NVRAM unreachable or corrupt. In this paper, we present Makalu, a system that addresses non-volatile memory management. Makalu offers an integrated allocator and recovery-time garbage collector that maintains internal consistency, avoids NVRAM memory leaks, and is efficient, all in the face of failures. We show that a careful allocator design can support a less restrictive and much more familiar programming model than existing persistent memory allocators. Our allocator significantly reduces the per-allocation persistence overhead by lazily persisting non-essential metadata and by employing a post-failure recovery-time garbage collector. Experimental results show that the resulting online speed and scalability of our allocator are comparable to those of well-known transient allocators, and significantly better than those of state-of-the-art persistent allocators.
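
A sketch of what such an allocator's interface might look like; every name here is invented for illustration and is not Makalu's actual API, and malloc stands in for NVM-backed allocation.

    #include <cstddef>
    #include <cstdlib>

    // The key ideas: the allocation fast path persists only essential
    // metadata, named roots survive crashes, and a recovery-time GC sweeps
    // anything a crash left unreachable, so allocations that were never
    // linked into a rooted structure are not permanently leaked.
    struct PersistentHeap {
        void *roots[8] = {};

        void *alloc(std::size_t n) {
            return std::malloc(n);    // stand-in: a real version lazily
        }                             // persists non-essential metadata
        void set_root(int slot, void *p) { roots[slot] = p; }
        void recover() {
            // post-failure: trace from roots[], reclaim unreachable blocks;
            // the expensive bookkeeping happens here, not on every alloc
        }
    };

    int main() {
        PersistentHeap h;
        void *node = h.alloc(64);
        h.set_root(0, node);          // reachable from a root: survives
        void *tmp = h.alloc(64);      // if we crash before linking tmp
        (void)tmp;                    // anywhere, the recovery-time GC
        h.recover();                  // reclaims it -- no permanent leak
    }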


ACM SIGPLAN Workshop on Memory Systems Performance and Correctness | 2011

Extended sequential reasoning for data-race-free programs

Laura Effinger-Dean; Hans-Juergen Boehm; Dhruva R. Chakrabarti; Pramod G. Joisha

Most multithreaded programming languages prohibit or discourage data races. By avoiding data races, we are guaranteed that variables accessed within a synchronization-free code region cannot be modified by other threads, allowing us to reason about such code regions as though they were single-threaded. However, such single-threaded reasoning is not limited to synchronization-free regions. We present a simple characterization of extended interference-free regions in which variables cannot be modified by other threads. This characterization shows that, in the absence of data races, important code analysis problems often have surprisingly easy answers. For instance, we can use local analysis to determine when lock and unlock calls refer to the same mutex. Our characterization can be derived from prior work on safe compiler transformations, but it can also be simply derived from first principles, and justified in a very broad context. In addition, systematic reasoning about overlapping interference-free regions yields insight about optimization opportunities that were not previously apparent. We give preliminary results for a prototype implementation of interference-free regions in the LLVM compiler infrastructure. We also discuss other potential applications for interference-free regions.
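
The following example sketches the reasoning, assuming the whole program is data-race-free; the comments spell out why the interference-free region extends past the unlock.

    #include <mutex>

    std::mutex m;
    int shared_x;

    // For another thread to write shared_x without racing, its write must
    // be separated from our accesses by synchronization. So the region in
    // which shared_x cannot change extends beyond the critical section:
    // from our previous release all the way to our next acquire.
    int f() {
        m.lock();
        int a = shared_x;   // inside the critical section
        m.unlock();         // release
        int b = shared_x;   // still interference-free: no acquire yet, so
                            // in a race-free program b == a, and a compiler
                            // may reuse `a` instead of reloading
        m.lock();           // acquire: the extended region ends here
        int c = shared_x;   // may now observe another thread's update
        m.unlock();
        return a + b + c;
    }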


International Symposium on Memory Management | 2016

Persistence programming models for non-volatile memory

Hans-Juergen Boehm; Dhruva R. Chakrabarti

It is expected that DRAM memory will be augmented, and perhaps eventually replaced, by one of several up-and-coming memory technologies. These are all non-volatile, in that they retain their contents without power. This allows primary memory to be used as a fast disk replacement. It also enables more aggressive programming models that directly leverage persistence of primary memory. However, it is challenging to maintain consistency of memory in such an environment. There is no consensus on the right programming model for doing so, and subtle differences can have large, and sometimes surprising, effects on the implementation and its performance. The existing literature describes multiple programming systems that provide point solutions to selective persistence of user data structures. Real progress in this area requires a choice of programming model, which we cannot reasonably make without a real understanding of the design space. Point solutions are insufficient. We systematically explore what we consider to be the most promising part of the space, precisely defining semantics and identifying implementation costs. This allows us to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work. It also exposes some promising new design alternatives.


Symposium on Code Generation and Optimization | 2004

SYZYGY - a framework for scalable cross-module IPO

Sungdo Moon; Xinliang D. Li; Robert Hundt; Dhruva R. Chakrabarti; Luis A. Lozano; Uma Srinivasan; Shin-Ming Liu

Performing analysis across module boundaries for an entire program is important for exploiting several runtime performance opportunities. However, due to scalability problems in existing full-program analysis frameworks, such performance opportunities are only realized by paying tremendous compile-time costs. Alternative solutions, such as partial compilations or user assertions, are complicated or unsafe, and as a result not many commercial applications are compiled today with cross-module optimizations. We present SYZYGY, a practical framework for performing efficient, scalable, interprocedural optimizations. The framework is implemented in the HP-UX Itanium® compilers, and we have successfully compiled many very large applications consisting of millions of lines of code. We achieved performance improvements of up to 40% over optimization level two and compile-time improvements on the order of 100% and more compared to a previous approach.


Symposium on Code Generation and Optimization | 2006

Inline Analysis: Beyond Selection Heuristics

Dhruva R. Chakrabarti; Shin-Ming Liu

Research on procedure inlining has mainly focused on heuristics that decide whether inlining a particular call-site maximizes application performance. However, other equally important aspects of inline analysis, such as call-site analysis order, indirect effects of inlining, and selection of the most profitable version of a procedure, warrant more attention. This paper evaluates a number of different sequences in which call-sites are examined for inlining and shows that choosing the correct order is crucial to obtaining the best run-time performance. We then present a novel work-list-based, continuously updated analysis order that achieves the best results. While applying cross-module inline analysis to large applications with thousands of files and millions of lines of code, we separate the analysis from the transformation phase and allow the former to work solely on summary information in order to reduce compile time and memory consumption. A focus of this paper is to enumerate the summaries that our compiler maintains, present a technique to compute the goodness factor on which the work-list sequence is based, and describe methods to continuously update the summaries as and when a call-site is accepted for inlining. We then introduce inline specialization, a new technique that facilitates inlining into call chains selectively. The power of inline specialization lies in its ability to choose the most profitable version of the called procedure without having to maintain multiple versions at any point in time. We discuss the implementation of these techniques in the HP-UX Itanium® production compiler and present experimental results showing that a dynamic work-list-based analysis order, comprehensive summary updates, and inline specialization significantly improve application performance.
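
A toy version of the work-list-driven analysis order, with an invented goodness formula; the real compiler computes goodness from its maintained summaries and re-ranks all affected call-sites after each acceptance.

    #include <queue>
    #include <vector>

    // Call-sites are ranked by a "goodness factor" (formula invented here),
    // and the highest-ranked site is examined next. Accepting a site changes
    // the summaries -- and hence the goodness -- of the sites it affects.
    struct CallSite {
        double call_freq;      // from profile or static estimate
        int    callee_size;    // read from the callee's summary, not its IR
        double goodness() const { return call_freq / (1 + callee_size); }
        bool operator<(const CallSite &o) const {
            return goodness() < o.goodness();   // max-heap on goodness
        }
    };

    void inline_analysis(std::vector<CallSite> sites, int size_budget) {
        std::priority_queue<CallSite> wl(sites.begin(), sites.end());
        while (!wl.empty() && size_budget > 0) {
            CallSite cs = wl.top(); wl.pop();
            if (cs.callee_size > size_budget) continue;  // profitability gate
            size_budget -= cs.callee_size;               // accept the site
            // A real implementation now updates the caller's summary (its
            // size grew, new call-sites appeared) and re-queues affected
            // sites, which is what makes the order dynamic.
        }
    }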


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010

Compiler aided selective lock assignment for improving the performance of software transactional memory

Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti; Kaushik Rajan; Sujoy Saraswati

Atomic sections have recently been introduced as a language construct to improve the programmability of concurrent software. They simplify programming by not requiring the explicit specification of locks for shared data. Typically, atomic sections are supported in software either through optimistic concurrency, using transactional memory, or through pessimistic concurrency, using compiler-assigned locks. As a software transactional memory (STM) system does not take advantage of the specific memory access patterns of an application, it often suffers from false conflicts and high validation overheads. On the other hand, the compiler usually ends up assigning coarse-grain locks, as it relies on whole-program points-to analysis, which is conservative by nature. This adversely affects performance by limiting concurrency. In order to mitigate the disadvantages associated with the STM's lock assignment scheme, we propose a hybrid approach which combines the STM's lock assignment with a compiler-aided selective lock assignment scheme (referred to as SCLA-STM). SCLA-STM overcomes the inefficiencies associated with a purely compile-time lock assignment approach by (i) using the underlying STM for shared variables where only a conservative analysis is possible by the compiler (e.g., in the presence of may-alias points-to information) and (ii) being selective about the shared data chosen for compiler-aided lock assignment. We describe our prototype SCLA-STM scheme, implemented in the HP-UX IA-64 C/C++ compiler using TL2 as our STM implementation. We show that SCLA-STM improves application performance for certain STAMP benchmarks by 1.68% to 37.13%.
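
A sketch of the hybrid assignment; the STM calls are stand-ins, not TL2's real API, and the transaction markers are placeholders. The compiler gives a dedicated lock to data it can analyze precisely and routes may-alias accesses through the STM.

    #include <mutex>

    // Stand-in for the STM's write barrier -- a real barrier also does
    // conflict detection and validation.
    void stm_write(long *addr, long val) { *addr = val; }

    std::mutex hits_lock;   // compiler-assigned, fine-grained lock
    long hits;

    // The compiler proved that `hits` is reached only through this one
    // global, so it gets a selective lock: no STM instrumentation, no false
    // conflicts, no validation overhead. The pointer `p` may alias arbitrary
    // shared data, so its access stays under the STM.
    void atomic_section(long *p) {
        // txn_begin();               // hypothetical transaction start
        {
            std::lock_guard<std::mutex> g(hits_lock);
            ++hits;                   // precisely analyzable: lock, not STM
        }
        stm_write(p, hits);           // may-alias: left to the STM barrier
        // txn_commit();
    }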
