Sandya Srivilliputtur Mannarswamy

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Sandya Srivilliputtur Mannarswamy is active.

Explore More

Publication

Featured researches published by Sandya Srivilliputtur Mannarswamy.

symposium on code generation and optimization | 2006

Practical Structure Layout Optimization and Advice

Robert Hundt; Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti

With the delta between processor clock frequency and memory latency ever increasing and with the standard locality improving transformations maturing, compilers increasingly seek to modify an applications data layout to improve spatial and temporal locality and to reduce cache miss and page fault penalties. In this paper, we describe a practical implementation of the data layout optimizations structure splitting, structure peeling, structure field reordering and dead field removal, both for profile and non-profile based compilations. We demonstrate significant performance gains, but find that automatic transformations fail for a relatively high number of record types because of legality violations or profitability constraints. Additionally, we find a class of desirable transformations for which the framework cannot provide satisfying results. To address this issue, we complement the automatic transformations with an advisory tool. We reuse the compiler analysis done for automatic transformation and correlate its results with performance data collected during runtime for structure fields, such as data cache misses and latencies. We then use the compiler as a performance analysis and reporting tool and provide insight into how to layout structure types more efficiently.

symposium on code generation and optimization | 2007

Structure Layout Optimization for Multithreaded Programs

Easwaran Raman; Robert Hundt; Sandya Srivilliputtur Mannarswamy

Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multithreaded applications that optimizes both for improved spatial locality and reduced false sharing, simultaneously. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool on the HP-UX kernel and demonstrate the effects of these transformations for a variety of already highly hand-tuned key structures with different set of properties. We show that naive heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline

international conference on parallel architectures and compilation techniques | 2011

Making STMs Cache Friendly with Compiler Transformations

Sandya Srivilliputtur Mannarswamy; R. Govindarajan

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs. In order for STMs to be adopted widely for performance critical software, understanding and improving the cache performance of applications running on STM becomes increasingly crucial, as the performance gap between processor and memory continues to grow. In this paper, we present the most detailed experimental evaluation to date, of the cache behavior of STM applications and quantify the impact of the different STM factors on the cache misses experienced by the applications. We find that STMs are not cache friendly, with the data cache stall cycles contributing to more than 50% of the execution cycles in a majority of the benchmarks. We find that on an average, misses occurring inside the STM account for 62% of total data cache miss latency cycles experienced by the applications and the cache performance is impacted adversely due to certain inherent characteristics of the STM itself. The above observations motivate us to propose a set of specific compiler transformations targeted at making the STMs cache friendly. We find that STMs fine grained and application unaware locking is a major contributor to its poor cache behavior. Hence we propose selective Lock Data co-location (LDC) and Redundant Lock Access Removal (RLAR) to address the lock access misses. We find that even transactions that are completely disjoint access parallel, suffer from costly coherence misses caused by the centralized global time stamp updates and hence we propose the Selective Per-Partition Time Stamp (SPTS) transformation to address this. We show that our transformations are effective in improving the cache behavior of STM applications by reducing the data cache miss latency by 20.15% to 37.14% and improving execution time by 18.32% to 33.12% in five of the 8 STAMP applications.

acm sigplan symposium on principles and practice of parallel programming | 2010

Compiler aided selective lock assignment for improving the performance of software transactional memory

Sandya Srivilliputtur Mannarswamy; Dhruva R. Chakrabarti; Kaushik Rajan; Sujoy Saraswati

Atomic sections have been recently introduced as a language construct to improve the programmability of concurrent software. They simplify programming by not requiring the explicit specification of locks for shared data. Typically atomic sections are supported in software either through the use of optimistic concurrency by using transactional memory or through the use of pessimistic concurrency using compiler-assigned locks. As a software transactional memory (STM) system does not take advantage of the specific memory access patterns of an application it often suffers from false conflicts and high validation overheads. On the other hand, the compiler usually ends up assigning coarse grain locks as it relies on whole program points-to analysis which is conservative by nature. This adversely affects performance by limiting concurrency. In order to mitigate the disadvantages associated with STMs lock assignment scheme, we propose a hybrid approach which combines STMs lock assignment with a compiler aided selective lock assignment scheme (referred to as SCLA-STM). SCLA-STM overcomes the inefficiencies associated with a purely compile-time lock assignment approach by (i) using the underlying STM for shared variables where only a conservative analysis is possible by the compiler (e.g., in the presence of may-alias points to information) and (ii) being selective about the shared data chosen for the compiler-aided lock assignment. We describe our prototype SCLA-STM scheme implemented in the hp-ux IA-64 C/C++ compiler, using TL2 as our STM implementation. We show that SCLA-STM improves application performance for certain STAMP benchmarks from 1.68% to 37.13%.

international conference on parallel architectures and compilation techniques | 2006

Whole-program optimization of global variable layout

Nathaniel McIntosh; Sandya Srivilliputtur Mannarswamy; Robert Hundt

On machines with high-performance processors, the memory system continues to be a performance bottleneck. Compilers insert prefetch operations and reorder data accesses to improve locality, but increasingly seek to modify an applications data layout to reduce cache miss and page fault penalties. In this paper we discuss Global Variable Layout (GVL), an optimization of the placement of entire static global data objects in the binary. We describe two practical methods for GVL in the HP-UX Integrity optimizing compiler for the Itanium

international parallel and distributed processing symposium | 2011

Variable Granularity Access Tracking Scheme for Improving the Performance of Software Transactional Memory

Sandya Srivilliputtur Mannarswamy; R. Govindarajan

Software transactional memory (STM) has been proposed as a promising programming paradigm for shared memory multi-threaded programs as an alternative to conventional lock based synchronization primitives. Typical STM implementations employ a conflict detection scheme, which works with uniform access granularity, tracking shared data accesses either at word/cache line or at object level. It is well known that a single fixed access tracking granularity cannot meet the conflicting goals of reducing false conflicts without impacting concurrency adversely. A fine grained granularity while improving concurrency can have an adverse impact on performance due to lock aliasing, lock validation overheads, and additional cache pressure. On the other hand, a coarse grained granularity can impact performance due to reduced concurrency. Thus, in general, a fixed or uniform granularity access tracking (UGAT) scheme is application-unaware and rarely matches the access patterns of individual application or parts of an application, leading to sub-optimal performance for different parts of the application(s). In order to mitigate the disadvantages associated with UGAT scheme, we propose a Variable Granularity Access Tracking (VGAT) scheme in this paper. We propose a compiler based approach wherein the compiler uses inter-procedural whole program static analysis to select the access tracking granularity for different shared data structures of the application based on the applications data access pattern. We describe our prototype VGAT scheme, using TL2 as our STM implementation. Our experimental results reveal that VGAT-STM scheme can improve the application performance of STAMP benchmarks from 1.87% to up to 21.2%.

international conference on parallel architectures and compilation techniques | 2009

Region Based Structure Layout Optimization by Selective Data Copying

Sandya Srivilliputtur Mannarswamy; R. Govindarajan; Rishi Surendran

As the gap between processor and memory continues to grow Memory performance becomes a key performance bottleneck for many applications. Compilers therefore increasingly seek to modify an application’s data layout to improve cache locality and cache reuse. Whole program Structure Layout [WPSL] transformations can significantly increase the spatial locality of data and reduce the runtime of programs that use link-based data structures, by increasing the cache line utilization. However, in production compilers WPSL transformations do not realize the entire performance potential possible due to a number of factors. Structure layout decisions made on the basis of whole program aggregated affinity/hotness of structure fields, can be sub optimal for local code regions. WPSL is also restricted in applicability in production compilers for type unsafe languages like C/C++ due to the extensive legality checks and field sensitive pointer analysis required over the entire application. In order to overcome the issues associated with WPSL, we propose Region Based Structure Layout (RBSL) optimization framework, using selective data copying. We describe our RBSL framework, implemented in the production compiler for C/C++ on HP-UX IA-64. We show that acting in complement to the existing and mature WPSL transformation framework in our compiler, RBSL improves application performance in pointer intensive SPEC benchmarks ranging from 3% to 28% over WPSL.

international conference on parallel processing | 2010

Handling Conflicts with Compiler's Help in Software Transactional Memory Systems

Sandya Srivilliputtur Mannarswamy; R. Govindarajan

Atomic sections are supported in software through the use of optimistic concurrency by using Software Transactional Memory (STM). However STM implementations incur high overheads which reduce the wide-spread use of this approach by programmers. Conflicts are a major source of overheads in STMs. The basic performance premise of a transactional memory system is the optimistic concurrency principle wherein data updates executed by the transactions are to disjoint objects/memory locations, referred to as Disjoint Access Parallel (DAP). Otherwise, the updates conflict, and all but one of the transactions are aborted. Such aborts result in wasted work and performance degradation. While contention management systems in STM implementations try to reduce conflicts by various runtime feedback control mechanisms, they are not aware of the application’s structure and data access patterns and hence typically act after the conflicts have occurred. In this paper we propose a scheme based on compiler analysis, which can identify static atomic sections whose instances, when executed concurrently by more than one thread always conflict. Such an atomic section is referred to as Always Conflicting Atomic Section (ACAS). We propose and evaluate two techniques Selective Pessimistic Concurrency Control (SPCC) and compiler inserted Early Conflict Checks (ECC) which can help reduce the STM overheads caused by ACAS. We show that these techniques help reduce the aborts in 4 of the STAMP benchmarks by up to 27.52% while improving performance by 1.24% to 19.31%.

international conference on parallel architectures and compilation techniques | 2010

Analyzing cache performance bottlenecks of STM applications and addressing them with compiler's help

Sandya Srivilliputtur Mannarswamy; R. Govindarajan

Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its considerable overheads and its poor cache/memory performance. In this paper, we perform a detailed study of the cache behavior of STM applications and quantify the impact of different STM factors on the cache misses experienced by the applications. Based on our analysis, we propose a compiler driven Lock-Data Colocation (LDC), targeted at reducing the cache overheads on STM. We show that LDC is effective in improving the cache behavior of STM applications by reducing the dcache miss latency and improving execution time performance.

Sigplan Notices | 2006

TRICK: tracking and reusing compiler's knowledge

Sandya Srivilliputtur Mannarswamy; Shruti Doval; Hariharan Sandanagobalane; Mahesha Nanjundaiah

Compilers, during compilation, analyze the application being compiled and build up extensive knowledge about the program. This knowledge is essential for the compiler to produce correct object code. Though some part of this knowledge is retained in the generated object files as symbol table information to be used by the linker and/or debugger, most of it is discarded after the compilation is done. In this paper, we introduce the TRICK framework, which is an attempt to retain and reuse this internal information generated by the compiler as part of its program analysis, in building new tools or enhancing existing tools as well for reuse by the compiler for continuous program optimization. We present examples of how development and maintenance of various program analysis tools can be simplified by using the TRICK framework describing tools developed by our group as well as how TRICK framework can be employed in continuous program optimization by the compiler. TRICK framework can be part of both static and dynamic compilation system, though our current usage model is in the context of a static compilation system,

Explore More