Sanket Tavarageri | Researchain

Publication

Featured research published by Sanket Tavarageri.

symposium on code generation and optimization | 2010

Parameterized tiling revisited

Muthu Manikandan Baskaran; Albert Hartono; Sanket Tavarageri; Thomas Henretty; J. Ramanujam; P. Sadayappan

Tiling, a key transformation for optimizing programs, has been widely studied in the literature. Parameterized tiled code is important for auto-tuning systems, since they often execute a large number of runs with dynamically varied tile sizes. Previous work on tiled code generation has addressed parameterized tiling for the sequential context, and the parallel case with fixed compile-time constants for tile sizes. In this paper, we revisit the problem of generating tiled code using parametric tile sizes. We develop a systematic approach to formulate tiling transformations through manipulation of linear inequalities and develop a novel approach to overcoming the fundamental obstacle faced by previous approaches regarding generation of parallel parameterized tiled code. To the best of our knowledge, the approach proposed in this paper is the first compile-time solution to the problem of parallel parameterized code generation for affine imperfectly nested loops. Experimental results demonstrate the effectiveness of the implemented system.
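As a rough illustration of what "parametric tile sizes" means, the sketch below tiles a simple rectangular two-dimensional loop nest with tile sizes that are ordinary run-time arguments rather than compile-time constants. The function name `tiled_iterations` and its parameters are hypothetical, and the sketch covers only the sequential rectangular case; it is not the paper's code generator, which targets parallel execution of affine imperfectly nested loops.

```python
def tiled_iterations(n, m, ti, tj):
    """Yield every (i, j) in an n x m iteration space in tiled order.

    ti and tj are the tile sizes. They are plain run-time parameters,
    so the same code can be re-run with different sizes without being
    regenerated. (Hypothetical helper, rectangular loops only.)
    """
    if ti <= 0 or tj <= 0:
        raise ValueError("tile sizes must be positive")
    # Outer loops enumerate tile origins.
    for it in range(0, n, ti):
        for jt in range(0, m, tj):
            # Inner loops scan one tile; min() clips partial tiles
            # at the boundary of the iteration space.
            for i in range(it, min(it + ti, n)):
                for j in range(jt, min(jt + tj, m)):
                    yield (i, j)


if __name__ == "__main__":
    # Same loop nest, two different tile sizes chosen at run time.
    print(list(tiled_iterations(2, 4, 2, 2)))
    print(list(tiled_iterations(2, 4, 1, 4)))
```

Every iteration is visited exactly once whatever the tile sizes are; only the visiting order changes, which is what makes tile size a tunable parameter.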

ieee international conference on high performance computing, data, and analytics | 2011

Dynamic selection of tile sizes

Sanket Tavarageri; Louis-Noël Pouchet; J. Ramanujam; Atanas Rountev; P. Sadayappan

Tiling is a key program transformation to achieve effective data reuse. But the performance of tiled programs can vary considerably with different tile sizes. Hence the selection of good tile sizes is crucial. Although there has been considerable research on analytical models for selecting tile sizes, they have not been shown to be effective in finding optimal tile sizes across a range of programs and target architectures. Auto-tuning is a viable alternative that is often used in practice, and involves the execution of different combinations of tile sizes in a systematic fashion to find the best ones. But this is sometimes infeasible — for instance when the program is to be run on unknown platforms (e.g., cloud environments). We propose a novel approach for generating code to enable dynamic tile size selection, based on monitoring the performance of a few loop iterations. The selection operates at run time on the “production” run, without any a priori knowledge of the execution environment. We discuss the theory and implementation of a parametric tiled code generator that enables run-time tile size tuning and describe a search strategy to determine effective tile sizes. Experimental results demonstrate the effectiveness of the approach.
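The idea of choosing a tile size during the production run, by timing a few iterations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the search strategy from the paper: the kernel (a tiled matrix transpose), the candidate list, the `sample_rows` parameter, and all function names are hypothetical.

```python
import time


def process_rows(a, out, row_start, row_end, tile):
    """Transpose rows [row_start, row_end) of a into out, tile by tile."""
    n_cols = len(a[0])
    for it in range(row_start, row_end, tile):
        for jt in range(0, n_cols, tile):
            for i in range(it, min(it + tile, row_end)):
                for j in range(jt, min(jt + tile, n_cols)):
                    out[j][i] = a[i][j]


def transpose_with_runtime_tuning(a, candidates=(4, 8, 16), sample_rows=8):
    """Transpose a, picking a tile size by timing slices of the real work.

    Each candidate tile size processes a small slice of the actual
    computation; the fastest one (time per row) handles the remainder.
    No work is repeated or thrown away.
    """
    n_rows, n_cols = len(a), len(a[0])
    out = [[None] * n_rows for _ in range(n_cols)]
    row = 0
    best_tile, best_rate = candidates[0], float("inf")
    for tile in candidates:
        if row >= n_rows:
            break
        end = min(row + sample_rows, n_rows)
        start_time = time.perf_counter()
        process_rows(a, out, row, end, tile)
        rate = (time.perf_counter() - start_time) / (end - row)
        if rate < best_rate:
            best_tile, best_rate = tile, rate
        row = end
    # Finish the remaining rows with the best tile size observed so far.
    process_rows(a, out, row, n_rows, best_tile)
    return out, best_tile


if __name__ == "__main__":
    matrix = [[i * 100 + j for j in range(64)] for i in range(64)]
    result, chosen = transpose_with_runtime_tuning(matrix)
    print("chosen tile size:", chosen)
```

The result is the same whichever tile size wins; the timing only affects how the remaining iterations are scheduled, so the measurement happens on useful work rather than in a separate tuning run.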

programming language design and implementation | 2014

Compiler-assisted detection of transient memory errors

Sanket Tavarageri; Sriram Krishnamoorthy; P. Sadayappan

The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. First, this may be costly to employ on commodity computing platforms, and second, even on high-end systems, protection against multi-bit errors may be lacking. Therefore, augmenting hardware error detection schemes with software techniques is of considerable interest. In this paper, we consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms to instrument application programs with checksum computation codes to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that with limited hardware support, overheads of error detection can be further reduced.
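One way to picture checksum tracking across production and consumption of values is the toy scheme below. It is an assumption-laden sketch, not the paper's compile-time algorithm: a "definition" checksum accumulates each produced value multiplied by its expected number of reads, a "use" checksum accumulates each value actually read, and a mismatch signals that a value changed while it sat in memory. The class `ChecksumTracker` and the function `checked_sum_of_squares` are hypothetical names.

```python
class ChecksumTracker:
    """Toy def/use checksum pair (hypothetical, for illustration only)."""

    def __init__(self):
        self.def_sum = 0  # contribution recorded when values are produced
        self.use_sum = 0  # contribution recorded when values are consumed

    def produce(self, value, expected_uses):
        # A compiler pass would know (or compute) how often the value is read.
        self.def_sum += value * expected_uses
        return value

    def consume(self, value):
        self.use_sum += value
        return value

    def consistent(self):
        return self.def_sum == self.use_sum


def checked_sum_of_squares(data, corrupt_index=None):
    """Compute sum(x*x) with instrumented reads; optionally inject a bit flip."""
    tracker = ChecksumTracker()
    # Each element is read twice in the loop below, hence expected_uses=2.
    memory = [tracker.produce(x, 2) for x in data]
    if corrupt_index is not None:
        memory[corrupt_index] ^= 1 << 3  # simulated transient bit flip
    total = 0
    for k in range(len(memory)):
        total += tracker.consume(memory[k]) * tracker.consume(memory[k])
    return total, tracker.consistent()


if __name__ == "__main__":
    print(checked_sum_of_squares([1, 2, 3]))
    print(checked_sum_of_squares([1, 2, 3], corrupt_index=1))
```

In the fault-free run the two checksums agree; with the injected bit flip the reads see a different value than was produced, so the final comparison fails and the error is detected.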

international conference on parallel processing | 2014

PWCET: Power-Aware Worst Case Execution Time Analysis

Wenlei Bao; Sanket Tavarageri; Füsun Özgüner; P. Sadayappan

Worst case execution time (WCET) analysis is used to verify that real-time tasks on systems can be executed without violating any timing constraints. Power consumption is not considered in most WCET research. However, real-time embedded systems have limited power resources, and the leakage power consumed by the cache can account for up to 40% of the total power consumption. Because of the nature of data reuse in programs, not all of the cache may be needed during execution. Traditional WCET analysis, which derives the WCET bound by assuming that the entire cache is utilized, results in higher power dissipation. In this paper, we propose a Power-aware WCET analysis framework (PWCET) to improve energy efficiency by reducing cache size with no penalty on the WCET result. An algorithm based on compiler analysis calculates the useful cache size and thereby determines how much non-useful cache can be turned off to save power. Experiments on benchmarks demonstrate the effectiveness of the PWCET framework in obtaining considerable power savings without any effect on WCET bounds.

ieee international conference on high performance computing data and analytics | 2013

Adaptive parallel tiled code generation and accelerated auto-tuning

Sanket Tavarageri; J. Ramanujam; P. Sadayappan

Tiling is an important program transformation that is often used to enhance cache locality and to obtain coarse-grained parallelism. In this paper, we address the problem of generating adaptive parametric tiled code for parallel execution contexts; in other words, generating parallel tiled code in which tile sizes can be changed on the fly during execution. Changing tile sizes during pipelined parallel execution of tiles presents the following fundamental code-generation challenge: the unscanned iteration space may become non-convex. We develop novel solutions for the adaptive parallel tiled code generation problem. Using adaptive tiling, auto-tuning for tile size selection can be accelerated: in a single run of the tiled code, several tile sizes may be tested for their performance, thereby expediting auto-tuning. Adaptive tiling is also useful in scenarios where tile sizes need to be dynamically altered to adapt to changing execution environments, such as dynamically resized caches for power savings. Experimental evaluation on a number of benchmarks demonstrates the effectiveness of the developed approach.

international parallel and distributed processing symposium | 2016

Architecting and Programming a Hardware-Incoherent Multiprocessor Cache Hierarchy

Wooil Kim; Sanket Tavarageri; P. Sadayappan; Josep Torrellas

New architectures for extreme-scale computing need to be designed for higher energy efficiency than current systems. One recently proposed extreme-scale manycore radically simplifies the architecture, and proposes a cluster-based on-chip memory hierarchy without hardware cache coherence. To program for such an environment, this paper proposes two approaches. They use shared-memory programming either inside clusters only, or both inside and across clusters. Both approaches rely on ISA support for writeback and self-invalidation operations. Our simulation results show that hardware-incoherent cache hierarchies with our support deliver reasonable performance for applications that were not written for such hierarchies. Specifically, for execution within a cluster, the average execution time of the applications is 2% higher than with hardware cache coherence; for execution across multiple clusters, it is 5% higher than with hardware cache coherence. This is accomplished with minimal hardware support.
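The role of writeback and self-invalidation operations in a hierarchy without hardware coherence can be illustrated with a small simulation. This is a sketch under simplifying assumptions (two private caches, a dictionary as shared memory, word-sized cache lines), and the class and function names are hypothetical; it does not model the ISA or the programming approaches evaluated in the paper.

```python
class IncoherentCache:
    """Private cache with no hardware coherence (toy model)."""

    def __init__(self, memory):
        self.memory = memory  # shared backing store: address -> value
        self.lines = {}       # locally cached copies
        self.dirty = set()    # addresses written locally, not yet written back

    def read(self, addr):
        # A hit returns the local copy even if memory has since changed.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def writeback(self):
        """Push locally modified lines out to shared memory."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def self_invalidate(self):
        """Drop clean lines so later reads refetch from shared memory."""
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


def run_producer_consumer(insert_coherence_ops):
    memory = {"x": 0}
    producer = IncoherentCache(memory)
    consumer = IncoherentCache(memory)
    consumer.read("x")       # consumer now holds a copy of the old value
    producer.write("x", 42)  # update is visible only in the producer's cache
    if insert_coherence_ops:
        producer.writeback()        # before the synchronization point
        consumer.self_invalidate()  # after the synchronization point
    return consumer.read("x")


if __name__ == "__main__":
    print(run_producer_consumer(True), run_producer_consumer(False))
```

With the two operations placed around the synchronization point the consumer observes the new value; without them it keeps reading its stale copy, which is the hazard that software-managed coherence has to prevent.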

ieee high performance extreme computing conference | 2015

Automatic cluster parallelization and minimizing communication via selective data replication

Sanket Tavarageri; Benoît Meister; Muthu Manikandan Baskaran; Benoît Pradelle; Thomas Henretty; Athanasios Konstantinidis; Ann Johnson; Richard Lethin

Technology scaling has initiated two distinct trends that are likely to continue into the future: first, increased parallelism in hardware, and second, the increasing performance and energy cost of communication relative to computation. Both trends call for the development of compiler and runtime systems that automatically parallelize programs and reduce communication in parallel computations, in order to achieve the desired high performance in an energy-efficient fashion. In this paper, we propose the design of an integrated compiler and runtime system that auto-parallelizes loop nests to clusters, and a novel communication avoidance method that reduces data movement between processors. Communication minimization is achieved via data replication: data is replicated so that a larger share of the whole data set may be mapped to a processor and, hence, non-local memory accesses are reduced. Experiments on a number of benchmarks show the effectiveness of the approach.

ieee international conference on high performance computing data and analytics | 2016

Compiler Support for Software Cache Coherence

Sanket Tavarageri; Wooil Kim; Josep Torrellas; P. Sadayappan

The advent of multi-core processors with a large number of cores and heterogeneous architecture poses challenges for achieving scalable cache coherence. Several recent research efforts have focused on simplifying or abandoning hardware cache coherence protocols. However, this adds a significant burden on the programmer, unless automated compiler support is developed. In this paper, we develop compiler support for parallel systems that delegate the task of maintaining cache coherence to software. Algorithms to automatically insert software cache coherence instructions into parallel applications are presented. This frees the programmer from having to manually insert coherence annotations, which can be tedious and error-prone. Experimental evaluation over a number of benchmarks demonstrates that effective compiler techniques can make software cache coherence competitive with hardware coherence schemes both in terms of energy and performance.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013