Utpal Banerjee
University of California, Irvine
Publications
Featured research published by Utpal Banerjee.
international conference on parallel processing | 2009
Arun Kejariwal; Alexandru Nicolau; Alexander V. Veidenbaum; Utpal Banerjee; Constantine D. Polychronopoulos
Parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time. Given this, we focus on the problem of how to efficiently schedule nested perfect/non-perfect parallel loops on emerging multi-core systems. In this regard, one of the key aspects is how to determine the profitability of parallel execution and how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of the iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel Xeon based multiprocessor using several kernels from industry-standard benchmarks.
computing frontiers | 2010
Arun Kejariwal; Milind Girkar; Xinmin Tian; Hideki Saito; Alexandru Nicolau; Alexander V. Veidenbaum; Utpal Banerjee; Constantine D. Polychronopoulos
Multi-cores such as the Intel Core 2 Duo, AMD Barcelona and IBM POWER6 are becoming ubiquitous. The number of cores and the resulting hardware parallelism is poised to increase rapidly in the foreseeable future. Nested thread-level speculative parallelization has been proposed as a means to exploit the hardware parallelism of such systems. In this paper, we present a methodology to gauge the efficacy of nested thread-level speculation with increasing levels of nesting.
Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference | 2009
Arun Kejariwal; Alexandru Nicolau; Utpal Banerjee; Alexander V. Veidenbaum; Constantine D. Polychronopoulos
The need for high performance per watt has led to the development of multi-core systems such as the Intel Core 2 Duo processor and the Intel quad-core Kentsfield processor. Maximal exploitation of the hardware parallelism supported by such systems necessitates the development of concurrent software. This, in part, entails automatic parallelization of programs and efficient mapping of the parallelized program onto the different cores. The latter affects the load balance between the different cores, which in turn has a direct impact on performance. In light of the fact that parallel loops, such as a parallel DO loop in Fortran, account for a large percentage of the total execution time, we focus on the problem of how to efficiently partition the iteration space of (possibly) nested perfect/non-perfect parallel loops. In this regard, one of the key aspects is how to efficiently capture the cache behavior, as the cache subsystem is often the main performance bottleneck in multi-core systems. In this paper, we present a novel profile-guided compiler technique for cache-aware scheduling of iteration spaces of such loops. Specifically, we propose a technique for iteration space scheduling which captures the effect of variation in the number of cache misses across the iteration space. Subsequently, we propose a general approach to capture the variation of both the number of cache misses and computation across the iteration space. We demonstrate the efficacy of our approach on a dedicated 4-way Intel® Xeon® based multiprocessor using several kernels from the industry-standard SPEC CPU2000 and CPU2006 benchmarks, achieving speedups of up to 62.5%.
Archive | 2016
Alex Aiken; Utpal Banerjee; Arun Kejariwal; Alexandru Nicolau
Kernel recognition techniques avoid the search for an appropriate initiation interval by dealing directly with a representation of the unrolled loop and its compaction. Intuitively, kernel recognition tries to achieve the effect of fully unrolling and then compacting the loop. In this chapter we present a number of kernel recognition methods that overcome the limitations of modulo scheduling in different ways, including methods that generate provably optimal schedules (in the absence of resource constraints) and handle the scheduling of tests directly, without the need for an intermediate abstraction such as hierarchical reduction.
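The "fully unroll and compact" intuition in the abstract above can be illustrated with a toy steady-state detector. This is a hypothetical sketch, not one of the chapter's methods: given the compacted schedule of an unrolled loop (a list of instruction groups, one per cycle), it returns the shortest repeating tail, i.e. a candidate kernel:

```python
def find_kernel(schedule):
    """Return the shortest suffix of `schedule` that repeats at least
    twice back-to-back: a candidate steady-state kernel of the
    unrolled-and-compacted loop."""
    n = len(schedule)
    for period in range(1, n // 2 + 1):
        # does the last `period` cycles match the `period` cycles before it?
        if schedule[-period:] == schedule[-2 * period:-period]:
            return schedule[-period:]
    return schedule  # no repetition yet: unroll further
```

Real kernel recognition must also match register names up to renaming and track loop-carried dependences; here "the schedule repeats" is deliberately literal to keep the example short.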
Archive | 2016
Alex Aiken; Utpal Banerjee; Arun Kejariwal; Alexandru Nicolau
In this chapter we trace the history of computer architecture, focusing on the evolution of techniques for instruction-level parallelism. After briefly summarizing the early years of machine design, we focus on the development of out-of-order, pipelined, and multiple-issue processors. These are further divided into processors that do instruction scheduling entirely in hardware (e.g., superscalar machines) and those that expose the instruction scheduling to the compiler, particularly VLIW machines such as the Multiflow Trace, Cydra 5, and Itanium.
Archive | 2016
Alex Aiken; Utpal Banerjee; Arun Kejariwal; Alexandru Nicolau
A basic block in a program is a sequence of consecutive operations, such that control flow enters at the beginning and leaves at the end without internal branches. While basic block scheduling is the simplest non-trivial instruction scheduling problem, it is also the most fundamental and widely used in both software and hardware implementations of instruction scheduling. This chapter introduces basic terminology used in all subsequent chapters and covers a number of different approaches to basic block scheduling.
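As a concrete illustration of the simplest form of basic block scheduling, a greedy list scheduler issues ready operations cycle by cycle, subject to dependences and a limited number of functional units. This is a generic sketch, not a method from the chapter; the dependence-map and latency-map representation is my own choice:

```python
def list_schedule(num_units, deps, latency):
    """Greedy list scheduling of a basic block's dependence DAG.
    deps: {op: set of predecessor ops}; latency: {op: cycles}.
    Returns {op: issue cycle} honoring dependences and the unit count."""
    done_at = {}            # op -> cycle at which its result is available
    scheduled = {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        # an op is ready if all predecessors have completed by this cycle
        ready = sorted(op for op in remaining
                       if all(p in done_at and done_at[p] <= cycle
                              for p in deps[op]))
        for op in ready[:num_units]:    # issue at most num_units per cycle
            scheduled[op] = cycle
            done_at[op] = cycle + latency[op]
            remaining.discard(op)
        cycle += 1
    return scheduled
```

Production schedulers replace the alphabetical tie-break with a priority heuristic (e.g. critical-path height), which is where most of the quality differences between list-scheduling variants come from.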
ACM Transactions on Programming Languages and Systems | 2011
Utpal Banerjee
Since its introduction by Joseph A. Fisher in 1979, trace scheduling has influenced much of the work on compile-time ILP (Instruction Level Parallelism) transformations. Initially developed for use in microcode compaction, it quickly became the main technique for machine-level compile-time parallelism exploitation. Although it has been used since the 1980s in many state-of-the-art compilers (e.g., Intel, Fujitsu, HP), a rigorous theory of trace scheduling is still lacking in the existing literature. This is reflected in the ad hoc way compensation code is inserted after a trace compaction, in the total absence of any attempts to measure the size of that compensation code, and so on. The aim of this article is to create a mathematical theory of the foundation of trace scheduling. We give a clear algorithm showing how to insert compensation code after a trace is replaced with its schedule, and then prove that the resulting program is indeed equivalent to the original program. We derive an upper bound on the size of that compensation code, and show that this bound can actually be attained. We also give a very simple proof that the trace scheduling algorithm always terminates.
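The bookkeeping behind compensation code can be sketched mechanically: after a trace is reordered, any operation that originally followed a side exit but now precedes it must be duplicated on that exit's off-trace edge. This toy sketch is not the article's formalism; the `is_side_exit` predicate and the string-valued operations are illustrative assumptions:

```python
def compensation_copies(trace, schedule, is_side_exit):
    """trace: original operation order; schedule: the reordered trace.
    Returns, per side exit, the ops that moved from below it to above it
    and therefore must be copied onto the off-trace edge."""
    orig = {op: i for i, op in enumerate(trace)}
    new = {op: i for i, op in enumerate(schedule)}
    copies = {}
    for br in trace:
        if not is_side_exit(br):
            continue
        # ops that were after the exit originally but are before it now
        moved = [op for op in trace
                 if orig[op] > orig[br] and new[op] < new[br]]
        if moved:
            copies[br] = moved
    return copies
```

The article's upper bound on compensation-code size corresponds to the worst case of this computation, where every operation below every side exit migrates above it.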
international conference on supercomputing | 2012
Utpal Banerjee; Kyle A. Gallivan; Gianfranco Bilardi; Manolis Katevenis
Archive | 1996
David C. Sehr; Utpal Banerjee; David Gelernter; Alexandru Nicolau; David A. Padua
Archive | 1995
Chua-Huang Huang; P. Sadayappan; Utpal Banerjee; David Gelernter; Alexandru Nicolau; David A. Padua