Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Sanjeev Banerjia is active.

Publications


Featured research published by Sanjeev Banerjia.


programming language design and implementation | 2000

Dynamo: a transparent dynamic optimization system

Vasanth Bala; Evelyn Duesterwald; Sanjeev Banerjia

We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system, or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HP-UX 10.20 operating system.
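The core mechanism the abstract describes can be sketched as an interpret-count-optimize loop with a fragment cache. This is a minimal illustrative model, not Dynamo's actual implementation: the class and function names and the hot-trace threshold are invented for the sketch.

```python
HOT_THRESHOLD = 3  # hypothetical counter value at which a trace is deemed hot


class FragmentCache:
    """Toy model of a Dynamo-style fragment cache: hot traces are optimized
    once and reused; cold code is interpreted and profiled."""

    def __init__(self):
        self.fragments = {}  # trace start address -> optimized fragment
        self.counters = {}   # trace start address -> execution count

    def execute(self, start_addr, interpret, optimize):
        # Fast path: a cached fragment runs directly, bypassing interpretation.
        if start_addr in self.fragments:
            return self.fragments[start_addr]()
        # Slow path: interpret the trace and count how often it starts here.
        self.counters[start_addr] = self.counters.get(start_addr, 0) + 1
        trace = interpret(start_addr)
        if self.counters[start_addr] >= HOT_THRESHOLD:
            # Hot: pay the optimization cost once; future executions hit the cache.
            self.fragments[start_addr] = optimize(trace)
        return trace[-1]  # next start address


# Placeholder interpreter/optimizer for demonstration purposes only.
def interpret(addr):
    return [addr, addr + 1]  # a "trace" of two instruction addresses


def optimize(trace):
    return lambda: trace[-1]  # an "optimized fragment" returning the exit address


fc = FragmentCache()
for _ in range(5):
    nxt = fc.execute(0x100, interpret, optimize)
```

After the third execution the trace at `0x100` crosses the threshold and subsequent calls take the cached fast path, which mirrors why the overhead is amortized only for frequently executed code.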


international symposium on microarchitecture | 1998

Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures

Emre Özer; Sanjeev Banerjia; Thomas M. Conte

Recently, there has been a trend towards clustered microarchitectures to reduce the cycle time for wide issue microprocessors. In such processors, the register file and functional units are partitioned and grouped into clusters. Instruction scheduling for a clustered machine requires assignment and scheduling of operations to the clusters. In this paper, a new scheduling algorithm named unified-assign-and-schedule (UAS) is proposed for clustered, statically-scheduled architectures. UAS merges the cluster assignment and instruction scheduling phases in a natural and straightforward fashion. We compared the performance of UAS with various heuristics to the well-known Bottom-up Greedy (BUG) algorithm and to an optimal cluster scheduling algorithm, measuring the schedule lengths produced by all of the schedulers. Our results show that UAS gives better performance than the BUG algorithm and is quite close to optimal.
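The key idea of UAS, choosing a cluster and a cycle for each operation in a single list-scheduling pass, can be sketched as follows. This is a simplified illustration under invented assumptions (unit latency, one inter-cluster copy cost, and no modeling of per-cycle resource conflicts), not the paper's full heuristic set.

```python
def uas_schedule(ops, deps, n_clusters, copy_latency=1):
    """Unified assign-and-schedule sketch.

    ops: operation ids in priority (list-scheduling) order.
    deps: dict mapping an op to the ops it depends on.
    Returns a dict op -> (cluster, cycle).
    Resource conflicts within a cycle are ignored for brevity.
    """
    placement = {}
    for op in ops:
        best = None
        for c in range(n_clusters):
            # Earliest cycle respecting dependences; a value produced on a
            # different cluster needs an extra inter-cluster copy.
            ready = 0
            for p in deps.get(op, []):
                pc, pcycle = placement[p]
                lat = 1 + (copy_latency if pc != c else 0)
                ready = max(ready, pcycle + lat)
            if best is None or ready < best[1]:
                best = (c, ready)
        placement[op] = best
    return placement


deps = {"b": ["a"], "c": ["a"]}
placement = uas_schedule(["a", "b", "c"], deps, n_clusters=2)
```

Because assignment and scheduling are merged, the copy penalty is visible at the moment the cluster is chosen; a phase-ordered approach like BUG must commit to an assignment before the schedule reveals that cost.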


international symposium on microarchitecture | 1996

Instruction fetch mechanisms for VLIW architectures with compressed encodings

Thomas M. Conte; Sanjeev Banerjia; Sergei Y. Larin; Kishore N. Menezes; Sumedh W. Sathaye

VLIW architectures use very wide instruction words in conjunction with high bandwidth to the instruction cache to achieve multiple instruction issue. This report uses the TINKER experimental testbed to examine instruction fetch and instruction cache mechanisms for VLIWs. A compressed instruction encoding for VLIWs is defined and a classification scheme for i-fetch hardware for such an encoding is introduced. Several interesting cache and i-fetch organizations are described and evaluated through trace-driven simulations. A new i-fetch mechanism using a silo cache is found to have the best performance.
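One common way to compress wide VLIW words, a presence mask plus only the non-NOP operations, can be sketched as below. This is a generic illustration of the concept, not the specific TINKER encoding or silo-cache mechanism from the paper.

```python
NOP = 0  # hypothetical encoding for an empty issue slot


def compress(wide_word):
    """Drop NOP slots; keep a bitmask recording which slots were occupied."""
    mask = 0
    ops = []
    for i, op in enumerate(wide_word):
        if op != NOP:
            mask |= 1 << i
            ops.append(op)
    return mask, ops


def expand(mask, ops, width):
    """At fetch time, reinsert NOPs to recover the full-width instruction word."""
    word, it = [], iter(ops)
    for i in range(width):
        word.append(next(it) if mask & (1 << i) else NOP)
    return word


word = [5, 0, 7, 0]          # a 4-slot word with two real operations
mask, ops = compress(word)
restored = expand(mask, ops, 4)
```

The i-fetch hardware trade-off the paper studies follows directly: compressed words save cache space and memory bandwidth, but the expansion step must sit somewhere on the fetch path.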


high-performance computer architecture | 1998

Treegion scheduling for wide issue processors

William A. Havanki; Sanjeev Banerjia; Thomas M. Conte

Instruction scheduling is one of the most important phases of compilation for high-performance processors. A compiler typically divides a program into multiple regions of code and then schedules each region. Many past efforts have focused on linear regions such as traces and superblocks. The linearity of these regions can limit speculation, leading to under-utilization of processor resources, especially on wide-issue machines. A type of non-linear region called a treegion is presented in this paper. The formation and scheduling of treegions takes into account multiple execution paths, and the larger scope of treegions allows more speculation, leading to higher utilization and better performance. Multiple scheduling heuristics for treegions are compared against scheduling for several types of linear regions. Empirical results illustrate that instruction scheduling using treegions (treegion scheduling) holds promise. Treegion scheduling using the global weight heuristic outperforms the next highest performing region type, superblocks, by up to 20%.
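The region-formation step can be sketched as follows: grow a region from a root block along successor edges, and start a new region at any merge point (a block with more than one predecessor), so that every region is a tree of basic blocks. This is an illustrative sketch of the general idea, not the paper's exact formation algorithm.

```python
def form_treegions(cfg, entry):
    """Partition a CFG into treegions.

    cfg: dict mapping each basic block to its list of successors.
    A block with more than one predecessor (a merge point) becomes the
    root of a new treegion, so each region has a single entry and no joins.
    """
    preds = {}
    for b, succs in cfg.items():
        for s in succs:
            preds.setdefault(s, []).append(b)

    treegions, seen, roots = [], set(), [entry]
    while roots:
        root = roots.pop(0)
        if root in seen:
            continue
        region, stack = [], [root]
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            region.append(b)
            for s in cfg.get(b, []):
                if len(preds.get(s, [])) > 1:
                    roots.append(s)   # merge point: root of a new treegion
                else:
                    stack.append(s)   # single predecessor: stays in this tree
        treegions.append(region)
    return treegions


# A diamond CFG: the join block D must start its own treegion.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
regions = form_treegions(cfg, "A")
```

Because the first region contains both sides of the branch (`B` and `C`), a scheduler working on it can speculate operations from either path above the branch, which is exactly the extra freedom linear regions lack.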


european conference on parallel processing | 1997

Treegion Scheduling for Highly Parallel Processors

Sanjeev Banerjia; William A. Havanki; Thomas M. Conte

Instruction scheduling is a compile-time technique for extracting parallelism from programs for statically scheduled instruction level parallel processors. Typically, an instruction scheduler partitions a program into regions and then schedules each region. One style of region represents a program as a set of decision trees or treegions. The non-linear nature of the treegion allows scheduling across multiple paths. This paper presents such a technique, termed treegion scheduling. The results of experiments comparing treegion scheduling to scheduling for basic blocks and across “simple linear regions” show that treegion scheduling outperforms the other techniques.


international conference on parallel architectures and compilation techniques | 1998

A fast interrupt handling scheme for VLIW processors

Emre Özer; Sumedh W. Sathaye; Kishore N. Menezes; Sanjeev Banerjia; Matthew D. Jennings; Thomas M. Conte

Interrupt handling in out-of-order execution processors requires complex hardware schemes to maintain the sequential state. The amount of hardware will be substantial in VLIW architectures due to the nature of issuing a very large number of instructions in each cycle. It is hard to implement precise interrupts in out-of-order execution machines, especially in VLIW processors. In this paper, we will apply the reorder buffer with future file and the history buffer methods to a VLIW platform, and present a novel scheme, called the current-state buffer, which employs modest hardware with compiler support. Unlike the other interrupt handling schemes, the current-state buffer does not keep history state, result buffering or bypass mechanisms. It is a fast interrupt handling scheme with a relatively small buffer that records the execution and exception status of operations. It is suitable for embedded processors that require a fast interrupt handling mechanism with modest hardware.
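A toy model of the current-state buffer idea, recording only per-operation execution and exception status rather than result values or history state, is sketched below. The field names and interface are invented for illustration; the hardware scheme in the paper is more detailed.

```python
class CurrentStateBuffer:
    """Toy current-state buffer: tracks only whether each in-flight operation
    has completed and whether it raised an exception. No results or history
    state are buffered, which is what keeps the structure small."""

    def __init__(self, size):
        self.size = size
        self.entries = {}  # op id -> {"done": bool, "exception": ...}

    def issue(self, op_id):
        # A full buffer would stall issue in hardware.
        assert len(self.entries) < self.size, "buffer full: stall issue"
        self.entries[op_id] = {"done": False, "exception": None}

    def complete(self, op_id, exception=None):
        self.entries[op_id]["done"] = True
        self.entries[op_id]["exception"] = exception

    def take_interrupt(self):
        """On an interrupt, report the operations that had not completed;
        with compiler support these can simply be re-executed afterward."""
        pending = [op for op, e in self.entries.items() if not e["done"]]
        self.entries.clear()
        return pending


csb = CurrentStateBuffer(size=4)
for op in (1, 2, 3):
    csb.issue(op)
csb.complete(1)
pending = csb.take_interrupt()
```

The contrast with a reorder buffer is visible in what is *not* stored: no result values, no old register contents, and no bypass paths, just a status bit per operation.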


ieee multi chip module conference | 1996

Issues in partitioning integrated circuits for MCM-D/flip-chip technology

Sanjeev Banerjia; Alan Glaser; Christoforos Harvatis; Steve Lipa; Real Pomerleau; Toby Schaffer; Andrew Stanaski; Yusuf Tekmen; Grif Bilbro; Paul D. Franzon

In order to successfully partition a high performance large monolithic chip onto MCM-D/flip-chip-solder-bump technology, a number of key issues must be addressed. These include the following: (1) partitioning a single clock-cycle path across the chip boundary; (2) the ability to use off-the-shelf memories; (3) using the MCM for power, ground, and clock distribution; and (4) managing test costs. This paper presents a discussion of these issues, using a CPU as an example, and speculates on some interesting possibilities arising from partitioning.


IEEE Transactions on Computers | 1998

MPS: miss-path scheduling for multiple-issue processors

Sanjeev Banerjia; Sumedh W. Sathaye; Kishore N. Menezes; Thomas M. Conte

Many contemporary multiple issue processors employ out-of-order scheduling hardware in the processor pipeline. Such scheduling hardware can yield good performance without relying on compile-time scheduling. The hardware can also schedule around unexpected run-time occurrences such as cache misses. As issue widths increase, however, the complexity of such scheduling hardware increases considerably and can have an impact on the cycle time of the processor. This paper presents the design of a multiple issue processor that uses an alternative approach called miss path scheduling or MPS. Scheduling hardware is removed from the processor pipeline altogether and placed on the path between the instruction cache and the next level of memory. Scheduling is performed at cache miss time as instructions are received from memory. Scheduled blocks of instructions are issued to an aggressively clocked in-order execution core. Details of a hardware scheduler that can perform speculation are outlined and shown to be feasible. Performance results from simulations are presented that highlight the effectiveness of an MPS design.
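The central MPS idea, moving the scheduler off the pipeline and onto the cache-miss path so its cost is paid only on misses, can be sketched as below. The cache model and the `schedule` function are simplified placeholders, not the hardware scheduler described in the paper.

```python
class MissPathCache:
    """Toy miss-path-scheduling i-cache: the scheduler sits between the
    i-cache and the next level of memory, so scheduling work happens only
    when a block is fetched on a miss. Hits deliver already-scheduled
    blocks straight to an in-order execution core."""

    def __init__(self, memory, schedule):
        self.memory = memory      # addr -> raw instruction block
        self.schedule = schedule  # raw block -> scheduled block
        self.cache = {}           # addr -> scheduled block
        self.misses = 0

    def fetch(self, addr):
        if addr not in self.cache:
            self.misses += 1
            # Miss path: scheduling happens here, off the pipeline's fast path.
            self.cache[addr] = self.schedule(self.memory[addr])
        # Hit path: no scheduling hardware is exercised at all.
        return self.cache[addr]


# "Scheduling" stands in as a simple reordering for the sake of the sketch.
mpc = MissPathCache({0: [3, 1, 2]}, schedule=sorted)
first = mpc.fetch(0)
second = mpc.fetch(0)
```

Because the scheduled block is cached, repeated fetches amortize the one-time scheduling cost, which is why the design can feed an aggressively clocked in-order core without per-cycle scheduling hardware.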


IEEE Transactions on Components, Packaging, and Manufacturing Technology: Part B | 1995

System design optimization for MCM-D/flip-chip

Paul D. Franzon; Andrew Stanaski; Yusuf Tekmen; Sanjeev Banerjia

Many performance/cost advantages can be gained if a chip-set is optimally redesigned to take advantage of the high wire density, fast interconnect delays, and high pin-counts available in MCM-D/flip-chip technology. Examples are given showing for what conditions the cost of the system can be reduced through chip partitioning and how the performance/cost of a computer core can be increased.


international symposium on microarchitecture | 1996

A persistent rescheduled-page cache for low overhead object code compatibility in VLIW architectures

Thomas M. Conte; Sumedh W. Sathaye; Sanjeev Banerjia

Object-code compatibility between processor generations is an open issue for VLIW architectures. A potential solution is a technique termed dynamic rescheduling, which performs run-time software rescheduling at the first-time page faults. The time required for rescheduling the pages constitutes a large portion of the overhead of this method. A disk caching scheme that uses a persistent rescheduled-page cache (PRC) is presented. The scheme reduces the overhead associated with dynamic rescheduling by saving rescheduled pages on disk, across program executions. Operating system support is required for dynamic rescheduling and management of the PRC. The implementation details for the PRC are discussed. Results of simulations used to gauge the effectiveness of PRC indicate that: the PRC is effective in reducing the overhead of dynamic rescheduling; and due to different overhead requirements of programs, a split PRC organization performs better than a unified PRC. The unified PRC was studied for two different page replacement policies: LRU and overhead-based replacement. It was found that with LRU replacement, all the programs consistently perform better with increasing PRC sizes, but the high-overhead programs take a consistent performance hit compared to the low-overhead programs. With overhead-based replacement, the performance of high-overhead programs improves substantially, while the low-overhead programs perform only slightly worse than in the case of the LRU replacement.
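The PRC mechanism, checking an on-disk cache of already-rescheduled pages at first-touch page-fault time and falling back to rescheduling (with LRU eviction, one of the two policies studied), can be sketched as follows. The in-memory dict stands in for the on-disk store, and `reschedule` is a placeholder for the dynamic rescheduling pass.

```python
from collections import OrderedDict


class PRC:
    """Toy persistent rescheduled-page cache with LRU replacement.
    A PRC hit avoids re-running the expensive rescheduling pass for a
    page that was already rescheduled in an earlier program execution."""

    def __init__(self, capacity, reschedule):
        self.capacity = capacity
        self.reschedule = reschedule
        self.store = OrderedDict()   # page id -> rescheduled page
        self.hits = self.misses = 0

    def fault(self, page_id, page):
        if page_id in self.store:
            self.hits += 1
            self.store.move_to_end(page_id)   # refresh LRU position
            return self.store[page_id]
        self.misses += 1
        scheduled = self.reschedule(page)     # the overhead the PRC amortizes
        self.store[page_id] = scheduled
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used page
        return scheduled


# "Rescheduling" is modeled as a trivial transformation for the sketch.
prc = PRC(capacity=1, reschedule=lambda p: list(reversed(p)))
out = prc.fault(1, ["x", "y"])   # miss: reschedule and store
prc.fault(1, ["x", "y"])         # hit: served from the PRC
prc.fault(2, ["z"])              # miss: evicts page 1 (capacity 1)
prc.fault(1, ["x", "y"])         # miss again: page 1 was evicted
```

The last fault shows the effect the paper measures under LRU: once capacity pressure evicts a page, its rescheduling overhead must be paid again, which is why high-overhead programs are sensitive to the replacement policy.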

Collaboration


Dive into Sanjeev Banerjia's collaboration.

Top Co-Authors

Thomas M. Conte (Georgia Institute of Technology)

Kishore N. Menezes (North Carolina State University)

Paul D. Franzon (North Carolina State University)

Yusuf Tekmen (North Carolina State University)

Alan Glaser (North Carolina State University)

Andrew Stanaski (North Carolina State University)

Toby Schaffer (North Carolina State University)