Bantwal R. Rau | Researchain

Publication

Featured research published by Bantwal R. Rau.

international symposium on microarchitecture | 1993

Predictability of load/store instruction latencies

Santosh G. Abraham; Rabin A. Sugumar; Daniel Windheiser; Bantwal R. Rau; Rajiv Gupta

Due to increasing cache-miss latencies, cache control instructions are being implemented for future systems. The authors study the memory referencing behavior of individual machine-level instructions using simulations of fully-associative caches under MIN replacement. Their objective is to obtain a deeper understanding of useful program behavior that can eventually be used to optimize programs, and to motivate architectural features aimed at improving the efficacy of memory hierarchies. The simulation results show that a very small number of load/store instructions account for a majority of data cache misses. Specifically, fewer than 10 instructions account for half the misses for six out of nine SPEC89 benchmarks. Selectively prefetching data referenced by a small number of instructions identified through profiling can reduce the overall miss ratio significantly while incurring only a small number of unnecessary prefetches.
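The kind of per-instruction miss profiling described in this abstract can be illustrated with a minimal sketch. It is not the authors' simulator: it assumes a trace of hypothetical (pc, block) pairs, models a fully-associative cache with a Belady-style MIN policy (evict the resident block whose next use is farthest in the future, with no bypassing), and attributes each miss to the instruction address that caused it. The function names and the trace format are assumptions made for illustration.

```python
from collections import Counter, defaultdict, deque


def min_misses_per_instruction(trace, capacity):
    """Count misses per instruction for a fully-associative cache under MIN.

    trace: list of (pc, block) pairs (hypothetical format).
    capacity: number of blocks the cache can hold.
    """
    # For each block, record the positions at which it is referenced.
    positions = defaultdict(deque)
    for i, (_, block) in enumerate(trace):
        positions[block].append(i)

    cache = set()
    misses = Counter()
    for pc, block in trace:
        positions[block].popleft()  # the current reference is no longer "future"
        if block in cache:
            continue  # hit
        misses[pc] += 1
        if len(cache) >= capacity:
            # MIN: evict the block whose next use is farthest away,
            # treating blocks that are never used again as infinitely far.
            def next_use(b):
                return positions[b][0] if positions[b] else float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


def instructions_covering(misses, fraction=0.5):
    """Smallest number of instructions that account for `fraction` of all misses."""
    total = sum(misses.values())
    covered, count = 0, 0
    for _, n in misses.most_common():
        if covered >= fraction * total:
            break
        covered += n
        count += 1
    return count


# Tiny made-up trace: three instructions touching three blocks, two-block cache.
trace = [(100, "A"), (104, "B"), (100, "C"), (104, "A"), (100, "B"), (108, "A")]
profile = min_misses_per_instruction(trace, capacity=2)
hot = instructions_covering(profile, fraction=0.5)
```

In this toy trace the instruction at address 100 causes three of the four misses, so a profile like this would single it out as the candidate for selective prefetching.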

international symposium on microarchitecture | 1993

Dynamically scheduled VLIW processors

Bantwal R. Rau

VLIW processors are viewed as an attractive way of achieving instruction-level parallelism because of their ability to issue multiple operations per cycle with relatively simple control logic. They are also perceived as being of limited interest as products because of the problem of object code compatibility across processors having different hardware latencies and varying levels of parallelism. The author introduces the concept of delayed split-issue and the dynamic scheduling hardware which, together, solve the compatibility problem for VLIW processors and, in fact, make it possible for such processors to use all of the interlocking and scoreboarding techniques that are known for superscalar processors.

international symposium on microarchitecture | 1995

Region-based compilation: an introduction and motivation

Richard E. Hank; W.W. Hwu; Bantwal R. Rau

As the amount of instruction-level parallelism required to fully utilize VLIW and superscalar processors increases, compilers must perform increasingly aggressive analysis, optimization, parallelization and scheduling on the input programs. Traditionally, compilers have been built assuming functions as the unit of compilation. In this framework, function boundaries tend to hide valuable optimization opportunities from the compiler. Function inlining may be applied to assemble strongly coupled functions into the same compilation unit at the cost of very large function bodies. This paper introduces a new technique, called region-based compilation, where the compiler is allowed to repartition the program into more desirable compilation units. Region-based compilation allows the compiler to control problem size while exposing inter-procedural optimization and code motion opportunities.

compilers, architecture, and synthesis for embedded systems | 2000

Efficient design space exploration in PICO

Santosh G. Abraham; Bantwal R. Rau

Automated design tools must understand and exploit the hierarchical structure of large design spaces. We have developed a general methodology for decomposing system design spaces into smaller component design spaces, followed by component-level evaluation, filtering, recomposition and system-level evaluation. This methodology greatly reduces the time and cost of design space exploration, since the typical number of system-level evaluations is greatly reduced. This paper describes the application of our decomposition methodology in the context of PICO. PICO is a design space exploration system that automatically generates embedded designs consisting of a stylized processor, hardware accelerator and a cache hierarchy, each customized to a benchmark. First, PICO splits the specified system design space into smaller design spaces, one for each of the components, viz. processor, accelerator and data/instruction/unified caches. PICO further partitions each component design space into predicated design spaces, so that all designs in a predicated design space satisfy a specified predicate. PICO uses component-level evaluations to identify the performance-cost optimal component-level Pareto designs in each predicated design space. PICO generates all compositions of Pareto designs from compatible predicated design spaces and uses a system-level evaluation to identify the Pareto designs at the system level. For reasonable design spaces, PICO reduces the design exploration time by over four orders of magnitude compared to an exhaustive approach.
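The decompose, filter and recompose flow described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not PICO's implementation: each design is a hypothetical (cost, time) tuple where lower is better on both axes, the system-level evaluation is assumed to be a simple sum of component costs and times, and predicated design spaces and compatibility checks are left out. The names `pareto` and `explore` are made up for this example.

```python
from itertools import product


def pareto(designs):
    """Return the designs that no other design dominates in (cost, time)."""
    front = []
    for d in designs:
        dominated = any(
            o[0] <= d[0] and o[1] <= d[1] and o != d for o in designs
        )
        if not dominated:
            front.append(d)
    return sorted(set(front))


def explore(component_spaces):
    """Filter each component space to its Pareto set, then compose and re-filter."""
    fronts = [pareto(space) for space in component_spaces]
    # Assumed system-level evaluation: costs and times simply add up.
    systems = [
        (sum(c for c, _ in combo), sum(t for _, t in combo))
        for combo in product(*fronts)
    ]
    return pareto(systems)


# Made-up component design spaces as (cost, time) pairs.
processors = [(1, 10), (2, 6), (3, 7), (4, 3)]
caches = [(1, 5), (2, 5), (3, 2)]
system_front = explore([processors, caches])
```

Here only 3 x 2 = 6 compositions reach the system-level evaluation instead of all 4 x 3 = 12, because dominated component designs are discarded first. With a monotone evaluation such as the sum assumed here, that filtering does not lose any system-level Pareto design.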

Archive | 1999