Publication


Featured research published by B. Ramakrishna Rau.


International Symposium on Microarchitecture | 1994

Iterative module scheduling: an algorithm for software pipelining loops

B. Ramakrishna Rau

Modulo scheduling is a framework within which a wide variety of algorithms and heuristics may be defined for software pipelining innermost loops. This paper presents a practical algorithm, iterative modulo scheduling, that is capable of dealing with realistic machine models. This paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
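
The schedule-quality versus expense trade-off above can be illustrated with a much-simplified modulo-scheduling sketch in Python. It places operations in dependence order against a modulo reservation table for a single hypothetical resource (issue slots) and simply retries at a larger initiation interval (II) when placement fails; the paper's iterative algorithm instead evicts and re-schedules conflicting operations within a budget, which is omitted here. The machine model, function names, and example loop are illustrative assumptions, not the paper's.

import math
from collections import defaultdict

def modulo_schedule(ops, deps, latency, slots_per_cycle, max_ii=64):
    """ops: op ids in topological (dependence) order.
    deps: (producer, consumer) edges within one loop iteration.
    latency: dict op -> result latency in cycles.
    Returns (ii, {op: issue cycle}) or None if nothing is found up to max_ii."""
    preds = defaultdict(list)
    for p, c in deps:
        preds[c].append(p)

    # Resource-constrained lower bound on the II (one issue slot per op).
    ii = max(1, math.ceil(len(ops) / slots_per_cycle))
    while ii <= max_ii:
        sched, mrt = {}, defaultdict(int)   # mrt[row] = slots used in cycle row (mod ii)
        feasible = True
        for op in ops:
            # Earliest cycle honoring all intra-iteration dependences.
            est = max((sched[p] + latency[p] for p in preds[op]), default=0)
            # Try the ii distinct modulo rows starting at est.
            for t in range(est, est + ii):
                if mrt[t % ii] < slots_per_cycle:
                    sched[op], mrt[t % ii] = t, mrt[t % ii] + 1
                    break
            else:
                feasible = False            # no free row at this II
                break
        if feasible:
            return ii, sched
        ii += 1                             # give up at this II and try a larger one
    return None

# Example: a four-operation dependence chain on a 2-issue machine.
print(modulo_schedule(["a", "b", "c", "d"],
                      [("a", "b"), ("b", "c"), ("c", "d")],
                      {x: 1 for x in "abcd"}, slots_per_cycle=2))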


The Journal of Supercomputing | 1993

Instruction-level parallel processing: history, overview, and perspective

B. Ramakrishna Rau; Joseph A. Fisher

Instruction-level parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel. Although ILP has appeared in the highest performance uniprocessors for the past 30 years, the 1980s saw it become a much more significant force in computer design. Several systems were built and sold commercially, which pushed ILP far beyond where it had been before, both in terms of the amount of ILP offered and in the central role ILP played in the design of the system. By the end of the decade, advanced microprocessor design at all major CPU manufacturers had incorporated ILP, and new techniques for ILP had become a popular topic at academic conferences. This article provides an overview and historical perspective of the field of ILP and its development over the past three decades.


International Journal of Parallel Programming | 1996

Iterative Modulo Scheduling

B. Ramakrishna Rau

Modulo scheduling is a framework within which algorithms for software pipelining innermost loops may be defined. The framework specifies a set of constraints that must be met in order to achieve a legal modulo schedule. A wide variety of algorithms and heuristics can be defined within this framework. Little work has been done to evaluate and compare alternative algorithms and heuristics for modulo scheduling from the viewpoints of schedule quality as well as computational complexity. This, along with a vague and unfounded perception that modulo scheduling is computationally expensive as well as difficult to implement, has inhibited its incorporation into product compilers. This paper presents iterative modulo scheduling, a practical algorithm that is capable of dealing with realistic machine models. The paper also characterizes the algorithm in terms of the quality of the generated schedules as well as the computational expense incurred.
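
The "set of constraints that must be met" yields the well-known lower bound on the initiation interval, MII = max(ResMII, RecMII); the small Python sketch below computes both bounds for a toy dependence graph. The resource classes, the edge encoding (latency and iteration distance), and the positive-cycle check via a longest-path recurrence are illustrative choices made for this sketch, not the paper's implementation.

import math

def res_mii(uses, counts):
    """uses[r]: ops per iteration needing resource class r; counts[r]: units of r."""
    return max(math.ceil(uses[r] / counts[r]) for r in uses)

def rec_mii(n, edges, max_ii=64):
    """edges: (src, dst, latency, iteration_distance). RecMII is the smallest II for
    which no dependence cycle has total latency exceeding II times its total distance,
    detected here as the absence of a positive cycle under weights latency - II*distance."""
    for ii in range(1, max_ii + 1):
        NEG = float("-inf")
        d = [[NEG] * n for _ in range(n)]
        for s, t, lat, dist in edges:
            d[s][t] = max(d[s][t], lat - ii * dist)
        for k in range(n):                      # Floyd-Warshall, longest-path variant
            for i in range(n):
                for j in range(n):
                    if d[i][k] > NEG and d[k][j] > NEG:
                        d[i][j] = max(d[i][j], d[i][k] + d[k][j])
        if all(d[v][v] <= 0 for v in range(n)):
            return ii
    return max_ii

# Example: three 2-cycle ops on a recurrence carried across one iteration.
edges = [(0, 1, 2, 0), (1, 2, 2, 0), (2, 0, 2, 1)]
print(max(res_mii({"alu": 3}, {"alu": 2}), rec_mii(3, edges)))   # 6: the recurrence dominates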


IEEE Computer | 2002

PICO: automatically designing custom computers

Vinod Kathail; Shail Aditya; Robert Schreiber; B. Ramakrishna Rau; Darren C. Cronquist; Mukund Sivaraman

The paper discusses the PICO (program in, chip out) project, a long-range HP Labs research effort that aims to automate the design of optimized, application-specific computing systems - thus enabling the rapid and cost-effective design of custom chips when no adequately specialized, off-the-shelf design is available. PICO research takes a systematic approach to the hierarchical design of complex systems and advances technologies for automatically designing custom nonprogrammable accelerators and VLIW processors. While skeptics often assume that automated design must emulate human designers who invent new solutions to problems, PICO's approach is to automatically pick the most suitable designs from a well-engineered space of designs. Such automation of embedded computer design promises an era of yet more growth in the number and variety of innovative smart products by lowering the barriers of design time, designer availability, and design cost.
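
The idea of picking "the most suitable designs from a well-engineered space of designs" can be sketched in Python as Pareto filtering of candidate design points; the cost and performance numbers below are invented placeholders rather than PICO's actual models or outputs.

from typing import List, NamedTuple

class Design(NamedTuple):
    name: str
    gates: int     # hardware cost (lower is better)
    cycles: int    # kernel execution time (lower is better)

def pareto_frontier(space: List[Design]) -> List[Design]:
    """Keep only designs that no other design beats in both cost and performance."""
    return [d for d in space
            if not any(o.gates <= d.gates and o.cycles <= d.cycles and o != d for o in space)]

space = [Design("1-proc", 10_000, 400), Design("2-proc", 19_000, 210),
         Design("4-proc", 37_000, 110), Design("4-proc-wide", 52_000, 115)]
print(pareto_frontier(space))   # "4-proc-wide" is dominated by "4-proc" and dropped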


Architectural Support for Programming Languages and Operating Systems | 1992

Sentinel scheduling for VLIW and superscalar processors

Scott A. Mahlke; William Y. Chen; Wen-mei W. Hwu; B. Ramakrishna Rau; Michael S. Schlansker

Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to accurately detect and report all program execution errors at the time of occurrence. In this paper, a set of architectural features and compile-time scheduling support referred to as sentinel scheduling is introduced. Sentinel scheduling provides an effective framework for compiler-controlled speculative execution that accurately detects and reports all exceptions. Sentinel scheduling also supports speculative execution of store instructions by providing a store buffer which allows probationary entries. Experimental results show that sentinel scheduling is highly effective for a wide range of VLIW and superscalar processors.
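
The sentinel mechanism can be pictured with a small Python model, assuming a deferred-exception tag propagated through registers: a speculative operation that would fault writes the tag instead of trapping, and the first non-speculative use of the result (the sentinel) reports the exception. This is a conceptual stand-in, not the paper's ISA extensions.

class DeferredException(Exception):
    pass

POISON = object()   # deferred-exception tag carried in a register

def spec_load(memory, addr):
    """Speculative load: never traps; produces the tag on a bad address."""
    return memory.get(addr, POISON) if isinstance(addr, int) else POISON

def sentinel_use(value, pc):
    """Non-speculative consumer: only here does the exception become visible."""
    if value is POISON:
        raise DeferredException(f"deferred exception reported at sentinel, pc={pc}")
    return value

memory = {0x100: 42}
r1 = spec_load(memory, 0x100)     # load hoisted above the branch that guards it
r2 = spec_load(memory, "bogus")   # would have faulted; the fault is deferred
print(sentinel_use(r1, pc=12))    # 42
try:
    sentinel_use(r2, pc=16)
except DeferredException as e:
    print(e)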


Signal Processing Systems | 2002

PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators

Robert Schreiber; Shail Aditya; Scott A. Mahlke; Vinod Kathail; B. Ramakrishna Rau; Darren C. Cronquist; Mukund Sivaraman

The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is a part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator. The main objective of PICO-NPA is to reduce design cost and time, without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide range of implementations with scalable performance from a single specification. In an experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators. Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.
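
The performance contract described above (a processor count and a cycle count per inner-loop iteration) amounts to a throughput target of processors/cycles iterations per clock; the short Python sketch below enumerates configurations meeting a hypothetical rate. The workload and clock numbers are made-up examples, not PICO-NPA results.

import math

def npa_configs(iters_per_task, tasks_per_sec, clock_hz, max_procs=4):
    """Yield (processors, cycles_per_iteration) pairs meeting the required rate,
    where achieved throughput in iterations per cycle is procs / cycles_per_iter."""
    need = iters_per_task * tasks_per_sec / clock_hz    # required iterations per cycle
    for p in range(1, max_procs + 1):
        t = math.floor(p / need)                        # slowest schedule that still meets the rate
        if t >= 1:
            yield p, t

for p, t in npa_configs(iters_per_task=2_000_000, tasks_per_sec=30, clock_hz=200_000_000):
    print(f"{p} processor(s) at {t} cycle(s) per inner-loop iteration")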


Application-Specific Systems, Architectures, and Processors | 2000

High-level synthesis of nonprogrammable hardware accelerators

Robert Schreiber; Shail Aditya; B. Ramakrishna Rau; Vinod Kathail; Scott A. Mahlke; Santosh G. Abraham; Greg Snider

The PICO-N system automatically synthesizes embedded nonprogrammable accelerators to be used as co-processors for functions expressed as loop nests in C. The output is synthesizable VHDL that defines the accelerator at the register transfer level (RTL). The system generates a synchronous array of customized VLIW (very long instruction word) processors, their controller, local memory, and interfaces. The system also modifies the user's application software to make use of the generated accelerator. The user indicates the throughput to be achieved by specifying the number of processors and their initiation interval. In experimental comparisons, PICO-N designs are slightly more costly than hand-designed accelerators with the same performance.


ACM Transactions on Computer Systems | 1993

Sentinel scheduling: a model for compiler-controlled speculative execution

Scott A. Mahlke; William Y. Chen; Roger A. Bringmann; Richard E. Hank; Wen-mei W. Hwu; B. Ramakrishna Rau; Michael S. Schlansker

Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to efficiently handle exceptions for speculative instructions. In this article, a set of architectural features and compile-time scheduling support collectively referred to as sentinel scheduling is introduced. Sentinel scheduling provides an effective framework for both compiler-controlled speculative execution and exception handling. All program exceptions are accurately detected and reported in a timely manner with sentinel scheduling. Recovery from exceptions is also ensured with the model. Experimental results show the effectiveness of sentinel scheduling for exploiting instruction-level parallelism and the overhead associated with exception handling.
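
The recovery model can likewise be sketched in a few lines of Python, under the assumption that a faulting speculative instruction records its origin in the deferred-exception tag and that the sentinel triggers non-speculative re-execution of that range, so the exception surfaces precisely. The instruction encoding and register file here are invented for the sketch.

class Poison:
    def __init__(self, origin):
        self.origin = origin            # index of the excepting speculative instruction

def run(program, regs, start=0, speculative=True):
    """program: list of (dest, fn, src_regs, is_speculative) tuples."""
    i = start
    while i < len(program):
        dest, fn, srcs, spec = program[i]
        vals = [regs[s] for s in srcs]
        tags = [v for v in vals if isinstance(v, Poison)]
        if tags:
            if spec:
                regs[dest] = tags[0]                     # keep propagating the tag
            else:
                # Sentinel: re-execute from the excepting instruction, non-speculatively.
                return run(program, regs, start=tags[0].origin, speculative=False)
        else:
            try:
                regs[dest] = fn(*vals)
            except Exception:
                if spec and speculative:
                    regs[dest] = Poison(i)               # defer instead of trapping
                else:
                    raise                                # precise exception during recovery
        i += 1
    return regs

mem = {}
program = [
    ("r1", lambda a: mem[a], ["r0"], True),    # speculative load, will fault
    ("r2", lambda a: a + 1, ["r1"], False),    # non-speculative use acts as the sentinel
]
try:
    run(program, {"r0": 0x200})
except KeyError as e:
    print("exception re-raised precisely during recovery:", e)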


Programming Language Design and Implementation | 1993

Reverse If-Conversion

Nancy J. Warter; Scott A. Mahlke; Wen-mei W. Hwu; B. Ramakrishna Rau

In this paper we present a set of isomorphic control transformations that allow the compiler to apply local scheduling techniques to acyclic subgraphs of the control flow graph. Thus, the code motion complexities of global scheduling are eliminated. This approach relies on a new technique, Reverse If-Conversion (RIC), that transforms scheduled If-Converted code back to the control flow graph representation. This paper presents the predicate internal representation, the algorithms for RIC, and the correctness of RIC. In addition, the scheduling issues are addressed and an application to software pipelining is presented.
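
A minimal Python sketch of the reverse transformation, assuming an if-converted schedule represented as (predicate, operation) pairs: consecutive operations under the same predicate are regrouped into a block that is branched around unless the predicate holds. Real reverse if-conversion handles interleaved and nested predicates through the predicate internal representation; this toy handles only simple runs.

def reverse_if_convert(schedule):
    """schedule: (predicate, op) pairs in final schedule order; predicate None means
    always executed. Returns (guard, block) pairs in original control-flow form."""
    blocks, cur_pred, cur_block = [], object(), []
    for pred, op in schedule:
        if pred != cur_pred:                # predicate change: close the current block
            if cur_block:
                blocks.append((cur_pred, cur_block))
            cur_pred, cur_block = pred, []
        cur_block.append(op)
    if cur_block:
        blocks.append((cur_pred, cur_block))
    return blocks

# An if-converted, scheduled sequence with some operations guarded by predicate p.
schedule = [(None, "t = a[i]"), ("p", "t = t * 2"), ("p", "b[i] = t"), (None, "i = i + 1")]
for guard, block in reverse_if_convert(schedule):
    if guard is None:
        print("\n".join(block))
    else:                                   # re-materialize the branch around the block
        print(f"br !{guard}, skip_{guard}")
        print("\n".join(block))
        print(f"skip_{guard}:")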


Science | 1991

Instruction-Level Parallel Processing

Joseph A. Fisher; B. Ramakrishna Rau

The performance of microprocessors has increased steadily over the past 20 years at a rate of about 50% per year. This is the cumulative result of architectural improvements as well as increases in circuit speed. Moreover, this improvement has been obtained in a transparent fashion, that is, without requiring programmers to rethink their algorithms and programs, thereby enabling the tremendous proliferation of computers that we see today. To continue this performance growth, microprocessor designers have incorporated instruction-level parallelism (ILP) into new designs. ILP utilizes the parallel execution of the lowest-level computer operations—adds, multiplies, loads, and so on—to increase performance transparently. The use of ILP promises to make possible, within the next few years, microprocessors whose performance is many times that of a CRAY-1S. This article provides an overview of ILP, with an emphasis on ILP architectures—superscalar, VLIW, and dataflow processors—and the compiler techniques necessary to make ILP work well.
