Publication


Featured research published by Rajiv A. Ravindran.


international symposium on microarchitecture | 2007

Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures

Michael L. Chu; Rajiv A. Ravindran; Scott A. Mahlke

The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single-core processor.
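The affinity-guided partitioning step can be sketched as a greedy heuristic. The function name, affinity weights, and working-set capacity model below are invented for illustration and are far simpler than the paper's program-level partitioner:

```python
# Hypothetical sketch: greedily assign memory operations to data caches,
# favoring the cache already holding each op's high-affinity partners.
# All names and the capacity model are assumptions, not the paper's.

def partition_memory_ops(affinity, working_set, capacity, num_caches=2):
    """affinity: {(op_a, op_b): profiled co-access weight};
    working_set: {op: estimated working-set size}.
    Places large-footprint ops first, subject to a per-cache budget."""
    caches = [set() for _ in range(num_caches)]
    load = [0] * num_caches
    for op in sorted(working_set, key=working_set.get, reverse=True):
        def score(c):
            # Total affinity to ops already placed in cache c.
            return sum(affinity.get((op, other), 0) +
                       affinity.get((other, op), 0) for other in caches[c])
        candidates = [c for c in range(num_caches)
                      if load[c] + working_set[op] <= capacity]
        if not candidates:
            candidates = list(range(num_caches))
        best = max(candidates, key=lambda c: (score(c), -load[c]))
        caches[best].add(op)
        load[best] += working_set[op]
    return caches
```

Ties break toward the less-loaded cache; the real method additionally partitions the computation instructions around the resulting data placement.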


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2001

Bitwidth cognizant architecture synthesis of custom hardware accelerators

Scott A. Mahlke; Rajiv A. Ravindran; Michael S. Schlansker; Robert Schreiber; Timothy Sherwood

Program-in, chip-out (PICO) is a system for automatically synthesizing embedded hardware accelerators from loop nests specified in the C programming language. A key issue confronted when designing such accelerators is the optimization of hardware by exploiting information that is known about the varying number of bits required to represent and process operands. In this paper, we describe the handling and exploitation of integer bitwidth in PICO. A bitwidth analysis procedure is used to determine bitwidth requirements for all integer variables and operations in a C application. Given known bitwidths for all variables, complex problems arise when determining a program schedule that specifies on which function unit (FU) and at what time each operation executes. If operations are assigned to FUs with no knowledge of bitwidth, bitwidth-related cost benefit is lost when each unit is built to accommodate the widest operation assigned. By carefully placing operations of similar width on the same unit, hardware costs are decreased. This problem is addressed using a preliminary clustering of operations that is based jointly on width and implementation cost. These clusters are then honored during resource allocation and operation scheduling to create an efficient width-conscious design. Experimental results show that exploiting integer bitwidth substantially reduces the gate count of PICO-synthesized hardware accelerators across a range of applications.
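The width-based clustering idea can be illustrated with a toy version; the linear cost model and names below are assumptions, not PICO's actual clustering formulation:

```python
# Hypothetical sketch of width-aware clustering: group operations onto
# function units so each unit is only as wide as its widest operation.
# The "cost = max width per unit" model is an invented simplification.

def cluster_by_width(op_widths, num_units):
    """op_widths: {op: required bitwidth}. Sort operations by width and
    split into contiguous clusters, one per function unit."""
    ops = sorted(op_widths.items(), key=lambda kv: kv[1])
    size = -(-len(ops) // num_units)  # ceiling division
    clusters = [ops[i:i + size] for i in range(0, len(ops), size)]
    # A unit costs as much as its widest assigned operation.
    cost = sum(max(w for _, w in c) for c in clusters)
    return clusters, cost
```

Sorting by width before splitting keeps narrow operations off wide units: with two 8-bit and two 32-bit operations on two units, the width-sorted split costs 8 + 32, whereas a width-blind split mixing narrow and wide operations on each unit would cost 32 + 32.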


symposium on code generation and optimization | 2005

Compiler Managed Dynamic Instruction Placement in a Low-Power Code Cache

Rajiv A. Ravindran; Pracheeti D. Nagarkar; Ganesh S. Dasika; Eric D. Marsman; Robert M. Senger; Scott A. Mahlke; Richard B. Brown

Modern embedded microprocessors use low power on-chip memories called scratch-pad memories to store frequently executed instructions and data. Unlike traditional caches, scratch-pad memories lack the complex tag checking and comparison logic, thereby proving to be efficient in area and power. In this work, we focus on exploiting scratch-pad memories for storing hot code segments within an application. Static placement techniques focus on placing the most frequently executed portions of programs into the scratch-pad. However, static schemes are inherently limited by not allowing the contents of the scratch-pad memory to change at run time. In a large fraction of applications, the instruction memory footprints exceed the scratch-pad memory size, thereby limiting the usefulness of the scratch-pad. We propose a compiler managed dynamic placement algorithm, wherein multiple hot code sequences, or traces, are overlapped with each other in the scratch-pad memory at different points in time during execution. Special copy instructions are provided to copy the traces into the scratch-pad memory at run-time. Using a power estimate, the compiler initially selects the most frequent traces in an application for relocation into the scratch-pad memory. Through iterative code motion and redundancy elimination, copy instructions are inserted in infrequently executed regions of the code. For a 64-byte code cache, the compiler managed dynamic placement achieves an average of 64% energy improvement over the static solution in a low-power embedded microcontroller.
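A toy version of the trace-selection step might look like the following; the energy figures and the greedy benefit-per-byte rule are invented placeholders for the paper's power-estimate-driven, iterative placement:

```python
# Hypothetical sketch: choose which hot traces to copy into a small
# scratch-pad, weighing per-execution energy savings against the
# one-time cost of the compiler-inserted copy instructions.
# All numbers and field meanings are invented for illustration.

def select_traces(traces, spm_size):
    """traces: {name: (size_bytes, exec_count, saving_per_exec, copy_cost)}.
    Greedy by net energy benefit per byte; the real scheme also
    time-multiplexes the scratch-pad across program phases."""
    def net(t):
        size, execs, save, copy = traces[t]
        return execs * save - copy
    chosen, used = [], 0
    for t in sorted(traces, key=lambda t: net(t) / traces[t][0], reverse=True):
        size = traces[t][0]
        if net(t) > 0 and used + size <= spm_size:
            chosen.append(t)
            used += size
    return chosen
```

A trace executed rarely never repays its copy cost, so it stays in main instruction memory even when scratch-pad space remains.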


application specific systems architectures and processors | 2003

Systematic register bypass customization for application-specific processors

Kevin Fan; Nathan Clark; Michael L. Chu; K. V. Manjunath; Rajiv A. Ravindran; Mikhail Smelyanskiy; Scott A. Mahlke

Register bypass provides additional datapaths to eliminate data hazards in processor pipelines. The difficulty with register bypass is that the cost of the bypass network is substantial and grows rapidly as processor width or pipeline depth increases. For a single application, many of the bypass paths have extremely low utilization. Thus, there is an important opportunity in the design of application-specific processors to remove a large fraction of the bypass cost while maintaining performance comparable to a processor with full bypass. We propose a systematic design customization process along with a bypass-cognizant compiler scheduler. For the former, we employ iterative design space exploration wherein successive processor designs are selected based on bypass utilization statistics combined with the availability of redundant bypass paths. Compiler scheduling for sparse bypass processors is accomplished by prioritizing function unit choices for each operation prior to scheduling using global information. Results show that for a 5-issue customized VLIW processor, 70% of the bypass cost is eliminated while sacrificing only 10% performance.


languages, compilers, and tools for embedded systems | 2007

Compiler-managed partitioned data caches for low power

Rajiv A. Ravindran; Michael L. Chu; Scott A. Mahlke

Set-associative caches are traditionally managed using hardware-based lookup and replacement schemes that have high energy overheads. Ideally, the caching strategy should be tailored to the application's memory needs, thus enabling optimal use of this on-chip storage to maximize performance while minimizing power consumption. However, doing this in hardware alone is difficult due to hardware complexity, high power dissipation, overheads of dynamic discovery of application characteristics, and increased likelihood of making locally optimal decisions. The compiler can instead determine the caching strategy by analyzing the application code and providing hints to the hardware. We propose a hardware/software co-managed partitioned cache architecture in which enhanced load/store instructions are used to control fine-grain data placement within a set of cache partitions. In comparison to traditional partitioning techniques, load and store instructions can individually specify the set of partitions for lookup and replacement. This fine-grain control can avoid conflicts, thus providing the performance benefits of highly associative caches, while saving energy by eliminating redundant tag and data array accesses. Using four direct-mapped partitions, we eliminated 25% of the tag checks and recorded an average 15% reduction in the energy-delay product compared to a hardware-managed 4-way set-associative cache.
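The tag-check savings can be illustrated with a toy model; the partition-mask encoding below is an assumption for illustration, not the paper's actual ISA extension:

```python
# Hypothetical sketch: each load/store carries a compiler-chosen bitmask
# naming which direct-mapped partitions to look up, so only those tags
# are checked instead of every way. Encoding is invented for this sketch.

def tag_checks(accesses, num_partitions=4):
    """accesses: list of (address, partition_mask) pairs.
    A conventional 4-way set-associative cache checks every way on
    every access; a mask naming k partitions checks only k tags."""
    conventional = len(accesses) * num_partitions
    partitioned = sum(bin(mask).count("1") for _, mask in accesses)
    return conventional, partitioned
```

When the compiler can prove most accesses need only one partition, the partitioned lookup approaches direct-mapped energy while retaining associative-like conflict avoidance.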


IEEE Transactions on Computers | 2005

Partitioning variables across register windows to reduce spill code in a low-power processor

Rajiv A. Ravindran; Robert M. Senger; Eric D. Marsman; Ganesh S. Dasika; Matthew R. Guthaus; Scott A. Mahlke; Richard B. Brown

Low-power embedded processors utilize compact instruction encodings to achieve small code size. Such encodings place tight restrictions on the number of bits available to encode operand specifiers and, thus, on the number of architected registers. As a result, performance and power are often sacrificed as the burden of operand supply is shifted from the register file to the memory due to the limited number of registers. In this paper, we investigate the use of a windowed register file to address this problem by providing more registers than allowed in the encoding. The registers are organized as a set of identical register windows where, at each point in the execution, there is a single active window. Special window management instructions are used to change the active window and to transfer values between windows. This design gives the appearance of a large register file without compromising the instruction encoding. To support the windowed register file, we designed and implemented a graph partitioning-based compiler algorithm that partitions program variables and temporaries referenced within a procedure across multiple windows. On a 16-bit embedded processor, an average of 11 percent improvement in application performance and 25 percent reduction in system power was achieved as an 8-register design was scaled from one to two windows.
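One way to see why the partitioning matters is to count the window-management instructions a given assignment forces. This toy cost model is not the paper's graph-partitioning algorithm; it simply charges one switch instruction per change of active window:

```python
# Hypothetical sketch: count window switches for a variable reference
# stream under a variable-to-window assignment. The single-active-window
# rule matches the abstract; the cost model itself is an assumption.

def window_switches(ref_stream, assignment):
    """assignment: {var: window index}. Each reference to a variable
    outside the active window costs one switch instruction."""
    switches = 0
    active = None
    for var in ref_stream:
        w = assignment[var]
        if w != active:
            if active is not None:
                switches += 1
            active = w
    return switches
```

An assignment that keeps co-referenced variables in the same window incurs far fewer switches than one that scatters them, which is exactly what the graph-partitioning objective minimizes.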


international symposium on circuits and systems | 2005

A 16-bit low-power microcontroller with monolithic MEMS-LC clocking

Eric D. Marsman; Robert M. Senger; Michael S. McCorquodale; Matthew R. Guthaus; Rajiv A. Ravindran; Ganesh S. Dasika; Scott A. Mahlke; Richard B. Brown

Low-power, single-chip integrated systems are prevailing in remote applications due to the increasing power and delay cost of inter-chip communication compared to on-chip computation. The paper describes the design and measured performance of a fully-functional digital core with a low-jitter, monolithic, MEMS-LC clock reference. This chip has been fabricated in TSMC's 0.18 μm MM/RF bulk CMOS process. Maximum power consumption of the complete microsystem is 48.78 mW operating at 90 MHz with a 1.8 V power supply.


compilers, architecture, and synthesis for embedded systems | 2003

Increasing the number of effective registers in a low-power processor using a windowed register file

Rajiv A. Ravindran; Robert M. Senger; Eric D. Marsman; Ganesh S. Dasika; Matthew R. Guthaus; Scott A. Mahlke; Richard B. Brown

Low-power embedded processors utilize compact instruction encodings to achieve small code size. Instruction sizes of 8 to 16 bits are common. Such encodings place tight restrictions on the number of bits available to encode operand specifiers, and thus on the number of architected registers. The central problem with this approach is that performance and power are often sacrificed as the burden of operand supply is shifted from the register file to the memory due to the limited number of registers. In this paper, we investigate the use of a windowed register file to address this problem by providing more registers than allowed in the encoding. The registers are organized as a set of identical register windows where at each point in the execution there is a single active window. Special window management instructions are used to change the active window and to transfer values between windows. The goal of this design is to give the appearance of a large register file without compromising the instruction encoding. To support the windowed register file, we designed and implemented a novel graph-partitioning-based compiler algorithm that partitions virtual registers within a given procedure across multiple windows. On a 16-bit embedded processor with a parameterized register window, an average of 10% improvement in application performance and 7% reduction in system power was achieved as an eight-register design was scaled from one to four windows.


symposium on code generation and optimization | 2004

FLASH: foresighted latency-aware scheduling heuristic for processors with customized datapaths

Manjunath Kudlur; Kevin Fan; Michael L. Chu; Rajiv A. Ravindran; Nathan Clark; Scott A. Mahlke

Application-specific instruction set processors (ASIPs) have the potential to meet the challenging cost, performance, and power goals of future embedded processors by customizing the hardware to suit an application. A central problem is creating compilers that are capable of dealing with the heterogeneous and nonuniform hardware created by the customization process. The processor datapath provides an effective area to customize, but specialized datapaths often have nonuniform connectivity between the function units, making the effective latency of a function unit dependent on the consuming operation. Traditional instruction schedulers break down in this environment due to their locally greedy nature of binding the best choice for a single operation even though that choice may be poor due to a lack of communication paths. To effectively schedule with nonuniform connectivity, we propose a foresighted latency-aware scheduling heuristic (FLASH) that performs lookahead across future scheduling steps to estimate the effects of a potential binding. FLASH combines a set of lookahead heuristics to achieve effective foresight with low compile-time overhead.
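The lookahead idea can be reduced to a one-level sketch; the latency table and the default cost for missing connections are invented for illustration:

```python
# Hypothetical one-level sketch of FLASH-style lookahead binding:
# score each candidate function unit by the best effective latency any
# of the consumer's candidate units would see from it, rather than
# binding greedily. Table contents and the "99" sentinel are invented.

def pick_unit(candidates, consumer_candidates, latency):
    """latency[(src_fu, dst_fu)]: effective producer-to-consumer latency
    on this datapath; missing entries mean no direct path and default
    to a large penalty."""
    def lookahead(u):
        return min(latency.get((u, c), 99) for c in consumer_candidates)
    return min(candidates, key=lookahead)
```

A locally greedy binder that ignores connectivity can strand an operation on a unit with no cheap path to its consumers; charging each candidate for its forward latency avoids that. The real heuristic looks several scheduling steps ahead, which this sketch collapses to one level.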


international symposium on microarchitecture | 2004

Cost-sensitive partitioning in an architecture synthesis system for multicluster processors

Michael L. Chu; Kevin Fan; Rajiv A. Ravindran; Scott A. Mahlke

Application-specific instruction processors (ASIPs) have great potential to meet the challenging demands of pervasive systems. We present a hierarchical system that automatically designs highly customized multicluster processors. In the first of two tightly coupled components, design space exploration heuristically searches for the basic capabilities that define the processor's overall parallelism. In the second, a hardware compiler determines the detailed architecture configuration that realizes this parallelism.
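The exploration component might be caricatured as a feasibility filter over a small design space; the cost and speedup models below are invented placeholders, not the paper's heuristics:

```python
# Hypothetical sketch of design space exploration: among candidate
# (clusters, issue_width) configurations, keep the cheapest one whose
# modeled speedup meets a performance target. Models are placeholders.

def explore(configs, cost, speedup, target):
    """configs: iterable of (clusters, issue_width) tuples; cost and
    speedup are caller-supplied models. Returns the cheapest feasible
    configuration, or None if no configuration meets the target."""
    feasible = [c for c in configs if speedup(c) >= target]
    return min(feasible, key=cost) if feasible else None
```

In the real system the second component, the hardware compiler, then fills in the detailed cluster-level configuration for the chosen point.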

Collaboration


Dive into Rajiv A. Ravindran's collaborations.

Top Co-Authors

Kevin Fan

University of Michigan