Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Ramadass Nagarajan is active.

Publication


Featured research published by Ramadass Nagarajan.


international symposium on computer architecture | 2003

Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Karthikeyan Sankaralingam; Ramadass Nagarajan; Haiming Liu; Changkyu Kim; Jaehyuk Huh; Doug Burger; Stephen W. Keckler; Charles R. Moore

This paper describes the polymorphous TRIPS architecture, which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
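
As a rough, non-authoritative illustration of the polymorphism described above, the Python sketch below carves the same four hypothetical cores up differently per mode. All names are invented; the real TRIPS mechanisms are hardware protocols, not software objects.

    # Toy model: the same four cores, reconfigured per execution mode.
    def configure(mode, n_cores=4):
        if mode == "ILP":
            # one thread per wide core; the full window chases one stream
            return {f"core{c}": ["thread0"] for c in range(n_cores)}
        if mode == "TLP":
            # partition each core among four independent threads
            return {f"core{c}": [f"thread{t}" for t in range(4)]
                    for c in range(n_cores)}
        if mode == "DLP":
            # whole cores run the same kernel over different data chunks
            return {f"core{c}": [f"chunk{c}"] for c in range(n_cores)}
        raise ValueError(mode)

    for mode in ("ILP", "TLP", "DLP"):
        print(mode, configure(mode))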


international symposium on microarchitecture | 2001

A design space evaluation of grid processor architectures

Ramadass Nagarajan; Karthikeyan Sankaralingam; Doug Burger; Stephen W. Keckler

In this paper we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Programs are executed by mapping blocks of statically scheduled instructions to the ALU array and executing them dynamically in dataflow order. This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary values back to the register file, avoiding most of the large, unscalable structures that limit the scalability of conventional architectures. Finally, we present simulation results of a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the IPC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average IPC of 11 across nine SPEC CPU2000 and Mediabench benchmarks.
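
As a loose illustration of the GPA execution model, the following Python sketch fires a statically mapped block in dataflow order; each instruction knows only its ALU slot and its consumers, so results flow point to point instead of through a register file. The encoding and block contents are invented for this example.

    import operator

    OPS = {"add": operator.add, "mul": operator.mul}

    # slot -> (opcode, consumer slots); every instruction here is two-operand
    block = {
        0: ("add", [2]),
        1: ("mul", [2]),
        2: ("add", []),          # block output
    }

    operands = {slot: [] for slot in block}
    operands[0] = [1, 2]         # block inputs injected from outside
    operands[1] = [3, 4]

    ready = [s for s, vals in operands.items() if len(vals) == 2]
    while ready:                 # fire in dataflow order
        slot = ready.pop()
        op, consumers = block[slot]
        result = OPS[op](*operands[slot])
        print(f"slot {slot}: {op}{tuple(operands[slot])} = {result}")
        for c in consumers:      # forward along the operand network
            operands[c].append(result)
            if len(operands[c]) == 2:
                ready.append(c)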


international symposium on microarchitecture | 2006

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor

Karthikeyan Sankaralingam; Ramadass Nagarajan; Robert McDonald; Rajagopalan Desikan; S. Drolia; Madhu Saravana Sibi Govindan; P. Gratz; Divya P. Gulati; Heather Hanson; Changkyu Kim; Haiming Liu; Nitya Ranganathan; Simha Sethumadhavan; S. Shariff; Premkishore Shivakumar; Stephen W. Keckler; Doug Burger

Growing on-chip wire delays will cause many future microarchitectures to be distributed, in which hardware resources within a single processor become nodes on one or more switched micronetworks. Since large processor cores will require multiple clock cycles to traverse, control must be distributed, not centralized. This paper describes the control protocols in the TRIPS processor, a distributed, tiled microarchitecture that supports dynamic execution. It details each of the five types of reused tiles that compose the processor, the control and data networks that connect them, and the distributed microarchitectural protocols that implement instruction fetch, execution, flush, and commit. We also describe the physical design issues that arose when implementing the microarchitecture in a 170M-transistor, 130nm ASIC prototype chip composed of two 16-wide-issue distributed processor cores and a distributed 1MB non-uniform cache architecture (NUCA) on-chip memory system.
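
The flavor of a distributed control protocol can be suggested in a few lines of Python: completion is established by counting messages on a micronetwork rather than by a centralized signal. Tile names and message formats below are made up; the paper defines the actual protocols.

    from collections import deque

    tiles = {"ET0", "ET1", "DT0", "RT0"}    # assumed execution/data/register tiles
    outstanding = set(tiles)
    net = deque()                           # the control micronetwork

    for t in tiles:                         # global control broadcasts execute
        net.append((t, "EXECUTE"))

    while net:
        dst, msg = net.popleft()
        if msg == "EXECUTE":                # each tile works, then reports back
            net.append(("GT", f"DONE:{dst}"))
        elif msg.startswith("DONE:"):
            outstanding.discard(msg.split(":", 1)[1])
            if not outstanding:
                print("all tiles reported -> commit block, fetch next")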


ACM Transactions on Architecture and Code Optimization | 2004

TRIPS: A polymorphous architecture for exploiting ILP, TLP, and DLP

Karthikeyan Sankaralingam; Ramadass Nagarajan; Haiming Liu; Changkyu Kim; Jaehyuk Huh; Nitya Ranganathan; Doug Burger; Stephen W. Keckler; Robert McDonald; Charles R. Moore

This paper describes the polymorphous TRIPS architecture that can be configured for different granularities and types of parallelism. The TRIPS architecture is the first in a class of post-RISC, dataflow-like instruction sets called explicit data-graph execution (EDGE). This EDGE ISA is coupled with hardware mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture prototype contains two out-of-order, 16-wide-issue grid processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes---ILP, TLP, and DLP---demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
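
One way to see what "explicit data-graph execution" means is that instructions encode their consumers rather than result registers. The sketch below shows only that idea; the field names and formats are hypothetical, not the TRIPS ISA.

    from dataclasses import dataclass, field

    @dataclass
    class EdgeInst:
        opcode: str
        # each target is (consumer slot, operand position): the dataflow
        # edges of the block are explicit in the encoding
        targets: list = field(default_factory=list)

    block = [
        EdgeInst("read_r1", targets=[(2, 0)]),   # route r1 to inst 2, operand 0
        EdgeInst("read_r2", targets=[(2, 1)]),   # route r2 to inst 2, operand 1
        EdgeInst("add",     targets=[(3, 0)]),   # sum feeds the block output
        EdgeInst("write_r3"),                    # only block outputs touch registers
    ]

    for i, inst in enumerate(block):
        print(i, inst.opcode, "->", inst.targets or "register file")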


international conference on parallel architectures and compilation techniques | 2004

Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures

Ramadass Nagarajan; Sundeep K. Kushwaha; Doug Burger; Kathryn S. McKinley; Calvin Lin; Stephen W. Keckler

Technology trends present new challenges for processor architectures and their instruction schedulers. Growing transistor density increases the number of execution units on a single chip, and decreasing wire transmission speeds cause long and variable on-chip latencies. These trends severely limit the two dominant conventional architectures: dynamic-issue superscalars, and static placement and issue VLIWs. We present a new execution model, called static placement dynamic issue (SPDI), in which the static scheduler and the hardware instead work cooperatively. This paper focuses on the static instruction scheduler for SPDI. We identify and explore three issues SPDI schedulers must consider: locality, contention, and depth of speculation. We evaluate a range of SPDI scheduling algorithms executing on an explicit data graph execution (EDGE) architecture. We find that a surprisingly simple one achieves an average of 5.6 instructions per cycle (IPC) for SPEC2000 on a 64-wide-issue machine, and comes within 80% of the performance achievable with no on-chip latencies. These results suggest that the compiler is effective at balancing on-chip latency and parallelism, and that this division of responsibilities between the compiler and the architecture is well suited to future systems.
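
In the spirit of the paper's finding that a simple scheduler suffices, here is a greedy SPDI-style placer that trades off the first two issues, locality against contention. The grid size, cost terms, and weights are assumptions for illustration, not the authors' algorithm.

    import itertools

    GRID = list(itertools.product(range(4), range(4)))   # assumed 4x4 ALU array

    def place(n_insts, producers):
        """producers[i] = indices of instructions feeding instruction i."""
        placement, load = {}, {slot: 0 for slot in GRID}
        for i in range(n_insts):
            def cost(slot):
                # locality: hops to producers; contention: instructions
                # already mapped to this slot (weight 2 is arbitrary)
                hops = sum(abs(slot[0] - placement[p][0]) +
                           abs(slot[1] - placement[p][1])
                           for p in producers[i])
                return hops + 2 * load[slot]
            best = min(GRID, key=cost)
            placement[i] = best
            load[best] += 1
        return placement

    # chain 0 -> 1 -> 2 plus an independent instruction 3
    print(place(4, {0: [], 1: [0], 2: [1], 3: []}))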


international symposium on microarchitecture | 2003

Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Karthikeyan Sankaralingam; Ramadass Nagarajan; Haiming Liu; Changkyu Kim; Jaehyuk Huh; Doug Burger; Stephen W. Keckler; Charles R. Moore

The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems. It does so by employing the concept of polymorphism, which permits the runtime system to configure the hardware execution resources to match the mode of execution and demands of the compiler and application.


international solid-state circuits conference | 2003

A wire-delay scalable microprocessor architecture for high performance systems

Stephen W. Keckler; Doug Burger; Charles R. Moore; Ramadass Nagarajan; Karthikeyan Sankaralingam; Vikas Agarwal; M. S. Hrishikesh; Nitya Ranganathan; Premkishore Shivakumar

This scalable processor architecture consists of chained ALUs to minimize the physical distance between dependent instructions, thus mitigating the effect of long on-chip wire delays. Simulation studies demonstrate 1.3-15× more instructions per clock than conventional superscalar architectures.
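
A back-of-the-envelope model shows why chaining dependent instructions onto adjacent ALUs helps: if an operand transfer costs one cycle per grid hop (an assumed figure), placement alone changes the latency of a dependence chain severalfold.

    def chain_latency(slots, exec_cycles=1):
        """Latency of a dependence chain placed at the given grid slots."""
        hops = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                   for a, b in zip(slots, slots[1:]))
        return exec_cycles * len(slots) + hops

    chained   = [(0, 0), (0, 1), (0, 2), (0, 3)]   # neighbors: 3 hops total
    scattered = [(0, 0), (3, 3), (0, 3), (3, 0)]   # 15 hops total
    print(chain_latency(chained), "vs", chain_latency(scattered))   # 7 vs 19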


international symposium on microarchitecture | 2006

Dataflow Predication

Aaron Smith; Ramadass Nagarajan; Karthikeyan Sankaralingam; Robert McDonald; Doug Burger; Stephen W. Keckler; Kathryn S. McKinley

Predication facilitates high-bandwidth fetch and large static scheduling regions, but has typically been too complex to implement comprehensively in out-of-order microarchitectures. This paper describes dataflow predication, which provides per-instruction predication in a dataflow ISA, low predicate-computation overheads similar to VLIW ISAs, and low-complexity out-of-order issue. A two-bit field in each instruction specifies whether the instruction is predicated, in which case an arriving predicate token determines whether it should execute. Dataflow predication incorporates three features that reduce predication overheads. First, dataflow predicate computation permits computation of compound predicates with virtually no overhead instructions. Second, early mispredication termination squashes in-flight instructions with false predicates at any time, eliminating the overhead of falsely predicated paths. Finally, implicit predication mitigates the fanout overhead of dataflow predicates by predicating only the heads of dependence chains, reducing the number of explicitly predicated instructions. Dataflow predication also exposes new compiler optimizations, such as disjoint instruction merging and path-sensitive predicate removal, for increased performance of predicated code in an out-of-order design.
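
The core mechanism can be miniaturized in a few lines: a two-bit field marks an instruction as unpredicated, predicated-on-true, or predicated-on-false, and an arriving predicate token either releases or squashes it. The encoding values below are assumptions, not the ones in the TRIPS ISA.

    NOT_PRED, PRED_TRUE, PRED_FALSE = 0b00, 0b01, 0b10

    def on_token(opcode, pred_field, token=None):
        if pred_field == NOT_PRED:
            return f"{opcode}: executes unconditionally"
        expected = (pred_field == PRED_TRUE)
        if token == expected:
            return f"{opcode}: predicate matched, executes"
        return f"{opcode}: predicate mismatched, squashed in flight"

    print(on_token("add", NOT_PRED))
    print(on_token("ld",  PRED_TRUE,  token=True))
    print(on_token("st",  PRED_FALSE, token=True))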


international symposium on performance analysis of systems and software | 2006

Critical path analysis of the TRIPS architecture

Ramadass Nagarajan; Xia Chen; Robert McDonald; Doug Burger; Stephen W. Keckler

Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how the dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet delivers an order-of-magnitude (30x) improvement in performance over previously proposed techniques. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily hand-optimized program by as much as 11%.
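
A minimal version of the underlying graph manipulation, with an invented event graph: nodes are microarchitectural events, weighted edges are dependence constraints (fetch, operand routing, execution), and the longest path to commit is the critical path.

    import functools

    # node -> [(predecessor, latency in cycles, constraint kind)]
    graph = {
        "fetch":   [],
        "i0_exec": [("fetch", 1, "fetch")],
        "i1_exec": [("i0_exec", 3, "operand_route"),   # a long cross-chip hop
                    ("fetch", 1, "fetch")],
        "commit":  [("i1_exec", 1, "execute")],
    }

    @functools.cache
    def arrival(node):
        """Earliest completion time of `node` and one path achieving it."""
        if not graph[node]:
            return 0, (node,)
        return max((arrival(p)[0] + lat, arrival(p)[1] + (node,))
                   for p, lat, _ in graph[node])

    cycles, path = arrival("commit")
    print(f"critical path ({cycles} cycles):", " -> ".join(path))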


workshop on interaction between compilers and computer architectures | 2001

A Technology-Scalable Architecture for Fast Clocks and High ILP

Karthikeyan Sankaralingam; Ramadass Nagarajan; Stephen W. Keckler; Doug Burger

CMOS technology scaling poses challenges in designing dynamically scheduled cores that can sustain both high instruction-level parallelism and aggressive clock frequencies. In this paper, we present a new architecture that maps compiler-scheduled blocks onto a two-dimensional grid of ALUs. For the mapped window of execution, instructions execute in a dataflow-like manner, with each ALU forwarding its result along short wires to the consumers of that result. We describe our studies of program behavior and a preliminary evaluation showing that this architecture has the potential for both high clock speeds and high ILP, and may offer the best of both VLIW and dynamic superscalar architectures.

Collaboration


Dive into Ramadass Nagarajan's collaboration.

Top Co-Authors


Doug Burger

University of Texas at Austin


Robert McDonald

University of Texas at Austin


Haiming Liu

University of Texas at Austin


Charles R. Moore

University of Texas at Austin


Nitya Ranganathan

University of Texas at Austin


Divya P. Gulati

University of Texas at Austin
