
Saman P. Amarasinghe

Massachusetts Institute of Technology

Publication

Featured research published by Saman P. Amarasinghe.

compiler construction | 2002

StreamIt: A Language for Streaming Applications

William Thies; Michal Karczmarek; Saman P. Amarasinghe

We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important. The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain. At the same time, the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations. In this paper, we motivate, describe and justify the language features of StreamIt, which include: a structured model of streams, a messaging system for control, a re-initialization mechanism, and a natural textual syntax.

international symposium on microarchitecture | 2002

The Raw microprocessor: a computational fabric for software circuits and general-purpose programs

Michael Bedford Taylor; Jason Kim; Jason Miller; David Wentzlaff; Fae Ghodrat; Ben Greenwald; Henry Hoffman; Paul Johnson; Jaewook Lee; Walter Lee; Albert Ma; Arvind Saraf; Mark Seneski; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal

Wire delay is emerging as the natural limiter to microprocessor scalability. A new architectural approach could solve this problem, as well as deliver unprecedented performance, energy efficiency and cost effectiveness. The Raw microprocessor research prototype uses a scalable instruction set architecture to attack the emerging wire-delay problem by providing a parallel, software interface to the gate, wire and pin resources of the chip. An architecture that has direct, first-class analogs to all of these physical resources will ultimately let programmers achieve the maximum amount of performance and energy efficiency in the face of wire delay.

IEEE Computer | 1997

Baring it all to software: Raw machines

Elliot Waingold; Michael Bedford Taylor; Devabhaktuni Srikrishna; Vivek Sarkar; Walter Lee; Victor Lee; Jang Kim; Matthew I. Frank; Peter Finch; Rajeev Barua; Jonathan Babb; Saman P. Amarasinghe; Anant Agarwal

The most radical of the architectures that appear in this issue are Raw processors: highly parallel architectures with hundreds of very simple processors coupled to a small portion of the on-chip memory. Each processor, or tile, also contains a small bank of configurable logic, allowing synthesis of complex operations directly in configurable hardware. Unlike the others, this architecture does not use a traditional instruction set architecture. Instead, programs are compiled directly onto the Raw hardware, with all units told explicitly what to do by the compiler. The compiler even schedules most of the intertile communication. The real limitation to this architecture is the efficacy of the compiler. The authors demonstrate impressive speedups for simple algorithms that lend themselves well to this architectural model, but whether this architecture will be effective for future workloads is an open question.

IEEE Computer | 1996

Maximizing multiprocessor performance with the SUIF compiler

Mary W. Hall; Jennifer-Ann M. Anderson; Saman P. Amarasinghe; Brian R. Murphy; Shih-Wei Liao; Edouard Bugnion; Monica S. Lam

This article describes automatic parallelization techniques in the SUIF (Stanford University Intermediate Format) compiler that result in good multiprocessor performance for array-based numerical programs. Parallelizing compilers for multiprocessors face many hurdles. However, SUIF's robust analysis and memory optimization techniques enabled speedups on three-fourths of the NAS and SPECfp95 benchmark programs.

Sigplan Notices | 1994

SUIF: an infrastructure for research on parallelizing and optimizing compilers

Robert P. Wilson; Robert S. French; Christopher S. Wilson; Saman P. Amarasinghe; Jennifer-Ann M. Anderson; Steven W. K. Tjiang; Shih-Wei Liao; Chau-Wen Tseng; Mary W. Hall; Monica S. Lam; John L. Hennessy

Compiler infrastructures that support experimental research are crucial to the advancement of high-performance computing. New compiler technology must be implemented and evaluated in the context of a complete compiler, but developing such an infrastructure requires a huge investment in time and resources. We have spent a number of years building the SUIF compiler into a powerful, flexible system, and we would now like to share the results of our efforts. SUIF consists of a small, clearly documented kernel and a toolkit of compiler passes built on top of the kernel. The kernel defines the intermediate representation, provides functions to access and manipulate the intermediate representation, and structures the interface between compiler passes. The toolkit currently includes C and Fortran front ends, a loop-level parallelism and locality optimizer, an optimizing MIPS back end, a set of compiler development tools, and support for instructional use. Although we do not expect SUIF to be suitable for everyone, we think it may be useful for many other researchers. We thus invite you to use SUIF and welcome your contributions to this infrastructure. Directions for obtaining the SUIF software are included at the end of this paper.

architectural support for programming languages and operating systems | 2006

Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Michael I. Gordon; William Thies; Saman P. Amarasinghe

As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains an 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work.

international symposium on computer architecture | 2004

Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Michael Bedford Taylor; James Psota; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal; Walter Lee; Jason E. Miller; David Wentzlaff; Ian Rudolf Bratt; Ben Greenwald; Henry Hoffmann; Paul Johnson; Jason Kim

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10× to 100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.

symposium on code generation and optimization | 2003

An infrastructure for adaptive dynamic optimization

Derek L. Bruening; Timothy Garnett; Saman P. Amarasinghe

Dynamic optimization is emerging as a promising approach to overcome many of the obstacles of traditional static compilation. But while there are a number of compiler infrastructures for developing static optimizations, there are very few for developing dynamic optimizations. We present a framework for implementing dynamic analyses and optimizations. We provide an interface for building external modules, or clients, for the DynamoRIO dynamic code modification system. This interface abstracts away many low-level details of the DynamoRIO runtime system while exposing a simple and powerful, yet efficient and lightweight API. This is achieved by restricting optimization units to linear streams of code and using adaptive levels of detail for representing instructions. The interface is not restricted to optimization and can be used for instrumentation, profiling, dynamic translation, etc. To demonstrate the usefulness and effectiveness of our framework, we implemented several optimizations. These improve the performance of some applications by as much as 40% relative to native execution. The average speedup relative to base DynamoRIO performance is 12%.

programming language design and implementation | 2000

Exploiting superword level parallelism with multimedia instruction sets

Samuel Larsen; Saman P. Amarasinghe

Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block. In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.

architectural support for programming languages and operating systems | 2002

A stream compiler for communication-exposed architectures

Michael I. Gordon; William Thies; Michal Karczmarek; Jasper Lin; Ali S. Meli; Andrew A. Lamb; Chris L. Leger; Jeremy Wong; Henry Hoffmann; David Maze; Saman P. Amarasinghe

With the increasing miniaturization of transistors, wire delays are becoming a dominant factor in microprocessor performance. To address this issue, a number of emerging architectures contain replicated processing units with software-exposed communication between one unit and another (e.g., Raw, SmartMemories, TRIPS). However, for their use to be widespread, it will be necessary to develop compiler technology that enables a portable, high-level language to execute efficiently across a range of wire-exposed architectures. In this paper, we describe our compiler for StreamIt: a high-level, architecture-independent language for streaming applications. We focus on our backend for the Raw processor. Though StreamIt exposes the parallelism and communication patterns of stream programs, some analysis is needed to adapt a stream program to a software-exposed processor. We describe a partitioning algorithm that employs fission and fusion transformations to adjust the granularity of a stream graph, a layout algorithm that maps a stream graph to a given network topology, and a scheduling strategy that generates a fine-grained static communication pattern for each computational element. We have implemented a fully functional compiler that parallelizes StreamIt applications for Raw, including several load-balancing transformations. Using the cycle-accurate Raw simulator, we demonstrate that the StreamIt compiler can automatically map a high-level stream abstraction to Raw without losing performance. We consider this work to be a first step towards a portable programming model for communication-exposed architectures.
