
Publication


Featured research published by Walter Lee.


International Symposium on Microarchitecture | 2002

The Raw microprocessor: a computational fabric for software circuits and general-purpose programs

Michael Bedford Taylor; Jason Kim; Jason Miller; David Wentzlaff; Fae Ghodrat; Ben Greenwald; Henry Hoffmann; Paul Johnson; Jaewook Lee; Walter Lee; Albert Ma; Arvind Saraf; Mark Seneski; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal

Wire delay is emerging as the natural limiter to microprocessor scalability. A new architectural approach could solve this problem, as well as deliver unprecedented performance, energy efficiency and cost effectiveness. The Raw microprocessor research prototype uses a scalable instruction set architecture to attack the emerging wire-delay problem by providing a parallel, software interface to the gate, wire and pin resources of the chip. An architecture that has direct, first-class analogs to all of these physical resources will ultimately let programmers achieve the maximum amount of performance and energy efficiency in the face of wire delay.


IEEE Computer | 1997

Baring it all to software: Raw machines

Elliot Waingold; Michael Bedford Taylor; Devabhaktuni Srikrishna; Vivek Sarkar; Walter Lee; Victor Lee; Jason Kim; Matthew I. Frank; Peter Finch; Rajeev Barua; Jonathan Babb; Saman P. Amarasinghe; Anant Agarwal

The most radical of the architectures that appear in this issue are Raw processors: highly parallel architectures with hundreds of very simple processors, each coupled to a small portion of the on-chip memory. Each processor, or tile, also contains a small bank of configurable logic, allowing synthesis of complex operations directly in configurable hardware. Unlike the others, this architecture does not use a traditional instruction set architecture. Instead, programs are compiled directly onto the Raw hardware, with all units told explicitly what to do by the compiler. The compiler even schedules most of the intertile communication. The real limitation to this architecture is the efficacy of the compiler. The authors demonstrate impressive speedups for simple algorithms that lend themselves well to this architectural model, but whether this architecture will be effective for future workloads is an open question.


International Symposium on Computer Architecture | 2004

Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams

Michael Bedford Taylor; James Psota; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman P. Amarasinghe; Anant Agarwal; Walter Lee; Jason E. Miller; David Wentzlaff; Ian Rudolf Bratt; Ben Greenwald; Henry Hoffmann; Paul Johnson; Jason Kim

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware.
Compared to a 180 nm Pentium III using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10× to 100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.


Architectural Support for Programming Languages and Operating Systems | 1998

Space-time scheduling of instruction-level parallelism on a raw machine

Walter Lee; Rajeev Barua; Matthew I. Frank; Devabhaktuni Srikrishna; Jonathan Babb; Vivek Sarkar; Saman P. Amarasinghe

Increasing demand for both greater parallelism and faster clocks dictate that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning as well as traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results that demonstrate that although Raw machines provide no mechanisms for global communication the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.
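
The space-time scheduling problem the abstract describes can be illustrated with a toy sketch (illustrative only, not RAWCC itself; all names and the one-cycle hop latency are assumptions): each instruction of a dataflow graph is assigned to a tile (space), and per-tile start times are then derived while charging network latency for operands that cross tiles (time).

```python
# Toy space-time scheduler: spatial partitioning plus temporal
# scheduling with explicit inter-tile communication cost.
# (Hypothetical sketch; not the RAWCC algorithm.)

HOP_LATENCY = 1  # assumed cycles to move an operand between tiles

def space_time_schedule(deps, num_tiles):
    """deps maps instruction -> list of producer instructions,
    given in topological order (Python dicts preserve insertion order)."""
    placement, finish = {}, {}
    for instr, producers in deps.items():
        # Space: greedily co-locate with the first producer, else round-robin.
        tile = placement[producers[0]] if producers else len(placement) % num_tiles
        placement[instr] = tile
        # Time: an instruction is ready once every operand has arrived,
        # paying HOP_LATENCY whenever a producer lives on another tile.
        ready = 0
        for p in producers:
            arrive = finish[p] + (HOP_LATENCY if placement[p] != tile else 0)
            ready = max(ready, arrive)
        finish[instr] = ready + 1  # one cycle to execute

    return placement, finish

placement, finish = space_time_schedule(
    {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}, num_tiles=2)
```

Even in this tiny example the coupling the paper emphasizes is visible: where "c" is placed determines which of its operands incurs a network hop, which in turn shifts its start time.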


High-Performance Computer Architecture | 2003

Scalar operand networks: on-chip interconnect for ILP in partitioned architectures

Michael Bedford Taylor; Walter Lee; Saman P. Amarasinghe; Anant Agarwal

The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.


Field-Programmable Custom Computing Machines | 1999

Parallelizing applications into silicon

Jonathan Babb; Martin C. Rinard; Csaba Andras Moritz; Walter Lee; Matthew I. Frank; Rajeev Barua; Saman P. Amarasinghe

The next decade of computing will be dominated by embedded systems, information appliances and application-specific computers. In order to build these systems, designers will need high-level compilation and CAD tools that generate architectures that effectively meet the needs of each application. In this paper we present a novel compilation system that allows sequential programs, written in C or FORTRAN, to be compiled directly into custom silicon or reconfigurable architectures. This capability is also interesting because trends in computer architecture are moving towards more reconfigurable hardware-like substrates, such as FPGA-based systems. Our system works by combining two resource-efficient computing disciplines: Small Memories and Virtual Wires. For a given application, the compiler first analyzes the memory access patterns of pointers and arrays in the program and constructs a partitioned memory system made up of many small memories. The computation is implemented by active computing elements that are spatially distributed within the memory array. A space-time scheduler assigns instructions to the computing elements in a way that maximizes locality and minimizes physical communication distance. It also generates an efficient static schedule for the interconnect. Finally, specialized hardware for the resulting schedule of memory accesses, wires, and computation is generated as a multi-process state machine in synthesizable Verilog. With this system, implemented as a set of SUIF compiler passes, we have successfully compiled programs into hardware, achieving specialization performance enhancements of up to an order of magnitude versus a single general-purpose processor. We also achieve additional parallelization speedups similar to those obtainable using a tightly-interconnected multiprocessor.


IEEE Transactions on Parallel and Distributed Systems | 2005

Scalar operand networks

Michael Bedford Taylor; Walter Lee; Saman P. Amarasinghe; Anant Agarwal

The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALUs. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend toward distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latency (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This work discusses the unique properties of scalar operand networks (SONs), examines alternative ways of implementing them, and introduces the AsTrO taxonomy to distinguish between them. It discusses the design of two alternative networks in the context of the Raw microprocessor, and presents timing, area, and energy statistics for a real implementation. The paper also presents a 5-tuple performance model for SONs and analyzes their performance sensitivity to network properties for ILP workloads.
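
The 5-tuple performance model mentioned in the abstract can be sketched as follows (a hedged illustration: the decomposition into send occupancy, send latency, per-hop network latency, receive latency, and receive occupancy follows the abstract's terminology, but the additive form and the example numbers below are assumptions):

```python
# Sketch of a 5-tuple scalar-operand-network cost model:
# end-to-end operand transport cost as the sum of the five
# components, with the network term scaled by hop count.
# (Illustrative; not the paper's exact model.)

def operand_cost(so, sl, nhl, rl, ro, hops):
    """so/ro: send/receive occupancy (cycles the ALU is busy),
    sl/rl: send/receive latency, nhl: per-hop network latency."""
    return so + sl + nhl * hops + rl + ro

# Assumed Raw-like numbers: zero-cycle send/receive occupancy,
# one-cycle send, hop, and receive latencies.
assumed_raw_like = dict(so=0, sl=1, nhl=1, rl=1, ro=0)
three_hop_cost = operand_cost(hops=3, **assumed_raw_like)
```

The model makes the abstract's "ultra-low latency" point concrete: with single-cycle components, a few hops already dominate the cost of an operand transfer, so occupancy and per-hop latency must be kept near zero.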


International Solid-State Circuits Conference | 2003

A 16-issue multiple-program-counter microprocessor with point-to-point scalar operand network

Michael Bedford Taylor; Jason Kim; Jason Miller; David Wentzlaff; Fae Ghodrat; Ben Greenwald; Henry Hoffmann; Paul Johnson; Walter Lee; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Saman P. Amarasinghe; Anant Agarwal

This microprocessor explores an architectural solution to scalability problems in scalar operand networks. The 0.15 µm, 6-metal, 331 mm² research prototype issues 16 unique instructions per cycle and uses an on-chip point-to-point scalar operand network to transfer operands among distributed functional units.


IEEE Transactions on Computers | 2001

Compiler support for scalable and efficient memory systems

Rajeev Barua; Walter Lee; Saman P. Amarasinghe; Anant Agarwal

Technological trends require that future scalable microprocessors be decentralized. Applying these trends toward memory systems shows that the size of the cache accessible in a single cycle will decrease in a future generation of chips. Thus, a bank-exposed memory system composed of small, decentralized cache banks must eventually replace that of a monolithic cache. This paper considers how to effectively use such a memory system for sequential programs. This paper presents Maps, the software technology central to bank-exposed architectures, which are architectures with bank-exposed memory systems. Maps solves the problem of bank disambiguation: determining at compile time which bank a memory reference is accessing. Bank disambiguation is important because it enables compile-time optimization for data locality, where data can be placed close to the computation that requires it. Two methods for bank disambiguation are presented: equivalence-class unification and modulo unrolling. Experimental results are presented using a compiler for the MIT Raw machine, a bank-exposed architecture that relies on the compiler to 1) manage its memory and 2) orchestrate its instruction-level parallelism and communication. Results on Raw using sequential codes demonstrate that using bank disambiguation improves performance by a factor of 3 to 5 over using ILP alone.
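
The intuition behind modulo unrolling, one of the two disambiguation methods the abstract names, can be shown with a small sketch (illustrative only; the bank count and low-order interleaving are assumptions, not details from the paper):

```python
# Sketch of bank disambiguation via modulo unrolling (illustrative):
# with B low-order-interleaved banks, element A[i] lives on bank i % B.
# The bank of a reference A[i] varies at run time, but after unrolling
# the loop by a factor of B, the k-th unrolled copy references
# A[i*B + k], whose bank (i*B + k) % B == k is a compile-time constant.

B = 4  # assumed number of memory banks

def bank_of(i):
    return i % B  # assumed low-order interleaving

# Original loop body A[i]: the bank depends on i, so the compiler
# cannot route the access to a fixed bank.
dynamic_banks = {bank_of(i) for i in range(16)}

# After unrolling by B: each of the B copies always hits one bank.
static_banks = [{bank_of(i * B + k) for i in range(4)} for k in range(B)]
```

Each unrolled reference now maps to exactly one statically known bank, which is what lets the compiler place data next to the computation and schedule the accesses without run-time disambiguation.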


Job Scheduling Strategies for Parallel Processing | 1997

Implications of I/O for Gang Scheduled Workloads

Walter Lee; Matthew I. Frank; Victor Lee; Kenneth Mackenzie; Larry Rudolph

The job workloads of general-purpose multiprocessors usually include both compute-bound parallel jobs, which often require gang scheduling, as well as I/O-bound jobs, which require high CPU priority for the individual gang members of the job in order to achieve interactive response times. Our results indicate that an effective interactive multiprocessor scheduler must be flexible and tailor the priority, time quantum, and extent of gang scheduling to the individual needs of each job.

Collaboration


Dive into Walter Lee's collaborations.

Top Co-Authors

Anant Agarwal, Massachusetts Institute of Technology
Saman P. Amarasinghe, Massachusetts Institute of Technology
Ben Greenwald, Massachusetts Institute of Technology
Nathan Shnidman, Massachusetts Institute of Technology
Paul Johnson, Massachusetts Institute of Technology
Arvind Saraf, Massachusetts Institute of Technology
Jason Kim, Massachusetts Institute of Technology
Victor Lee, Massachusetts Institute of Technology