Michael Pellauer

Massachusetts Institute of Technology

Publication

Featured research published by Michael Pellauer.

high-performance computer architecture | 2011

HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing

Michael Pellauer; Michael Adler; Michel A. Kinsy; Angshuman Parashar; Joel S. Emer

In this paper we present the HAsim FPGA-accelerated simulator. HAsim is able to model a shared-memory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our time-multiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.

field programmable gate arrays | 2011

Leap scratchpads: automatic memory and cache management for reconfigurable logic

Michael Adler; Kermin Fleming; Angshuman Parashar; Michael Pellauer; Joel S. Emer

Developers accelerating applications on FPGAs or other reconfigurable logic have nothing but raw memory devices in their standard toolkits. Each project typically includes tedious development of single-use memory management. Software developers expect a programming environment to include automatic memory management. Virtual memory provides the illusion of very large arrays and processor caches reduce access latency without explicit programmer instructions. LEAP scratchpads for reconfigurable logic dynamically allocate and manage multiple, independent, memory arrays in a large backing store. Scratchpad accesses are cached automatically in multiple levels, ranging from shared on-board, RAM-based, set-associative caches to private caches stored in FPGA RAM blocks. In the LEAP framework, scratchpads share the same interface as on-die RAM blocks and are plug-in replacements. Additional libraries support heap management within a storage set. Like software developers, accelerator authors using scratchpads may focus more on core algorithms and less on memory management.

international symposium on computer architecture | 2013

Triggered instructions: a control paradigm for spatially-programmed architectures

Angshuman Parashar; Michael Pellauer; Michael Adler; Bushra Ahsan; Neal Clayton Crago; Daniel Lustig; Vladimir Pavlov; Antonia Zhai; Mohit Gambhir; Aamer Jaleel; Randy L. Allmon; Rachid Rayess; Stephen Maresh; Joel S. Emer

In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.
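The control idea described in this abstract can be illustrated with a small software sketch. It is not the paper's hardware or ISA: the `TriggeredInstruction` class, the `run_pe` scheduler, its fixed-priority choice among ready instructions, and the merge example are all hypothetical names and assumptions made for illustration. The sketch only shows the core notion that each instruction carries a guard (trigger) over PE state and channel status, so no program counter or explicit branch is needed. As a simplification, an empty input queue is treated here as an exhausted stream.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class TriggeredInstruction:
    name: str
    trigger: Callable[[], bool]  # guard over PE state and channel status
    action: Callable[[], None]   # effect performed when the guard holds


def run_pe(instructions: List[TriggeredInstruction], max_steps: int = 1000) -> List[str]:
    """Fire one ready instruction per step; stop when none is ready."""
    fired = []
    for _ in range(max_steps):
        ready = [ins for ins in instructions if ins.trigger()]
        if not ready:
            break
        ready[0].action()  # assumption: fixed-priority pick among ready instructions
        fired.append(ready[0].name)
    return fired


def merge_program(a: Deque[int], b: Deque[int], out: List[int]) -> List[TriggeredInstruction]:
    """Merge two sorted input channels; there is no program counter and no branch."""
    return [
        TriggeredInstruction("take_a",
                             lambda: bool(a) and bool(b) and a[0] <= b[0],
                             lambda: out.append(a.popleft())),
        TriggeredInstruction("take_b",
                             lambda: bool(a) and bool(b) and a[0] > b[0],
                             lambda: out.append(b.popleft())),
        # Simplification: an empty queue is treated as an exhausted stream.
        TriggeredInstruction("drain_a",
                             lambda: bool(a) and not b,
                             lambda: out.append(a.popleft())),
        TriggeredInstruction("drain_b",
                             lambda: bool(b) and not a,
                             lambda: out.append(b.popleft())),
    ]


a, b, out = deque([1, 4, 9]), deque([2, 3, 10]), []
trace = run_pe(merge_program(a, b, out))
print(out)
print(trace)
```

In this sketch the comparison that a sequential program would express as a branch is folded into the triggers themselves, which is one way to read the abstract's claim that programs can move between states without explicit branch instructions.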

international symposium on performance analysis of systems and software | 2008

Quick Performance Models Quickly: Closely-Coupled Partitioned Simulation on FPGAs

Michael Pellauer; Muralidaran Vijayaraghavan; Michael Adler; Arvind; Joel S. Emer

In this paper we explore microprocessor performance models implemented on FPGAs. While FPGAs can help with simulation speed, the increased implementation complexity can degrade model development time. We assess whether a simulator split into closely-coupled timing and functional partitions can address this by easing the development of timing models while retaining fine-grained parallelism. We give the semantics of our simulator partitioning, and discuss the architecture of its implementation on an FPGA. We describe how three timing models of vastly different target processors can use the same functional partition, and assess their performance.

international conference on formal methods and models for co-design | 2007

Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA

Nirav Dave; Kermin Fleming; Myron King; Michael Pellauer; Muralidaran Vijayaraghavan

The first MEMOCODE hardware/software co-design contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and PowerPC on a Xilinx Virtex IIPro30. In this paper we discuss our solution, which we implemented on a Xilinx XUP development board with 256 MB of DRAM. The design was done by the five authors over a span of approximately 3 weeks, though of the 15 possible man-weeks, about 9 were actually spent working on this problem. All hardware design was done using Bluespec SystemVerilog (BSV), with the exception of an imported Verilog multiplication unit, necessary only due to the limitations of the Xilinx FPGA toolflow optimizations.

field programmable gate arrays | 2008

A-Ports: an efficient abstraction for cycle-accurate performance models on FPGAs

Michael Pellauer; Muralidaran Vijayaraghavan; Michael Adler; Arvind; Joel S. Emer

Recently there has been interest in using FPGAs as a platform for cycle-accurate performance models. We discuss how the properties of FPGAs make them a good platform to achieve a performance improvement over software models. Some metrics are developed to gain insight into the strengths and weaknesses of different simulation methodologies. This paper introduces A-Ports, a distributed, efficient simulation scheme for creating cycle-accurate performance models on FPGAs. Finally, we quantitatively demonstrate an average performance improvement of 19% using A-Ports over other FPGA-based simulation schemes.

international conference on formal methods and models for co-design | 2005

Synthesis of synchronous assertions with guarded atomic actions

Michael Pellauer; Mieszko Lis; Donald Baltus; Rishiyur S. Nikhil

The SystemVerilog standard introduces SystemVerilog Assertions (SVA), a synchronous assertion package based on the temporal-logic semantics of PSL. Traditionally assertions are checked in software simulation. We introduce a method for synthesizing SVA directly into hardware modules in Bluespec SystemVerilog. This opens up new possibilities for FPGA-accelerated testbenches, hardware/software co-emulation, dynamic verification and fault-tolerance. We describe adding synthesizable assertions to a cache controller, and investigate their hardware cost.

field programmable gate arrays | 2013

Heracles: a tool for fast RTL-based design space exploration of multicore processors

Michel A. Kinsy; Michael Pellauer; Srinivas Devadas

This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of different topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly configure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracles.

international conference on formal methods and models for co-design | 2006

802.11a transmitter: a case study in microarchitectural exploration

Nirav Dave; Michael Pellauer; S. Gerding; Arvind

Hand-held devices have rigid constraints regarding power dissipation and energy consumption. Whether a new functionality can be supported often depends upon its power requirements. Concerns about the area (or cost) are generally addressed after a design can meet the performance and power requirements. Different micro-architectures have very different area, timing and power characteristics, and these need RTL-level models to be evaluated. In this paper we discuss the microarchitectural exploration of an 802.11a transmitter via synthesizable and highly-parameterized descriptions written in Bluespec SystemVerilog (BSV). We also briefly discuss why such architectural exploration would be practically infeasible without appropriate linguistic facilities. No knowledge of 802.11a or BSV is needed to read this paper.

field-programmable logic and applications | 2011

Heracles: Fully Synthesizable Parameterized MIPS-Based Multicore System

Michel A. Kinsy; Michael Pellauer; Srinivas Devadas

Heracles is an open-source complete multicore system written in Verilog. It is fully parameterized and can be reconfigured and synthesized into different topologies and sizes. Each processing node has a fully bypassed, 7-stage pipelined microprocessor running the MIPS-III ISA, a 4-stage input-buffer, virtual-channel router, and a local variable-size shared memory. Our design is highly modular with clear interfaces between the core, the memory hierarchy, and the on-chip network. In the baseline design, the microprocessor is attached to two caches, one instruction cache and one data cache, which are oblivious to the global memory organization. The memory system in Heracles can be configured as one single global shared memory (SM), or distributed shared memory (DSM), or any combination thereof. Each core is connected to the rest of the network of processors by a parameterized, realistic, wormhole router. We show different topology configurations of the system, and their synthesis results on the Xilinx Virtex-5 LX330T FPGA board. We also provide a small MIPS cross-compiler tool chain to assist in developing software for Heracles.
