Publication


Featured research published by Michael K. Papamichael.


ACM Transactions on Reconfigurable Technology and Systems | 2009

ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs

Eric S. Chung; Michael K. Papamichael; Eriko Nurvitadhi; James C. Hoe; Ken Mai; Babak Falsafi

Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the ProtoFlex simulation architecture, which uses FPGAs to accelerate full-system multiprocessor simulation and to facilitate high-performance instrumentation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, ProtoFlex virtualizes the execution of many logical processors onto a consolidated number of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver the necessary simulation performance at a large savings in complexity. Further, to achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the ProtoFlex simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server, hosted on a single Xilinx Virtex-II Pro (XC2VP70) FPGA. On average, the simulator achieves a 38× speedup (as high as 49×) over comparable software simulation across a suite of applications, including OLTP on a commercial database server. We also demonstrate the advantages of minimal-overhead FPGA-accelerated instrumentation through a CMP cache simulation technique that runs orders of magnitude faster than software.
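To make the virtualization and transplanting ideas concrete, here is a minimal Python sketch (not ProtoFlex code): many logical CPU contexts are interleaved on a small number of execution engines, and instructions outside the engine's implemented subset are handed off to a software simulator. The class names, instruction set, and engine count are illustrative assumptions.

```python
# Sketch of ProtoFlex-style virtualization: many logical CPUs time-share a few
# multiple-context execution engines; rare or complex instructions are "transplanted"
# to a software full-system simulator. All classes and behaviors are illustrative.

from collections import deque

COMMON_OPS = {"add", "load", "store", "branch"}   # behaviors implemented in the FPGA engine

class SoftwareSimulator:
    """Stand-in for the full-system software model that preserves completeness."""
    def execute(self, cpu_state, instr):
        cpu_state["pc"] += 1          # pretend the rare instruction was handled here
        return cpu_state

class ExecutionEngine:
    """One multiple-context engine interleaving several logical CPU contexts."""
    def __init__(self, contexts, software_sim):
        self.ready = deque(contexts)  # round-robin over resident CPU contexts
        self.software_sim = software_sim

    def step(self, fetch_instr):
        cpu = self.ready.popleft()
        instr = fetch_instr(cpu)
        if instr in COMMON_OPS:
            cpu["pc"] += 1            # fast path: executed directly on the engine
        else:
            cpu = self.software_sim.execute(cpu, instr)  # transplant to software
        self.ready.append(cpu)

# 16 logical CPUs virtualized onto 2 engines (the ratio is a tunable design knob).
cpus = [{"id": i, "pc": 0} for i in range(16)]
engines = [ExecutionEngine(cpus[i::2], SoftwareSimulator()) for i in range(2)]
for _ in range(100):
    for engine in engines:
        engine.step(lambda cpu: "add" if cpu["pc"] % 50 else "cas")
print([c["pc"] for c in cpus])
```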


networks on chips | 2011

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations

Michael K. Papamichael; James C. Hoe; Onur Mutlu

FIST (Fast Interconnect Simulation Techniques) is a fast and simple packet latency estimator to replace time-consuming detailed Network-on-Chip (NoC) models in full-system performance simulators. FIST combines ideas from analytical network modeling and execution-driven simulation models. The main idea is to abstractly model each router as a load-delay curve, track each router's load at runtime, and sum the load-dependent delay at each router a packet visits to obtain the packet's latency. The resulting latency estimator can accurately capture subtle load-dependent behaviors of a NoC but is much simpler than a full-blown execution-driven model. We study two variations of FIST in the context of a software-based, cycle-level simulation of a tiled chip-multiprocessor (CMP). We evaluate FIST's accuracy and performance relative to the CMP simulator's original execution-driven 2D-mesh NoC model. A static FIST approach (trained offline using uniform random synthetic traffic) achieves less than 6% average error in packet latency and up to 43× average speedup for a 16×16 mesh. A dynamic FIST approach that adds periodic online training reduces the average packet latency error to less than 2% and still maintains an average speedup of up to 18× for a 16×16 mesh. Moreover, an FPGA-based realization of FIST can simulate 2D-mesh networks of up to 24×24 nodes at 3 to 4 orders of magnitude speedup over software-based simulators.
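The estimation step lends itself to a short sketch. The Python below models each router as a piecewise-linear load-delay curve and sums the load-dependent delays along a packet's route; the curve values, interpolation scheme, and fixed per-hop link delay are assumptions for illustration, not FIST's actual implementation.

```python
# Minimal sketch of the FIST idea: each router is abstracted as a load -> delay curve;
# a packet's estimated latency is the sum of load-dependent delays along its route,
# using each router's load tracked at runtime. The curve values below are made up.

import bisect

class RouterModel:
    def __init__(self, loads, delays):
        # piecewise-linear load-delay curve, e.g. trained offline on synthetic traffic
        self.loads, self.delays = loads, delays
        self.current_load = 0.0          # updated at runtime by the full-system simulator

    def delay(self):
        """Interpolate the delay for the router's current load."""
        i = bisect.bisect_left(self.loads, self.current_load)
        if i == 0:
            return self.delays[0]
        if i == len(self.loads):
            return self.delays[-1]
        lo, hi = self.loads[i - 1], self.loads[i]
        frac = (self.current_load - lo) / (hi - lo)
        return self.delays[i - 1] + frac * (self.delays[i] - self.delays[i - 1])

def estimate_packet_latency(route, hop_delay=1.0):
    """Sum per-router load-dependent delays plus a fixed per-hop link delay."""
    return sum(router.delay() + hop_delay for router in route)

# Example: a 3-hop route through routers with different observed loads.
curve_loads, curve_delays = [0.0, 0.3, 0.6, 0.9], [2.0, 3.0, 6.0, 20.0]
route = [RouterModel(curve_loads, curve_delays) for _ in range(3)]
route[0].current_load, route[1].current_load, route[2].current_load = 0.1, 0.5, 0.8
print(f"estimated latency: {estimate_packet_latency(route):.1f} cycles")
```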


formal methods | 2011

Fast scalable FPGA-based Network-on-Chip simulation models

Michael K. Papamichael

This paper presents a set of two FPGA-based Network-on-Chip (NoC) simulation engines that together formed the winning design of the 2011 MEMOCODE Design Contest in the absolute performance class. Both simulation engines were developed in Bluespec SystemVerilog (BSV) and were implemented on a Xilinx ML605 FPGA development board. For smaller networks and simpler router configurations, a direct-mapped approach was employed, where the network to be simulated was directly implemented on the FPGA. For larger networks, where a direct-mapped approach is not feasible due to FPGA resource limitations, a virtualized, time-multiplexed approach was used. Compared to the provided software reference implementation, our direct-mapped approach achieves three orders of magnitude speedup, while our virtualized time-multiplexed approach achieves one to two orders of magnitude speedup, depending on the network and router configuration.
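A minimal sketch of the virtualized, time-multiplexed idea follows (in Python rather than the Bluespec used in the paper): a single router-evaluation routine stands in for the one physical router pipeline that is reused for every logical router in each simulated cycle. The per-router state and the evaluation body are placeholders.

```python
# A toy model of the virtualized, time-multiplexed approach: one shared
# router-evaluation routine (standing in for a single physical router pipeline)
# is reused for every logical router in every simulated cycle, so only per-router
# state has to scale with network size. All structures here are placeholders.

class LogicalRouterState:
    def __init__(self):
        self.input_queues = [["flit"] for _ in range(4)]   # N, E, S, W input buffers

def evaluate_router(state):
    """Stand-in for one pass through the shared physical router pipeline."""
    for queue in state.input_queues:
        if queue:
            queue.pop(0)                                   # pretend one flit advances

def simulate(width, height, cycles):
    routers = [LogicalRouterState() for _ in range(width * height)]
    for _ in range(cycles):
        # Time multiplexing: one simulated cycle costs width*height passes through
        # the single pipeline, instead of needing width*height physical routers.
        for state in routers:
            evaluate_router(state)

simulate(width=24, height=24, cycles=10)   # large meshes fit because only state scales
```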


field programmable gate arrays | 2012

Prototype and evaluation of the CoRAM memory architecture for FPGA-based computing

Eric S. Chung; Michael K. Papamichael; Gabriel Weisz; James C. Hoe; Ken Mai

The CoRAM memory architecture for FPGA-based computing augments traditional reconfigurable fabric with a natural and effective way for applications to interact with off-chip memory and I/O. The two central tenets of the CoRAM memory architecture are (1) the deliberate separation of concerns between computation and data marshalling and (2) the use of a multithreaded software abstraction to replace FSM-based memory control logic. To evaluate the viability of the CoRAM memory architecture, we developed a full RTL implementation of a CoRAM microarchitecture instance that can be synthesized for standard cells or emulated on FPGAs. The results of our evaluation show that a soft emulation of the CoRAM memory architecture on current FPGAs can be impractical for memory-intensive, large-scale applications due to the high performance and area penalties incurred by the soft mechanisms. The results further show that in an envisioned FPGA built with CoRAM in mind, the introduction of hard macro blocks for data distribution can mitigate these inefficiencies, allowing applications to take advantage of the CoRAM memory architecture for ease of programmability and portability while still enjoying performance and efficiency comparable to RTL-level application development on conventional FPGAs.
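As an illustration of the second tenet, the Python sketch below mimics a control thread that marshals data from off-chip memory into an on-chip buffer and synchronizes with a compute kernel through simple tokens. This is only an analogy to the abstraction described, not the CoRAM API; all names, sizes, and the token mechanism are made up.

```python
# Illustrative analogue of a CoRAM-style control thread (not the actual CoRAM API):
# instead of a hand-written FSM, a sequential "control thread" marshals data between
# off-chip memory and an on-chip buffer and synchronizes with the compute kernel.

import threading
import queue

EDGE_MEMORY = list(range(1024))          # stand-in for off-chip DRAM
CORAM_BUFFER = [0] * 64                  # stand-in for an on-chip block RAM
ready_tokens = queue.Queue()             # control thread -> compute: buffer is valid
done_tokens = queue.Queue()              # compute -> control thread: buffer consumed

def control_thread():
    """Sequential data-marshalling logic replacing an FSM-based memory controller."""
    for block in range(0, len(EDGE_MEMORY), len(CORAM_BUFFER)):
        CORAM_BUFFER[:] = EDGE_MEMORY[block:block + len(CORAM_BUFFER)]  # "DMA" into CoRAM
        ready_tokens.put(block)          # tell compute logic the buffer is valid
        done_tokens.get()                # wait until compute releases the buffer
    ready_tokens.put(None)               # end-of-stream marker

def compute_kernel():
    """Consumer standing in for the application's RTL datapath."""
    total = 0
    while ready_tokens.get() is not None:
        total += sum(CORAM_BUFFER)       # operate only on on-chip data
        done_tokens.put(True)
    print("checksum:", total)

producer = threading.Thread(target=control_thread)
consumer = threading.Thread(target=compute_kernel)
producer.start(); consumer.start()
producer.join(); consumer.join()
```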


international symposium on microarchitecture | 2011

Thread Cluster Memory Scheduling

Yoongu Kim; Michael K. Papamichael; Onur Mutlu; Mor Harchol-Balter

Memory schedulers in multicore systems should carefully schedule memory requests from different threads to ensure high system performance and fair, fast progress of each thread. No existing memory scheduler provides both the highest system performance and highest fairness. Thread Cluster Memory scheduling is a new algorithm that achieves the best of both worlds by differentiating latency-sensitive threads from bandwidth-sensitive ones and employing different scheduling policies for each.
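A simplified sketch of the clustering idea follows: threads are split by memory intensity into a latency-sensitive and a bandwidth-sensitive cluster, the former is always prioritized, and priorities within the latter are periodically shuffled for fairness. The threshold, fields, and shuffling policy are illustrative, not the exact algorithm from the paper.

```python
# Simplified sketch of thread clustering for memory scheduling: separate
# latency-sensitive from bandwidth-sensitive threads by memory intensity,
# always prioritize the latency-sensitive cluster, and periodically shuffle
# priorities within the bandwidth-sensitive cluster for fairness.

import random

threads = [
    {"id": 0, "mem_intensity": 0.5},    # misses per kilo-instruction (made-up values)
    {"id": 1, "mem_intensity": 12.0},
    {"id": 2, "mem_intensity": 0.9},
    {"id": 3, "mem_intensity": 25.0},
]

def cluster_threads(threads, threshold=2.0):
    latency_sensitive = [t for t in threads if t["mem_intensity"] < threshold]
    bandwidth_sensitive = [t for t in threads if t["mem_intensity"] >= threshold]
    return latency_sensitive, bandwidth_sensitive

def scheduling_order(latency_sensitive, bandwidth_sensitive):
    """Latency-sensitive threads always come first; bandwidth-sensitive priorities
    are reshuffled each quantum so no single memory-heavy thread is starved."""
    random.shuffle(bandwidth_sensitive)
    return latency_sensitive + bandwidth_sensitive

latency_cluster, bandwidth_cluster = cluster_threads(threads)
for quantum in range(3):
    order = scheduling_order(latency_cluster, bandwidth_cluster)
    print("quantum", quantum, "priority order:", [t["id"] for t in order])
```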


design automation conference | 2015

Nautilus: fast automated IP design space search using guided genetic algorithms

Michael K. Papamichael; Peter A. Milder; James C. Hoe

Today's offerings of parameterized hardware IP generators permit very high degrees of performance and implementation customization. Nevertheless, it is ultimately still left to the IP users to set IP parameters to achieve the desired tuning effects. For the average IP user, the knowledge and effort required to navigate a complex IP's design space can significantly offset the productivity gain from using the IP. This paper presents an approach that builds into an IP generator an extended genetic algorithm (GA) to perform automatic IP parameter tuning. In particular, we propose extensions that allow IP authors to embed pertinent designer knowledge to improve GA performance. In the context of several IP generators, our evaluations show that (1) GA is an effective solution to this problem and (2) our modified, IP-author-guided GA can reach the same quality of results up to an order of magnitude faster than the basic GA.
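The guided-GA idea can be sketched compactly: a standard genetic algorithm over IP parameter settings, with author-supplied hints biasing initialization and mutation toward parameter values known to work well. The parameter space, cost function, hint format, and bias probability below are all invented for illustration and are not Nautilus itself.

```python
# Sketch of a guided genetic algorithm: a plain GA over IP parameter settings,
# plus author-supplied "hints" that bias initialization and mutation toward
# parameter values the IP author expects to perform well.

import random

PARAM_SPACE = {"width": [1, 2, 4, 8], "depth": [1, 2, 3, 4], "banks": [1, 2, 4]}
AUTHOR_HINTS = {"width": [4, 8]}         # embedded designer knowledge: wide configs tend to win

def cost(cfg):
    """Stand-in for synthesizing and evaluating one candidate configuration."""
    return abs(cfg["width"] * cfg["banks"] - 16) + cfg["depth"]

def sample(param):
    """Draw a value for one parameter, biased toward author-hinted values."""
    if param in AUTHOR_HINTS and random.random() < 0.7:
        return random.choice(AUTHOR_HINTS[param])
    return random.choice(PARAM_SPACE[param])

def random_config():
    return {p: sample(p) for p in PARAM_SPACE}

def mutate(cfg):
    child = dict(cfg)
    p = random.choice(list(PARAM_SPACE))
    child[p] = sample(p)                 # mutation also respects the hints
    return child

def crossover(a, b):
    return {p: random.choice([a[p], b[p]]) for p in PARAM_SPACE}

population = [random_config() for _ in range(20)]
for _ in range(30):
    population.sort(key=cost)
    survivors = population[:10]          # keep the fitter half
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(10)]
    population = survivors + children
print("best configuration found:", min(population, key=cost))
```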


international conference on embedded computer systems: architectures, modeling, and simulation | 2007

Prototyping Efficient Interprocessor Communication Mechanisms

Vassilis Papaefstathiou; Dionisios N. Pnevmatikatos; Manolis Marazakis; Giorgos Kalokairinos; Aggelos Ioannou; Michael K. Papamichael; Stamatis G. Kavadias; Giorgos Mihelogiannakis; Manolis Katevenis

Parallel computing systems are becoming widespread and growing in sophistication. Besides simulation, rapid system prototyping is becoming important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high-speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as remote DMA, remote queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, showing how software can take advantage of the provided features but also exposing the weaknesses of the system.


international symposium on performance analysis of systems and software | 2015

DELPHI: a framework for RTL-based architecture design evaluation using DSENT models

Michael K. Papamichael; Cagla Cakir; Chen Sun; Chia-Hsin Owen Chen; James C. Hoe; Ken Mai; Li-Shiuan Peh; Vladimir Stojanovic

Computer architects are increasingly interested in evaluating their ideas at the register-transfer level (RTL) to gain more precise insights on the key characteristics (frequency, area, power) of a micro/architectural design proposal. However, the RTL synthesis process is notoriously tedious, slow, and error-prone and is often outside the area of expertise of a typical computer architect, as it requires familiarity with complex CAD flows, hard-to-get tools and standard cell libraries. The effort is further multiplied when targeting multiple technology nodes and standard cell variants to study technology dependence. This paper presents DELPHI, a flexible, open framework that leverages the DSENT modeling engine for faster, easier, and more efficient characterization of RTL hardware designs. DELPHI first synthesizes a Verilog or VHDL RTL design (either using the industry-standard Synopsys Design Compiler tool or a combination of open-source tools) to an intermediate structural netlist. It then processes the resulting synthesized netlist to generate a technology-independent DSENT design model. This model can then be used within a modified version of the DSENT flow to perform very fast estimation (one to two orders of magnitude faster than full RTL synthesis) of hardware performance characteristics, such as frequency, area, and power, across a variety of DSENT technology models (e.g., 65nm Bulk, 32nm SOI, 11nm Tri-Gate, etc.). In our evaluation using 26 RTL design examples, DELPHI and DSENT were consistently able to closely track and capture design trends of conventional RTL synthesis results without the associated delay and complexity. We are releasing the full DELPHI framework (including a fully open-source flow) at http://www.ece.cmu.edu/CALCM/delphi/.
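The three-stage flow can be summarized as a pipeline sketch. Every function below is a hypothetical placeholder (none of this is DELPHI or DSENT code); the only point it illustrates is that a single technology-independent model is built once from the synthesized netlist and then evaluated against multiple technology models.

```python
# Hedged sketch of the flow described above, with hypothetical placeholders:
# (1) synthesize RTL to a structural netlist, (2) convert the netlist into a
# technology-independent model, (3) evaluate that one model under several
# technology assumptions. The numbers are made up.

def synthesize_to_netlist(rtl_files):
    """Placeholder for running synthesis to obtain a gate-level structural netlist."""
    return {"cells": [("NAND2", 1200), ("DFF", 400)]}        # made-up netlist summary

def netlist_to_model(netlist):
    """Placeholder for mapping netlist cells onto technology-independent primitives."""
    return {"gates": sum(count for _, count in netlist["cells"])}

def estimate(model, technology):
    """Placeholder for a DSENT-style analytical estimate under one technology model."""
    scale = {"65nm": 1.0, "32nm": 0.45, "11nm": 0.12}[technology]
    return {"area_um2": model["gates"] * 2.0 * scale}

netlist = synthesize_to_netlist(["router.v"])
model = netlist_to_model(netlist)                            # built once, technology-independent
for tech in ("65nm", "32nm", "11nm"):
    print(tech, estimate(model, tech))                       # reused across technology nodes
```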


field-programmable custom computing machines | 2013

ShrinkWrap: Compiler-Enabled Optimization and Customization of Soft Memory Interconnects

Eric S. Chung; Michael K. Papamichael

Today's FPGAs lack dedicated on-chip memory interconnects, requiring users to (1) rely on inefficient, general-purpose solutions, or (2) tediously create an application-specific memory interconnect for each target platform. The CoRAM architecture, which offers a general-purpose abstraction for FPGA memory management, encodes high-level application information that can be exploited to generate customized soft memory interconnects. This paper describes the ShrinkWrap Compiler, which analyzes a CoRAM application for its connectivity and bandwidth requirements, enabling synthesis of highly tuned, area-efficient soft memory interconnects.
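As a rough illustration of the kind of analysis described (not the ShrinkWrap Compiler itself), the sketch below takes a hypothetical list of declared block-to-memory connections with bandwidth requirements and emits only the links that are actually needed, each sized to the narrowest sufficient width.

```python
# Illustrative sketch of connectivity/bandwidth-driven interconnect customization:
# instead of instantiating a full general-purpose crossbar, keep only the endpoint
# pairs that actually communicate and size each link to its bandwidth requirement.
# The application description, widths, and conversion factor are all hypothetical.

connections = [                     # (compute block, memory channel, GB/s needed)
    ("fft_stage0", "dram_ch0", 3.2),
    ("fft_stage1", "dram_ch0", 1.6),
    ("accumulator", "dram_ch1", 0.4),
]

LINK_WIDTHS_BITS = [32, 64, 128, 256]
BITS_PER_GBPS = 32                  # made-up conversion: datapath bits per GB/s at target clock

def size_link(bandwidth_gbps):
    """Pick the narrowest available link width that satisfies the requirement."""
    need = bandwidth_gbps * BITS_PER_GBPS
    return min(width for width in LINK_WIDTHS_BITS if width >= need)

# Generate only the links the application needs, each at the narrowest sufficient width.
interconnect = [(src, dst, size_link(bw)) for src, dst, bw in connections]
for src, dst, width in interconnect:
    print(f"{src} -> {dst}: {width}-bit link")
```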


field programmable gate arrays | 2013

Cross-platform FPGA accelerator development using CoRAM and CONNECT

Eric S. Chung; Michael K. Papamichael; Gabriel Weisz; James C. Hoe

The CoRAM memory architecture is an easy-to-use and portable abstraction for FPGA accelerator development [1, 2]. Using the CoRAM framework, FPGA developers can write their applications once and re-target them automatically to different FPGA platforms and devices (e.g., Xilinx ML605, Altera DE4, ZYNQ-702, etc.). In this tutorial, participants will learn the key concepts of the CoRAM Virtual Architecture and the underlying CONNECT Network-on-Chip generation framework [3]. The tutorial is organized into three parts. The first part will provide an overview of the CoRAM Virtual Architecture and include a hands-on section where participants will work on a small example to get first-hand experience with the CoRAM development flow. The second part of the tutorial will provide a beneath-the-hood look at CoRAM and cover more advanced topics. These topics include memory loading, user I/O, and debugging, as well as a segment on the CONNECT NoC generation framework, which serves as the on-chip interconnect for CoRAM. The final part of the tutorial will be devoted to more advanced exercises and demos, as well as a Q&A session for CoRAM and CONNECT. The tutorial assumes a basic understanding of RTL design and C programming. To join in on the hands-on exercise, attendees need laptops with 15GB of free space and VirtualBox installed. Please visit http://www.ece.cmu.edu/~coram for information about CoRAM and updates on this tutorial.

Collaboration


Dive into Michael K. Papamichael's collaborations.

Top Co-Authors

James C. Hoe
Carnegie Mellon University

Ken Mai
Carnegie Mellon University

Yoongu Kim
Carnegie Mellon University

Gabriel Weisz
Carnegie Mellon University

Cagla Cakir
Carnegie Mellon University

Chen Sun
Massachusetts Institute of Technology