Greg Stitt | Researchain

Publication

Featured research published by Greg Stitt.

design automation conference | 2003

Dynamic hardware/software partitioning: a first approach

Greg Stitt; Roman L. Lysecky; Frank Vahid

Partitioning an application among software running on a microprocessor and hardware co-processor in on-chip configurable logic has been shown to improve performance and energy consumption in embedded systems. Meanwhile, dynamic software optimization methods have shown the usefulness and feasibility of runtime program optimization, but those optimizations do not achieve as much as partitioning. We introduce a first approach to dynamic hardware/software partitioning. We describe our system architecture and initial on-chip tools, including profiler, decompiler, synthesis, and placement and routing tools for a simplified configurable logic fabric, able to perform dynamic partitioning of real benchmarks. We show speedups averaging 2.6 for five benchmarks taken from PowerStone, NetBench and our own benchmarks.

design automation conference | 2006

Warp Processors

Roman L. Lysecky; Greg Stitt; Frank Vahid

We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically reimplementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools utilize acceptably small amounts of computation and memory, which are far less than traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.

field programmable gate arrays | 2012

A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications

Jeremy Fowers; Greg Brown; Patrick Cooke; Greg Stitt

With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedup of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
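As a rough illustration of the sliding-window pattern named in this abstract, the sketch below slides a small kernel over every position of a 2D input and computes one value per window. Sum of absolute differences is used here only as a common example of the pattern; the function name and the plain nested-list representation are assumptions made for this sketch and are not taken from the paper.

```python
def sliding_window_sad(image, kernel):
    """Slide `kernel` over `image` and return the sum of absolute
    differences (SAD) for every fully overlapping window position.

    Both arguments are rectangular lists of lists of numbers
    (a representation assumed for this sketch)."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])

    result = []
    # One output value per window position; windows overlap heavily,
    # which is what makes the pattern attractive for data reuse.
    for row in range(image_h - kernel_h + 1):
        out_row = []
        for col in range(image_w - kernel_w + 1):
            sad = sum(
                abs(image[row + i][col + j] - kernel[i][j])
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            out_row.append(sad)
        result.append(out_row)
    return result
```

Each output value is independent of the others, so the windows can in principle be evaluated in parallel; how well a given device exploits that is the question the comparison addresses.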

ACM Transactions on Embedded Computing Systems | 2004

Energy savings and speedups from partitioning critical software loops to hardware in embedded systems

Greg Stitt; Frank Vahid; Shawn Nematbakhsh

We present results of extensive hardware/software partitioning experiments on numerous benchmarks. We describe our loop-oriented partitioning methodology for moving critical code from software to hardware. Our benchmarks included programs from PowerStone, MediaBench, and NetBench. Our experiments included estimated results for partitioning using an 8051 8-bit microcontroller or a 32-bit MIPS microprocessor for the software, and using on-chip configurable logic or custom application-specific integrated circuit hardware for the hardware. Additional experiments involved actual measurements taken from several physical implementations of hardware/software partitionings on real single-chip microprocessor/configurable-logic devices. We also estimated results assuming voltage scalable processors. We provide performance, energy, and size data for all of the experiments. We found that the benchmarks spent an average of 80% of their execution time in only 3% of their code, amounting to only about 200 bytes of critical code. For various experiments, we found that moving critical code to hardware resulted in average speedups of 3 to 5 and average energy savings of 35% to 70%, with average hardware requirements of only 5,000 to 10,000 gates. To our knowledge, these experiments represent the most comprehensive hardware/software partitioning study published to date.
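The 80% figure in this abstract bounds what partitioning can achieve. As a minimal sketch, assuming Amdahl's law as the model (this is not the paper's own estimation method, and the kernel speedup values below are chosen purely for illustration), the overall speedup when a fraction f of execution time is accelerated by a factor s is 1 / ((1 - f) + f / s):

```python
def overall_speedup(critical_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only the critical
    fraction of execution time is accelerated by `kernel_speedup`."""
    if not 0.0 <= critical_fraction <= 1.0:
        raise ValueError("critical_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - critical_fraction            # time left in software
    accelerated = critical_fraction / kernel_speedup  # time spent in hardware
    return 1.0 / (remaining + accelerated)


if __name__ == "__main__":
    # 80% of time in critical loops, with illustrative hardware speedups.
    for s in (4.0, 10.0, float("inf")):
        print(f"kernel speedup {s}: overall {overall_speedup(0.8, s):.2f}")
```

With f = 0.8 the overall speedup can approach but not exceed 5, however fast the hardware kernel is, which is in line with the reported averages of 3 to 5.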

languages, compilers, and tools for embedded systems | 2010

Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing

John Robert Wernsing; Greg Stitt

Over the past decade, system architectures have started on a clear trend towards increased parallelism and heterogeneity, often resulting in speedups of 10x to 100x. Despite numerous compiler and high-level synthesis studies, usage of such systems has largely been limited to device experts, due to significantly increased application design complexity. To reduce application design complexity, we introduce elastic computing, a framework that separates functionality from implementation details by enabling designers to use specialized functions, called elastic functions, which enable an optimization framework to explore thousands of possible implementations, even ones using different algorithms. Elastic functions allow designers to execute the same application code efficiently on potentially any architecture and for different runtime parameters such as input size, battery life, etc. In this paper, we present an initial elastic computing framework that transparently optimizes application code onto diverse systems, achieving significant speedups ranging from 1.3x to 46x on a hyper-threaded Xeon system with an FPGA accelerator, a 16-CPU Opteron system, and a quad-core Xeon system.

IEEE Computer | 2008

Warp Processing: Dynamic Translation of Binaries to FPGA Circuits

Frank Vahid; Greg Stitt; Roman L. Lysecky

Warp processing dynamically and transparently transforms an executing microprocessor's binary kernels into customized field-programmable gate array (FPGA) circuits, commonly resulting in 2X to 100X speedup over executing on microprocessors. A new architecture and set of dynamic CAD tools demonstrate warp processing's potential.

languages, compilers, and tools for embedded systems | 2003

Profiling tools for hardware/software partitioning of embedded applications

Dinesh C. Suresh; Walid A. Najjar; Frank Vahid; Jason R. Villarreal; Greg Stitt

Loops constitute the most executed segments of programs and therefore are the best candidates for hardware/software partitioning. We present a set of profiling tools that are specifically dedicated to loop profiling and support combined function and loop profiling. One tool relies on an instruction set simulator and can therefore be augmented with simulation of architecture and micro-architecture features, while the other is based on compile-time instrumentation of gcc and therefore has very little slowdown compared to the original program. We use the results of the profiling to identify the compute core in each benchmark and study the effect of compile-time optimization on the distribution of cores in a program. We also study the potential speedup that can be achieved using a configurable system on a chip, consisting of a CPU embedded on an FPGA, as an example application of these tools in hardware/software partitioning.

IEEE Design & Test of Computers | 2002

Energy advantages of microprocessor platforms with on-chip configurable logic

Greg Stitt; Frank Vahid

System chips that incorporate configurable logic can reduce the energy consumed in executing software. The key is to use the configurable logic to execute performance-critical loops, producing average energy savings of 25% to 71% for embedded-system benchmarks.

international conference on computer aided design | 2002

Hardware/software partitioning of software binaries

Greg Stitt; Frank Vahid

Partitioning an embedded system application among a microprocessor and custom hardware has been shown to improve the performance, power or energy of numerous examples. The advent of single-chip microprocessor/FPGA platforms makes such partitioning even more attractive. Previous partitioning approaches have partitioned sequential program source code, such as C or C++. We introduce a new approach that partitions at the software binary level. Although source code partitioning is preferable from a purely technical viewpoint, binary-level partitioning provides several very practical benefits for commercial acceptance. We demonstrate that binary-level partitioning yields competitive speedup results compared to source-level partitioning, achieving an average speedup of 1.4, compared to 1.5 for source-level partitioning, across eight benchmarks partitioned on a single-chip microprocessor/FPGA device.

international conference on hardware/software codesign and system synthesis | 2010

Intermediate fabrics: virtual architectures for circuit portability and fast placement and routing

James Coole; Greg Stitt

Although hardware/software partitioning of embedded applications onto FPGAs is widely known to have performance and power advantages, FPGA usage has been typically limited to hardware experts, due largely to several problems: 1) difficulty of integrating hardware design tools into well-established software tool flows, 2) increasingly lengthy FPGA design iterations due to placement and routing, and 3) a lack of portability and interoperability resulting from device/platform-specific tools and bitfiles. In this paper, we directly address the last two problems by introducing intermediate fabrics, which are virtual reconfigurable architectures specialized for different application domains, implemented on top of commercial-off-the-shelf (COTS) devices. Such specialization enables near-instantaneous placement and routing by hiding the complexity of fine-grained physical devices, while also enabling circuit portability across all devices that implement the intermediate fabric. When combined with existing work on runtime synthesis from software binaries, intermediate fabrics reduce the effects of all three problems by enabling transparent usage of COTS FPGAs by software designers. In this paper, we explore intermediate fabric architectures using specialization techniques to minimize area and performance overhead of the virtual fabric while maximizing routability and speedup of placement and routing. We present results showing an average placement and routing speedup of 554×, with an average area overhead of 10% and clock overhead of 18%, which corresponds to an average frequency of 195 MHz.
