Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Peter Yiannacouras is active.

Publication


Featured research published by Peter Yiannacouras.


Compilers, Architecture, and Synthesis for Embedded Systems | 2005

The microarchitecture of FPGA-based soft processors

Peter Yiannacouras; Jonathan Rose; J. Gregory Steffan

As more embedded systems are built using FPGA platforms, there is an increasing need to support processors in FPGAs. One option is the soft processor, a programmable instruction processor implemented in the reconfigurable logic of the FPGA. Commercial soft processors have been widely deployed, and hence we are motivated to understand their microarchitecture. We must re-evaluate microarchitecture in the soft processor context because an FPGA platform is significantly different from an ASIC platform---for example, the relative speed of memory and logic is quite different in the two platforms, as is the area cost. In this paper we present an infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power. Using our automatically generated soft processors we explore the microarchitecture trade-off space, including: (i) hardware vs. software multiplication support; (ii) shifter implementations; and (iii) pipeline depth, organization, and forwarding. For example, we find that a 3-stage pipeline has better wall-clock-time performance than deeper pipelines, despite its lower clock frequency. We also compare our designs to Altera's Nios II commercial soft processor variations and find that our automatically generated designs span the design space while remaining very competitive.
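
As a simple illustration of the first trade-off named above (a hypothetical sketch, not code from the paper), the routine below performs the shift-and-add multiplication that a soft processor without hardware multiplication support would typically execute in software; treating the operands as 32-bit unsigned values is an assumption made here for concreteness.

```python
def soft_mul32(a: int, b: int) -> int:
    """Shift-and-add multiply, the kind of software routine a soft processor
    without a hardware multiplier falls back on. Treating the operands and
    result as 32-bit unsigned values is an illustrative assumption."""
    mask = 0xFFFFFFFF
    a &= mask
    b &= mask
    result = 0
    while b:
        if b & 1:                # add the shifted multiplicand when the low bit is set
            result = (result + a) & mask
        a = (a << 1) & mask      # shift the multiplicand left by one
        b >>= 1                  # consume one multiplier bit
    return result

assert soft_mul32(1234, 5678) == (1234 * 5678) & 0xFFFFFFFF
```

The cost of iterating such a loop in software versus instantiating a hardware multiplier is exactly the kind of area/performance trade-off the generated processors expose.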


Compilers, Architecture, and Synthesis for Embedded Systems | 2008

VESPA: portable, scalable, and flexible FPGA-based vector processors

Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

While soft processors are increasingly common in FPGA-based embedded systems, it remains a challenge to scale their performance. We propose extending soft processor instruction sets to include support for vector processing. The resulting system of vectorized software and soft vector processor hardware is (i) portable to any FPGA architecture and vector processor configuration, (ii) scalable to larger and higher-performance designs, and (iii) flexible, allowing the underlying vector processor to be customized to match the needs of each application. Using our robust and verified parameterized vector processor design and industry-standard EEMBC benchmarks, we evaluate the performance and area trade-offs for different soft vector processor configurations using an FPGA development platform with DDR SDRAM. We find that on average we can scale performance from 1.8x up to 6.3x for a vector processor design that saturates the capacity of our platform's Stratix 1S80 FPGA. We also automatically generate application-specific vector processors with reduced datapath width and instruction set support, which combined reduce the area by up to 70% (61% on average) without affecting performance.
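
To make the lane-scaling idea concrete, here is a deliberately simplified first-order model (an assumption of this sketch, not a result from the paper): each vector lane processes one element per cycle, so a vector operation of length VL takes roughly ceil(VL / lanes) cycles, which is why adding lanes scales performance until they outnumber the useful vector length.

```python
import math

def vector_op_cycles(vector_length: int, lanes: int) -> int:
    """First-order cycle count for one vector instruction: with `lanes`
    parallel datapaths, ceil(VL / lanes) cycles cover all elements
    (memory stalls and issue overhead are ignored -- an assumption)."""
    return math.ceil(vector_length / lanes)

# Doubling the lanes at most halves the cycle count, so speedup saturates
# once the lane count approaches the application's vector lengths.
for lanes in (1, 2, 4, 8, 16):
    print(f"{lanes:2d} lanes -> {vector_op_cycles(64, lanes)} cycles")
```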


Field Programmable Gate Arrays | 2006

Application-specific customization of soft processor microarchitecture

Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

A key advantage of soft processors (processors built on an FPGA programmable fabric) over hard processors is that they can be customized to suit an application program's specific software. This notion has been exploited in the past principally through the use of application-specific instructions. While commercial soft processors are now widely deployed, they are available in only a few microarchitectural variations. In this work we explore the advantage of tuning the processor's microarchitecture to specific software applications, and show that there are significant advantages in doing so. Using an infrastructure for automatically generating soft processors that span the area/speed design space (while remaining competitive with Altera's Nios II variations), we explore the impact of tuning several aspects of microarchitecture, including: (i) hardware vs. software multiplication support; (ii) shifter implementation; and (iii) pipeline depth, organization, and forwarding. We find that the processor design that is fastest overall (on average across our embedded benchmark applications) is often also the fastest design for an individual application. However, in terms of area efficiency (i.e., performance-per-area), we demonstrate that a tuned microarchitecture can offer up to 30% improvement for three of the benchmarks and on average 11.4% improvement over the fastest-on-average design. We also show that our benchmark applications use only 50% of the available instructions on average, and that a processor customized to support only that subset of the ISA for a specific application can on average offer 25% savings in both area and energy. Finally, when both techniques for customization are combined we obtain an average improvement in performance-per-area of 25%.
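
The instruction-subsetting result above can be illustrated with a small sketch (hypothetical, not the authors' tool flow): scan a disassembly listing, collect the distinct mnemonics the application actually uses, and report what fraction of the full ISA that covers; datapath logic for the unused instructions can then be omitted. The instruction list below is a made-up MIPS-like subset chosen only for the example.

```python
import re

# Hypothetical instruction list for a small MIPS-like soft processor;
# the real ISA and tool flow in the paper differ.
FULL_ISA = {
    "add", "addu", "sub", "subu", "and", "or", "xor", "nor", "slt",
    "sll", "srl", "sra", "lw", "sw", "lb", "sb", "beq", "bne", "j",
    "jal", "jr", "lui", "mult", "multu", "div", "divu",
}

def used_subset(disassembly: str) -> set:
    """Collect the distinct mnemonics appearing in a disassembly listing."""
    mnemonics = re.findall(r"^\s*\S+:\s+\S+\s+(\w+)", disassembly, re.MULTILINE)
    return {m for m in mnemonics if m in FULL_ISA}

listing = """
  400018: 00851021  addu  v0,a0,a1
  40001c: 8c430000  lw    v1,0(v0)
  400020: 1060fffc  beq   v1,zero,400014
"""
subset = used_subset(listing)
print(f"{len(subset)}/{len(FULL_ISA)} instructions used: {sorted(subset)}")
```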


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2007

Exploration and Customization of FPGA-Based Soft Processors

Peter Yiannacouras; J. G. Steffan; Jonathan Rose

As embedded systems designers increasingly use field-programmable gate arrays (FPGAs) while pursuing single-chip designs, they are motivated to have their designs also include soft processors, processors built using FPGA programmable logic. In this paper, we provide: 1) an exploration of the microarchitectural tradeoffs for soft processors and 2) a set of customization techniques that capitalizes on these tradeoffs to improve the efficiency of soft processors for specific applications. Using our infrastructure for automatically generating soft-processor implementations (which span a large area/speed design space while remaining competitive with Altera's Nios II variations), we quantify tradeoffs within soft-processor microarchitecture and explore the impact of tuning the microarchitecture to the application. In addition, we apply a technique of subsetting the instruction set to use only the portion utilized by the application. Through these two techniques, we can improve the performance-per-area of a soft processor for a specific application by an average of 25%.


Field Programmable Gate Arrays | 2007

An FPGA-based Pentium® in a complete desktop system

Shih-Lien L. Lu; Peter Yiannacouras; Rolf Kassa; Michael Konow; Taeweon Suh

Software simulation has been the predominant method for architects to evaluate microprocessor research proposals. There are three tenets in modeling new designs with software models: simulation speed, model accuracy, and model completeness. The increasing complexity of processors and the accelerating trend toward multiple processors on a chip are putting a burden on simulators to achieve all of the tenets mentioned, including accurately capturing OS effects. In this work we perform preliminary experimentation/prototyping with an emulation system which overcomes the tension in satisfying all three requirements. The system is an original Socket-7-based desktop processor system with typical hardware peripherals running modern operating systems such as Fedora Core 4 and Windows XP; however, we have inserted a Xilinx Virtex-4 in place of the processor that would normally sit in the motherboard and have used the Virtex-4 to host a complete version of the Pentium® microprocessor (which consumes less than half of its resources). We can therefore apply architectural changes to the processor and evaluate their effects on the complete desktop system. We use this FPGA-based emulation system to conduct preliminary architectural experiments including growing the branch target buffer and the level 1 caches. In addition, we experimented with interfacing hardware accelerators such as DES and AES engines, which resulted in 27x speedups.


Field-Programmable Technology | 2003

A parameterized automatic cache generator for FPGAs

Peter Yiannacouras; Jonathan Rose

Caches in FPGAs can improve the performance of soft processors and other applications beset by slow storage components. In this paper we present a cache generator which can produce caches with a variety of associativities, latencies, and dimensions. This tool allows system designers to effortlessly create and investigate different caches in order to better meet the needs of their target system. The effect of these three parameters on the area and speed of the caches is also examined, and we show that the designs can meet a wide range of specifications and are in general fast and compact.
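
As a sketch of what such parameterization involves (an illustration under simplifying assumptions, not the generator's actual code), the function below derives the set count and the address-field widths of a cache from its size, block size, and associativity, assuming power-of-two values and byte-addressed 32-bit addresses.

```python
from math import log2

def cache_geometry(size_bytes: int, block_bytes: int, associativity: int,
                   addr_bits: int = 32) -> dict:
    """Derive set count and tag/index/offset widths for a cache whose size,
    block size, and associativity are all powers of two (an assumption)."""
    num_blocks = size_bytes // block_bytes
    num_sets = num_blocks // associativity
    offset_bits = int(log2(block_bytes))
    index_bits = int(log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return {"sets": num_sets, "offset_bits": offset_bits,
            "index_bits": index_bits, "tag_bits": tag_bits}

# Example: a 4 KB, 2-way set-associative cache with 16-byte blocks.
print(cache_geometry(4096, 16, 2))
# -> {'sets': 128, 'offset_bits': 4, 'index_bits': 7, 'tag_bits': 21}
```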


Compilers, Architecture, and Synthesis for Embedded Systems | 2009

Fine-grain performance scaling of soft vector processors

Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

Embedded systems are often implemented on FPGA devices and 25% of the time include a soft processor--a processor built using the FPGA reprogrammable fabric. Because of their prevalence and flexibility, soft processors are compelling targets for customization--although current soft processors provide few architectural variations. Recent work has proposed augmenting soft processors with customizable vector processing support, enabling designers to easily scale performance by exploiting the data parallelism available in an application. However, this approach provides only coarse-grain scaling, by successively doubling the number of vector datapaths for less than double the performance. In this work we further augment soft vector processors with more fine-grain architectural modifications: we add support for (i) vector chaining and (ii) heterogeneous vector lanes, allowing the soft vector processor to be customized not only to the data-level parallelism available in an application, but also to its functional unit demand. We evaluate the area and wall-clock performance with full hardware implementations on state-of-the-art FPGAs and find that chaining can provide a 15-45% average performance improvement for less area than doubling the lanes, and that heterogeneous lanes can save 6-13% area with little or no performance loss in some cases. Finally, we implement 1200 soft vector processor variants and find that the peak performance per area compared to our base vector processor can be increased by an average of 13% and up to 34% when choosing the best variant per application.
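
A rough intuition for why chaining helps (a hypothetical first-order model, not a measurement from the paper): without chaining, a dependent vector instruction waits for its producer to finish every element, whereas with chaining it can begin as soon as the first element group is forwarded.

```python
import math

def dependent_pair_cycles(vl: int, lanes: int, chaining: bool) -> int:
    """Cycles for two dependent vector instructions of length `vl` on `lanes`
    lanes: fully serialized without chaining, overlapped by all but one
    element group with chaining (an idealized assumption)."""
    per_instr = math.ceil(vl / lanes)
    return per_instr + 1 if chaining else 2 * per_instr

for lanes in (4, 8, 16):
    print(f"{lanes:2d} lanes: no chaining {dependent_pair_cycles(64, lanes, False)}"
          f" cycles, chaining {dependent_pair_cycles(64, lanes, True)} cycles")
```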


Field Programmable Custom Computing Machines | 2008

Scaling Soft Processor Systems

Martin Labrecque; Peter Yiannacouras; J. G. Steffan

As FPGA-based systems that include soft processors become increasingly common, we are motivated to better understand the best way to scale the performance of such systems. In this paper we explore the organization of processors and caches connected to a single off-chip memory channel, for workloads composed of many independent threads. In particular, we design and evaluate real FPGA-based processor, multithreaded processor, and multiprocessor systems on EEMBC benchmarks, investigating different approaches to scaling caches, processors, and thread contexts to maximize throughput while minimizing area. Our main finding is that while a single multithreaded processor offers improved performance over a single-threaded processor, multiprocessors composed of single-threaded processors scale better than those composed of multithreaded processors.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2012

Portable, Flexible, and Scalable Soft Vector Processors

Peter Yiannacouras; J. G. Steffan; Jonathan Rose

Field-programmable gate arrays (FPGAs) are increasingly used to implement embedded digital systems; however, the hardware design necessary to do so is time-consuming and tedious. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented using the FPGA reprogrammable fabric as a soft processor; soft processors presently have simple architectures and moderate performance. Our goal is to scale the performance of existing soft processors, hence expanding their suitability to more critical computation. To this end we propose extending soft processors with vector extensions to exploit the abundant data parallelism found in many embedded kernels. Such a soft vector processor can execute these kernels much faster than a single core, hence reducing the need for hardware implementations. We observe this improved execution speed through experimentation with the Vector Extended Soft Processor Architecture (VESPA), which is designed, implemented, and evaluated on real FPGA hardware. VESPA is shown to effectively scale performance up to 32 lanes, while providing substantial architectural flexibility to create a fine-grained design space. With these characteristics, and portability across FPGA devices, soft vector processors can provide exact-fit architectures that implement data-parallel workloads efficiently and more easily than custom FPGA hardware design.


Field-Programmable Logic and Applications | 2009

Data parallel FPGA workloads: Software versus hardware

Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

Commercial soft processors are unable to effectively exploit the data parallelism present in many embedded systems workloads, requiring FPGA designers to exploit it (laboriously) with manual hardware design. Recent research [1, 2] has demonstrated that soft processors augmented with support for vector instructions provide significant improvements in performance and scalability for data-parallel workloads. These soft vector processors provide a software environment for quickly encoding data-parallel computation, but their competitiveness with manual hardware design in terms of area and performance remains unknown. In this work, using an FPGA platform equipped with DDR memory executing data-parallel EEMBC embedded benchmarks, we measure the area/performance gaps between (i) a scalar soft processor, (ii) our improved soft vector processor, and (iii) custom FPGA hardware. We demonstrate that the 432x wall-clock performance gap between scalar executed C and custom hardware can be reduced significantly to 17x using our improved soft vector processor, while silicon efficiency is improved by 3x in terms of area-delay product. We modified the architecture to mitigate three key advantages we observed in custom hardware: loop overhead, data delivery, and exact resource usage. Combined, these improvements increase performance by 3x and reduce area by almost half, significantly reducing the need for designers to resort to more challenging custom hardware implementations.
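
Of the three gaps named above, loop overhead is the easiest to sketch (a hypothetical model, not the paper's code): a scalar loop pays an index update and a branch on every element, while a vector instruction amortizes that bookkeeping over a whole strip of elements.

```python
import math

def scalar_dynamic_ops(n: int, work_per_elem: int = 1) -> int:
    """Scalar loop: per-element work plus an index update and a branch
    on every iteration."""
    return n * (work_per_elem + 2)

def vector_dynamic_ops(n: int, vl: int = 64, work_per_elem: int = 1) -> int:
    """Vectorized loop: the index update and branch are paid once per strip
    of `vl` elements rather than once per element (prologue/epilogue code
    is ignored -- an assumption)."""
    strips = math.ceil(n / vl)
    return strips * (work_per_elem + 2)

# For 4096 elements the vectorized loop issues far fewer bookkeeping operations.
print(scalar_dynamic_ops(4096), vector_dynamic_ops(4096))
```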

Collaboration


Dive into Peter Yiannacouras's collaborations.
