Publication


Featured research published by Aaron Severance.


field-programmable custom computing machines | 2012

VENICE: A Compact Vector Processor for FPGA Applications

Aaron Severance; Guy Lemieux

VENICE is a new soft vector processor (SVP) for FPGA applications that is designed for maximum throughput with a small number (1 to 4) of ALUs. By increasing clock speed and eliminating bottlenecks in ALU utilization, VENICE achieves over 2x better performance per logic block than VEGAS, the previous best SVP. VENICE is also simpler to program, as its instructions use standard C pointers into a scratchpad memory rather than vector registers.


field programmable gate arrays | 2011

VEGAS: soft vector processor with scratchpad memory

Christopher Han-Yu Chou; Aaron Severance; Alex D. Brant; Zhiduo Liu; Saurabh Sant; Guy Lemieux

This paper presents VEGAS, a new soft vector architecture, in which the vector processor reads and writes directly to a scratchpad memory instead of a vector register file. The scratchpad memory is a more efficient storage medium than a vector register file, allowing up to 9x more data elements to fit into on-chip memory. In addition, the use of fracturable ALUs in VEGAS allows efficient processing of bytes, halfwords, and words in the same processor instance, providing up to 4x the operations compared to existing fixed-width soft vector ALUs. Benchmarks show the new VEGAS architecture is 10x to 208x faster than Nios II and has 1.7x to 3.1x better area-delay product than previous vector work, achieving much higher throughput per unit area. To put this performance in perspective, VEGAS is faster than a leading-edge Intel processor at integer matrix multiply. To ease programming effort and provide full debug support, VEGAS uses a C macro API that outputs vector instructions as standard Nios II/f custom instructions.


international conference on hardware/software codesign and system synthesis | 2013

Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor

Aaron Severance; Guy Lemieux

Embedded systems frequently use FPGAs to perform highly parallel data processing tasks. However, building such a system usually requires specialized hardware design skills with VHDL or Verilog. Instead, this paper presents the VectorBlox MXP Matrix Processor, an FPGA-based soft processor capable of highly parallel execution. Programmed entirely in C, the MXP is capable of executing data-parallel software algorithms at hardware-like speeds. For example, the MXP running at 200MHz or higher can implement a multi-tap FIR filter and output 1 element per clock cycle. The MXP's parameterized design lets the user specify the amount of parallelism required, ranging from 1 to 128 or more parallel ALUs. Key features of the MXP include a parallel-access scratchpad memory to hold vector data and high-throughput DMA and scatter/gather engines. To provide extreme performance, the processor is expandable with custom vector instructions and custom DMA filters. Finally, the MXP seamlessly ties into existing Altera and Xilinx development flows, simplifying system creation and deployment.


field programmable gate arrays | 2014

Soft vector processors with streaming pipelines

Aaron Severance; Joe Edwards; Hossein Omidian; Guy Lemieux

Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of an SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than using a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive Intel Core i7 processor implementation.


field programmable gate arrays | 2012

Accelerator compiler for the VENICE vector processor

Zhiduo Liu; Aaron Severance; Satnam Singh; Guy Lemieux

This paper describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for Microsoft Accelerator, a high-level data parallel library for C++ and C#. This allows us to automatically compile high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to hand-written VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 3.5GHz Intel Xeon W3690 processor, beating it in performance on four of six benchmarks by up to 3.2x.


field-programmable technology | 2012

Pipeline frequency boosting: Hiding dual-ported block RAM latency using intentional clock skew

Alexander Brant; Ameer M. S. Abdelhadi; Aaron Severance; Guy Lemieux

FPGAs are increasingly being used to implement many new applications, including pipelined processor designs. Designers often employ memories to communicate and pass data between these pipeline stages. However, one-cycle communication between sender and receiver is often required. To implement this read-immediately-after-write functionality, bypass registers are needed by most FPGA memory blocks. Read and write latencies to these memories and the bypass can limit clock frequencies, or require extra resources to further pipeline the bypass. Instead of further pipelining the bypass, this paper applies clock skew scheduling to memory write and read ports of a simple bypass circuit. We show that the clock skew provides an improved Fmax without requiring the area overhead of the pipelined bypass. Many configurations of pipelined memory systems are implemented, and their speed and area compared to our design. Memory clock skew scheduling yields the best Fmax of all techniques which preserve functionality, an improvement of 56% over the baseline clock speed, and 14% over the best conventional design. Furthermore, the suggested technique consumes 46% fewer resources than the next best performing technique.


field programmable gate arrays | 2015

Wavefront Skipping using BRAMs for Conditional Algorithms on Vector Processors

Aaron Severance; Joe Edwards; Guy Lemieux

Soft vector processors can accelerate data parallel algorithms on FPGAs while retaining software programmability. To handle divergent control flow, vector processors typically use mask registers and predicated instructions. These work by executing all branches and finally selecting the correct one. Our work improves FPGA-based vector processors by adding wavefront skipping, where wavefronts that are completely masked off are skipped. This accelerates conditional algorithms, and is particularly useful where elements terminate early if simple tests fail but require extensive processing in the worst case. The difference in logic speed and RAM area for FPGA-based circuits versus ASICs led us to a different implementation than used in fixed vector processors, storing wavefront offsets in on-chip BRAM rather than computing wavefronts skipped dynamically. Additionally, we allow for partitioning the wavefronts so that partial wavefronts can skip independently of one another. We show that <5% extra area can give up to 3.2× better performance on conditional algorithms. Partial wavefront skipping may not be generally useful enough to be added to a fixed vector processor; it provides up to 65% more performance for up to 27% more area. In an FPGA, however, the designer can use it to make application specific tradeoffs between area and performance.


field-programmable logic and applications | 2013

TputCache: High-frequency, multi-way cache for high-throughput FPGA applications

Aaron Severance; Guy Lemieux

Throughput processing involves using many different contexts or threads to solve multiple problems or subproblems in parallel, where the size of the problem is large enough that latency can be tolerated. Bandwidth is required to support multiple concurrent executions, however, and utilizing multiple external memory channels is costly. For small working sets, FPGA designers can use on-chip BRAMs to achieve the necessary bandwidth without increasing the system cost. Designing algorithms around fixed-size local memories is difficult, however, as there is no graceful fallback if the problem size exceeds the amount of local memory. This paper introduces TputCache, a cache designed to meet the needs of throughput processing on FPGAs, giving the throughput performance of on-chip BRAMs when the problem size fits in local memory. The design utilizes a replay based architecture to achieve high frequency with very low resource overheads.


Archive | 2015

Broadening the applicability of FPGA-based soft vector processors

Aaron Severance

A soft vector processor (SVP) is an overlay on top of FPGAs that allows data-parallel algorithms to be written in software rather than hardware, and yet still achieve hardware-like performance. This ease of use comes at an area and speed penalty, however. Also, since the internal designs of SVPs are based largely on custom CMOS vector processors, there is additional opportunity for FPGA-specific optimizations and enhancements. This thesis investigates and measures the effects of FPGA-specific changes to SVPs that improve performance, reduce area, and improve ease-of-use; thereby expanding their useful range of applications. First, we address applications needing only moderate performance such as audio filtering, where SVPs need only a small number (one to four) of parallel ALUs. We make implementation and ISA design decisions around the goals of producing a compact SVP that effectively utilizes existing BRAMs and DSP Blocks. The resulting VENICE SVP has 2× better performance per logic block than previous designs. Next, we address performance issues with algorithms where some vector elements 'exit early' while others need additional processing. Simple vector predication causes all elements to exhibit 'worst case' performance. Density-time masking (DTM) improves performance of such algorithms by skipping the completed elements when possible, but traditional implementations of DTM are coarse-grained and do not map well to the FPGA fabric. We introduce a BRAM-based implementation that achieves 3.2× higher performance over the base SVP with less than 5% area overhead. Finally, we identify a way to exploit the raw performance of the underlying FPGA fabric by attaching wide, deeply pipelined computational units to SVPs.


ieee hot chips symposium | 2011

VENICE: A compact vector processor for FPGA applications

Aaron Severance; Guy Lemieux

This article consists of a collection of slides from the authors' conference presentation on VENICE (Vector Extensions to NIOS Implemented Compactly and Elegantly), an SVP (soft vector processor) intended to accelerate computationally intensive applications implemented on an FPGA. SVPs are exclusively for FPGAs, targeted at the productivity gap between writing custom hardware in an HDL and writing software for a soft processor in FPGA-based applications. They provide the convenience of software programming and software compile times, and yet they can achieve over 200x speedup compared to a scalar soft processor.

Collaboration


Dive into Aaron Severance's collaborations.

Top Co-Authors

Guy Lemieux (University of British Columbia)
Joe Edwards (University of British Columbia)
Zhiduo Liu (University of British Columbia)
Alex D. Brant (University of British Columbia)
Alexander Brant (University of British Columbia)
Ameer M. S. Abdelhadi (University of British Columbia)
Christopher Han-Yu Chou (University of British Columbia)
Hossein Omidian (University of British Columbia)
Saurabh Sant (University of British Columbia)