Andrew Canis | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Andrew Canis is active.

Explore More

Publication

Featured researches published by Andrew Canis.

field programmable gate arrays | 2011

LegUp: high-level synthesis for FPGA-based processor/accelerator systems

Andrew Canis; Jongsok Choi; Mark Aldham; Victor Zhang; Ahmed Kammoona; Jason Helge Anderson; Stephen Dean Brown; Tomasz S. Czajkowski

In this paper, we introduce a new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool.

ACM Transactions in Embedded Computing Systems | 2013

LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems

Andrew Canis; Jongsok Choi; Mark Aldham; Victor Zhang; Ahmed Kammoona; Tomasz S. Czajkowski; Stephen Dean Brown; Jason Helge Anderson

It is generally accepted that a custom hardware implementation of a set of computations will provide superior speed and energy efficiency relative to a software implementation. However, the cost and difficulty of hardware design is often prohibitive, and consequently, a software approach is used for most applications. In this article, we introduce a new high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. In the hybrid processor/accelerator architecture, program segments that are unsuitable for hardware implementation can execute in software on the processor. LegUp can synthesize most of the C language to hardware, including fixed-sized multidimensional arrays, structs, global variables, and pointer arithmetic. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool. We also give results demonstrating the ability of the tool to explore the hardware/software codesign space by varying the amount of a program that runs in software versus hardware. LegUp, along with a set of benchmark C programs, is open source and freely downloadable, providing a powerful platform that can be leveraged for new research on a wide range of high-level synthesis topics.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016

A Survey and Evaluation of FPGA High-Level Synthesis Tools

Razvan Nane; Vlad Mihai Sima; Christian Pilato; Jongsok Choi; Blair Fort; Andrew Canis; Yu Ting Chen; Hsuan Hsiao; Stephen Dean Brown; Fabrizio Ferrandi; Jason Helge Anderson; Koen Bertels

High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing todays system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.

field-programmable custom computing machines | 2012

Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

Jongsok Choi; Kevin Nam; Andrew Canis; Jason Helge Anderson; Stephen Dean Brown; Tomasz S. Czajkowski

We describe new multi-ported cache designs suitable for use in FPGA-based processor/parallel-accelerator systems, and evaluate their impact on application performance and area. The baseline system comprises a MIPS soft processor and custom hardware accelerators with a shared memory architecture: on-FPGA L1 cache backed by off-chip DDR2 SDRAM. Within this general system model, we evaluate traditional cache design parameters (cache size, line size, associativity). In the parallel accelerator context, we examine the impact of the cache design and its interface. Specifically, we look at how the number of cache ports affects performance when multiple hardware accelerators operate (and access memory) in parallel, and evaluate two different hardware implementations of multi-ported caches using: 1) multi-pumping, and 2) a recently-published approach based on the concept of a live-value table. Results show that application performance depends strongly on the cache interface and architecture: for a system with 6 accelerators, depending on the cache design, speed up swings from 0.73× to 6.14×, on average, relative to a baseline sequential system (with a single accelerator and a direct-mapped, 2KB cache with 32B lines). Considering both performance and area, the best architecture is found to be a 4-port multi-pump direct-mapped cache with a 16KB cache size and a 128B line size.

field programmable gate arrays | 2012

Impact of FPGA architecture on resource sharing in high-level synthesis

Stefan Hadjis; Andrew Canis; Jason Helge Anderson; Jongsok Choi; Kevin Nam; Stephen Dean Brown; Tomasz S. Czajkowski

Resource sharing is a key area-reduction approach in high-level synthesis (HLS) in which a single hardware functional unit is used to implement multiple operations in the high-level circuit specification. We show that the utility of sharing depends on the underlying FPGA logic element architecture and that different sharing trade-offs exist when 4-LUTs vs. 6-LUTs are used. We further show that certain multi-operator patterns occur multiple times in programs, creating additional opportunities for sharing larger composite functional units comprised of patterns of interconnected operators. A sharing cost/benefit analysis is used to inform decisions made in the binding phase of an HLS tool, whose RTL output is targeted to Altera commercial FPGA families: Stratix IV (dual-output 6-LUTs) and Cyclone II (4-LUTs).

field programmable logic and applications | 2014

Modulo SDC scheduling with recurrence minimization in high-level synthesis

Andrew Canis; Stephen Dean Brown; Jason Helge Anderson

Loop pipelining is a high-level synthesis scheduling technique that overlaps loop iterations to achieve higher performance. However, industrial designs often have resource constraints and other constraints imposed by cross-iteration dependencies. The interaction between multiple constraints can pose a challenge for HLS modulo scheduling algorithms, which, if not handled properly can lead to a loop pipeline schedule that fails to achieve the minimum possible initiation interval. We propose a novel modulo scheduler based on an SDC formulation that includes a backtracking mechanism to properly handle multiple scheduling constraints and still achieve the minimum possible initiation interval. The SDC formulation has the advantage of being a mathematical framework that supports flexible constraints that are useful for more complex loop pipelines. Furthermore, we describe how to specifically apply associative expression transformations during modulo scheduling to restructure recurrences in complex loops to enable better scheduling. We compared our techniques to existing prior work in modulo scheduling in HLS and also compared against a state-of-art commercial tool. Over a suite of benchmarks, we show that our scheduler and proposed optimizations can result in a geomean wall-clock time reduction of 32% versus prior work and 29% versus a commercial tool.

field-programmable custom computing machines | 2013

The Effect of Compiler Optimizations on High-Level Synthesis for FPGAs

Qijing Huang; Ruolong Lian; Andrew Canis; Jongsok Choi; Ryan Xi; Stephen Dean Brown; Jason Helge Anderson

We consider the impact of compiler optimizations on the quality of high-level synthesis (HLS)-generated FPGA hardware. Using a HLS tool implemented within the state-of-the-art LLVM [1] compiler, we study the effect of compiler optimizations on the hardware metrics of circuit area, execution cycles, Fmax, and wall-clock time. We evaluate 56 different compiler optimizations implemented within LLVM and show that some optimizations significantly affect hardware quality. Moreover, we show that hardware quality is also affected by the order in which optimizations are applied. We then present a new HLS-directed approach to compiler optimizations, wherein we execute partial HLS and profiling at intermittent points in the optimization process and use the results to judiciously undo the impact of optimization passes predicted to be damaging to the generated hardware quality. Results show that our approach produces circuits with 16% better speed performance, on average, versus using the standard -O3 optimization level.

design, automation, and test in europe | 2013

Multi-pumping for resource reduction in FPGA high-level synthesis

Andrew Canis; Jason Helge Anderson; Stephen Dean Brown

Resource sharing is a classic high-level synthesis (HLS) optimization that saves area by mapping multiple operations to a single functional unit. With resource sharing, only operations scheduled in separate cycles can be assigned to shared hardware, which can result in longer schedules. In this paper, we propose a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle. Our approach is particularly effective for DSP blocks on an FPGA, which are used to perform multiply and/or accumulate operations. Our results show that resource sharing using multi-pumping is comparable to traditional resource sharing in terms of area saved, but provides significant performance advantages. Specifically, when targeting a 50% reduction in DSP blocks, traditional resource sharing decreases circuit speed performance by 80%, on average, whereas multi-pumping decreases circuit speed by just 5%. Multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact to circuit performance.

compilers architecture and synthesis for embedded systems | 2013

From software to accelerators with LegUp high-level synthesis

Andrew Canis; Jongsok Choi; Blair Fort; Ruolong Lian; Qijing Huang; Nazanin Calagar; Marcel Gort; Jia Jun Qin; Mark Aldham; Tomasz S. Czajkowski; Stephen Dean Brown; Jason Helge Anderson

Embedded system designers can achieve energy and performance benefits by using dedicated hardware accelerators. However, implementing custom hardware accelerators for an application can be difficult and time intensive. LegUp is an open-source high-level synthesis framework that simplifies the hardware accelerator design process [8]. With LegUp, a designer can start from an embedded application running on a processor and incrementally migrate portions of the program to hardware accelerators implemented on an FPGA. The final application then executes on an automatically-generated software/hardware coprocessor system. This paper presents on overview of the LegUp design methodology and system architecture, and discusses ongoing work on profiling, hardware/software partitioning, hardware accelerator quality improvements, Pthreads/OpenMP support, visualization tools, and debugging support.

application specific systems architectures and processors | 2011

Low-cost hardware profiling of run-time and energy in FPGA embedded processors

Mark Aldham; Jason Helge Anderson; Stephen Dean Brown; Andrew Canis

This paper introduces a low-overhead hardware profiling architecture, called LEAP, that attains real-time cycle and energy profiles of an FPGA-based soft processor. A novel technique is used to associate profiling data with specific functions in a way that is area- and power-efficient. Results show that relative to a previously-published hardware profiler, our design uses up to 18× less area and 8.6× less energy. LEAP is designed to be extensible for a variety of profiling tasks, three of which are investigated in this paper. We also demonstrate the utility of LEAP in the context of hardware/software co-design of processor/accelerator FPGA-based systems.

Explore More