Paul R. Schumacher
Xilinx
Publication
Featured research published by Paul R. Schumacher.
field-programmable custom computing machines | 2004
Phil James-Roxby; Paul R. Schumacher; Charlie Ross
The prevalence of software reference code motivates investigation into efficient implementations of software architectures on field-programmable devices. Modern FPGAs allow designers to generate multi-processor architectures that exactly match the processing needs of an algorithm. This paper describes an architecture supporting the single-program, multiple-data (SPMD) model of parallel processing, and presents results from a parallel implementation of the JPEG2000 encoding algorithm and of Mandelbrot set generation.
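As a rough illustration of the SPMD model the abstract mentions (sizes, partitioning, and all names are invented for this sketch, not taken from the paper): every worker runs the same escape-time program, each on its own interleaved slice of image rows.

```python
def escape_time(cx, cy, max_iter=50):
    """Iterations until |z| exceeds 2 for z -> z^2 + c (Mandelbrot test)."""
    zx = zy = 0.0
    for n in range(max_iter):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return n
    return max_iter

def render_rows(rows, width=64, height=48):
    """The single program: compute only the rows assigned to this worker."""
    return [[escape_time(-2.0 + 3.0 * x / width, -1.2 + 2.4 * y / height)
             for x in range(width)]
            for y in rows]

# Static partitioning: worker p takes every 4th row (the "multiple data").
partitions = [range(p, 48, 4) for p in range(4)]
image_parts = [render_rows(rows) for rows in partitions]
print(sum(len(part) for part in image_parts))  # 48 rows total
```

In a multi-processor FPGA realization each partition would map to its own processor, but the program each one executes is identical.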
field-programmable logic and applications | 2008
Paul R. Schumacher; Pradip K. Jha
FPGAs have become complex, heterogeneous platforms targeting a multitude of different applications. How a design maps onto these devices and consumes the various FPGA resources is difficult to predict, so designers are typically forced to run full synthesis on each iteration of the design. For complex designs that involve many iterations and optimizations, the run-time of synthesis can be quite prohibitive. In this paper, we describe a fast and accurate method of estimating the FPGA resources of any RTL-based design. We achieve run-times more than 60 times faster than synthesis, with estimates on average within 22% of the actual mapped slices across a large benchmark suite targeting three different FPGA families. This resource estimator tool was first provided in Xilinx PlanAhead 10.1.
international conference on image processing | 2005
Paul R. Schumacher; K. Denolf; A. Chirila-Rus; Robert D. Turney; N. Fedele; Kees A. Vissers; J. Bormans
Increasing resolutions push the throughput requirements of video codecs and complicate the challenges encountered in their cost-efficient implementation. We propose an FPGA implementation of a high-performance MPEG-4 Simple Profile video decoder, capable of parsing multiple bitstreams from different encoder sources. Its video pipeline architecture exploits the inherent functional parallelism and enables multi-stream support at a limited FPGA resource cost compared to a single-stream version. The design is scalable through a number of compile-time parameters - including maximum frame size and number of input bitstreams - which can be set by the user to suit their application.
IEEE Transactions on Consumer Electronics | 2003
Paul R. Schumacher
The recently approved digital still image standard known as JPEG2000 promises to be an excellent image and video format for a large range of applications. For the standard to be adopted in the consumer marketplace, implementations supporting real-time encoding and decoding of popular image and video formats must be created. The major bottleneck of a JPEG2000 system is well known to be the bit/context modeling and arithmetic coding tasks. This paper discusses a hardware implementation of a tier-1 coder that exploits the available parallelism. The proposed technique is approximately 50% faster than the best previously described in the literature.
power and timing modeling optimization and simulation | 2003
Massimo Ravasi; Marco Mattavelli; Paul R. Schumacher; Robert D. Turney
The increasing complexity of processing algorithms has led to the need for ever more intensive specification and validation by means of software implementations. As the complexity grows, an intuitive understanding of the specific processing needs becomes harder, so architectural implementation choices, and choices between different possible software/hardware partitionings, become extremely difficult tasks. Automatic tools for complexity analysis at a high abstraction level are nowadays a fundamental need. This paper describes a new automatic tool for high-level algorithmic complexity analysis, the Software Instrumentation Tool (SIT), and presents results concerning the complexity analysis and design space exploration for the implementation of a JPEG2000 encoder using a hardware/software co-design methodology on a Xilinx Virtex-II™ platform FPGA. The analysis and design process for the implementation of a video surveillance application example is described.
Optical Science and Technology, SPIE's 48th Annual Meeting | 2003
Paul R. Schumacher; Mark Paluszkiewicz; Rick Ballantyne; Robert D. Turney
While the recent JPEG2000 standard specifies only the bitstream and file formats to ensure interoperability, it leaves the actual implementation up to the designer. As with many DSP applications, the designer has a number of implementation platform options. This paper gives a complexity analysis of an implementation of a JPEG2000 encoder using a hardware/software co-design methodology on a Xilinx Virtex-II™ platform FPGA. Central to the performance of the encoder is a high-throughput tier-1 entropy coder. This paper describes the encoder design, targeted at video surveillance applications, and compares and contrasts it with two other implementation options.
Eurasip Journal on Embedded Systems | 2007
Kristof Denolf; Adrian Chirila-Rus; Paul R. Schumacher; Robert D. Turney; Kees A. Vissers; Diederik Verkest; Henk Corporaal
The higher resolutions and new functionality of video applications increase their throughput and processing requirements. In contrast, the energy and heat limitations of mobile devices demand low-power video cores. We propose a memory and communication centric design methodology to reach an energy-efficient dedicated implementation. First, memory optimizations are combined with algorithmic tuning. Then, a partitioning exploration introduces parallelism using a cyclo-static dataflow model that also expresses implementation-specific aspects of communication channels. Towards hardware, these channels are implemented as a restricted set of communication primitives. They enable an automated RTL development strategy for rigorous functional verification. The FPGA/ASIC design of an MPEG-4 Simple Profile video codec demonstrates the methodology. The video pipeline exploits the inherent functional parallelism of the codec and contains a tailored memory hierarchy with burst accesses to external memory. 4CIF encoding at 30 fps consumes 71 mW in a 180 nm, 1.62 V UMC technology.
rapid system prototyping | 2005
Adrian Chirila-Rus; Kristof Denolf; Bart Vanhoof; Paul R. Schumacher; Kees A. Vissers
Dedicated hardware realizations of new multimedia applications support high throughput in a cost-efficient way. Their design requires a correct translation of the high-level system definition into the final implementation at the RTL level. We propose a general systematic development and test methodology and apply it in the context of complex video codecs. The approach is based on a fixed set of communication primitives and uses a high-level functional C model as golden specification of the complete system throughout the design. A clear separation between I/O and computing allows the isolation of a single functional component. This module is developed individually and its functional correctness can be verified separately by extracting its input stimuli and expected output from the golden specification. The combination of RTL simulation and emulation on a prototyping platform enables exhaustive testing of the separate modules to assure functional correctness. In this way, the debug cycle during system integration is minimized.
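The golden-model flow the abstract describes can be sketched in miniature (the quantizer module, its names, and the shift-based "RTL" variant are all invented for illustration): a high-level functional model supplies the expected outputs, against which an isolated module implementation is checked.

```python
def golden_quantize(coeffs, q):
    """Stand-in for the high-level C model: reference behaviour of one module."""
    return [c // q for c in coeffs]

def rtl_quantize(coeffs, q):
    """Stand-in for the module under test (shift-based divide).

    Only matches the golden model when q is a power of two - exactly the
    kind of restriction this verification style is meant to catch.
    """
    shift = q.bit_length() - 1
    return [c >> shift for c in coeffs]

def verify(stimuli, q):
    """Compare the isolated module against stimuli/expected pairs
    extracted from the golden specification; return mismatch indices."""
    expected = golden_quantize(stimuli, q)
    actual = rtl_quantize(stimuli, q)
    return [i for i, (e, a) in enumerate(zip(expected, actual)) if e != a]

print(verify([16, 33, 7, 120], 8))  # → [] (module matches the model)
```

Because each module is exercised in isolation against the same golden reference, a failure during system integration points to the interconnect rather than the computation, which is what keeps the debug cycle short.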
field-programmable technology | 2015
Jasmina Vasiljevic; Ralph D. Wittig; Paul R. Schumacher; Jeff Fifield; Fernando Martinez Vallina; Henry E. Styles; Paul Chow
In recent years, high-level languages and compilers, such as OpenCL, have improved both productivity and FPGA adoption on a wider scale. One of the challenges in the design of high-performance stream FPGA applications is iterative manual optimization of the numerous application buffers (e.g., arrays, FIFOs and scratch-pads). First, to achieve the desired throughput, the programmer faces the burden of analyzing the memory accesses of each application buffer, and based on observed data locality determines the optimal on-chip buffering and off-chip read/write data access strategy. Second, to minimize throughput bottlenecks, the programmer has to carefully partition the limited on-chip memory resources among many application buffers. In this work, we present an FPGA OpenCL library of pre-optimized stream memory components (SMCs). The library contains three types of SMCs, which implement frequently applied data transformations: 1) stencil, 2) transpose and 3) tiling. The library generates SMCs that are optimized both for the specific data transformation they perform and for the user-specified data set size. Further, to ease the partitioning of on-chip memory resources among many application memories, the library automatically maps application buffers to on-chip and off-chip memory resources. This is achieved by enabling the programmer to specify an on-chip memory budget for each component. In terms of on-chip memory, the SMCs perform data buffering to exploit data locality and maximize reuse. In terms of off-chip memory accesses, the SMCs optimize read/write memory operations by performing data coalescing, bursting and prefetching. We show that using the SMC library, the programmer can quickly generate scalable, pre-optimized stream application memory components, thus reaching throughput targets without time-consuming manual memory optimization.
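To make the "tiling" transformation concrete (the function, buffer, and tile sizes below are invented for this sketch, not the library's API): a row-major buffer is reordered so that each T×T tile becomes contiguous, letting a tile be fetched from off-chip memory as one burst and reused on chip.

```python
def tile_order(buf, width, height, t):
    """Emit a row-major buffer tile-by-tile (row-major within each tile).

    Assumes width and height are multiples of the tile size t.
    """
    out = []
    for ty in range(0, height, t):          # walk tiles top to bottom
        for tx in range(0, width, t):       # ...and left to right
            for y in range(ty, ty + t):     # rows inside one tile
                out.extend(buf[y * width + tx : y * width + tx + t])
    return out

# 4x4 buffer with 2x2 tiles: each tile's four elements become contiguous.
buf = list(range(16))
print(tile_order(buf, 4, 4, 2))
```

In the stream setting, the same reordering determines which addresses a component prefetches next, which is why coalescing and bursting fall out of the tiling decision rather than being separate optimizations.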
field-programmable logic and applications | 2011
Paul R. Schumacher; Pradip K. Jha; Sudha Kuntur; Tim Burke; Alan M. Frost
This paper presents a fast method of performing RTL power estimation. A context-based, activity propagation engine is used to analyze specific structures identified in the RTL. This estimator was integrated into an FPGA tool flow to provide near-instant feedback on expected power dissipation. To fully validate our methodology, a large benchmark suite of designs was used to target three different FPGA families. Our results were compared against a commercial gate-level power estimator. Results show a high level of accuracy (total power average error within 8.1% of a post-route analysis) while achieving a median run-time of 1.84 seconds, more than 1000 times faster than a complete place and route flow.
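The paper's engine is context-based and specific to RTL structures; as a rough illustration of the general activity-propagation idea only (the gates, probabilities, and capacitances below are all invented), a classic static-probability model pushes signal probabilities through a gate network and sums switched-capacitance power per net.

```python
def switching(p):
    """Transition probability of a net with static '1' probability p
    (zero-delay model, temporally independent samples)."""
    return 2.0 * p * (1.0 - p)

def propagate(netlist, input_probs):
    """Push static probabilities through 2-input AND/OR and NOT gates,
    assuming the inputs of each gate are independent."""
    probs = dict(input_probs)
    for out, gate, ins in netlist:
        ps = [probs[i] for i in ins]
        if gate == "AND":
            probs[out] = ps[0] * ps[1]
        elif gate == "OR":
            probs[out] = ps[0] + ps[1] - ps[0] * ps[1]
        elif gate == "NOT":
            probs[out] = 1.0 - ps[0]
    return probs

def dynamic_power(probs, cap, vdd, freq):
    """Sum alpha * C * V^2 * f over the nets with a known capacitance."""
    return sum(switching(p) * cap[n] * vdd ** 2 * freq
               for n, p in probs.items() if n in cap)

netlist = [("n1", "AND", ["a", "b"]), ("y", "OR", ["n1", "c"])]
probs = propagate(netlist, {"a": 0.5, "b": 0.5, "c": 0.5})
power = dynamic_power(probs, {"n1": 2.0, "y": 3.0}, vdd=1.0, freq=100.0)
```

The speed advantage of estimators in this family comes from the fact that one pass over the netlist replaces a full place-and-route plus gate-level simulation.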