Peter Y. K. Cheung | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Peter Y. K. Cheung is active.

Explore More

Publication

Featured researches published by Peter Y. K. Cheung.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2003

Wordlength optimization for linear digital signal processing

George A. Constantinides; Peter Y. K. Cheung; Wayne Luk

This paper presents an approach to the wordlength allocation and optimization problem for linear digital signal processing systems implemented as custom parallel processing units. Two techniques are proposed, one which guarantees an optimum set of wordlengths for each internal variable, and one which is a heuristic approach. Both techniques allow the user to tradeoff implementation area for arithmetic error at system outputs. Optimality (with respect to the area and error estimates) is guaranteed through modeling as a mixed integer linear program. It is demonstrated that the proposed heuristic leads to area improvements of 6% to 45% combined with speed increases compared to the optimum uniform wordlength design. In addition, the heuristic reaches within 0.7% of the optimum multiple wordlength area over a range of benchmark problems.

Design Automation for Embedded Systems | 2002

Comparing Three Heuristic Search Methods for Functional Partitioning in Hardware–Software Codesign

Theerayod Wiangtong; Peter Y. K. Cheung; Wayne Luk

This paper compares three heuristic search algorithms: genetic algorithm (GA), simulated annealing (SA) and tabu search (TS), for hardware–software partitioning. The algorithms operate on functional blocks for designs represented as directed acyclic graphs, with the objective of minimising processing time under various hardware area constraints. Thecomparison involves a model for calculating processing time based on a non-increasing first-fit algorithm to schedule tasks, given that shared resource conflicts do not occur. The results show that TS is superior to SA and GA in terms of both search time and quality of solutions. In addition, we have implemented an intensification strategy in TS called penalty reward, which can further improve the quality of results.

field-programmable technology | 2006

Within-die delay variability in 90nm FPGAs and beyond

N. Pete Sedcole; Peter Y. K. Cheung

Semiconductor scaling causes increasing and unavoidable within-die parametric variability. This paper describes accurate measurement techniques for characterising both systematic and stochastic delay variability in FPGAs. Results and analysis are presented from measurements made on a sample of 90nm devices, showing that delay per logic element varies stochastically by plusmn3.54% on average over the set. The delay also varies by up to 3.66% across a single die from correlated sources of variability. The results are extrapolated to determine the impact at future technology nodes. The predicted significant performance degradation that variability will cause demonstrates the importance of new circuit or system design techniques to cope with variations in future FPGAs

field programmable custom computing machines | 1997

Compilation tools for run-time reconfigurable designs

Wayne Luk; Nabeel Shirazi; Peter Y. K. Cheung

This paper describes a framework and tools for automating the production of designs which can be partially reconfigured at run time. The tools include: a partial evaluator, which produces configuration files for a given design, where the number of configurations can be minimised by a process, known as compile-time sequencing; an incremental configuration calculator, which takes the output of the partial evaluator and generates an initial configuration file and incremental configuration files that partially update preceding configurations; and a tool which further optimises designs for FPGAs supporting simultaneous configuration of multiple cells. While many of our techniques are independent of the design language and device used, our tools currently target Xilinx 6200 devices. Simultaneous configuration, for example, can be used to reduce the time for reconfiguring an adder to a subtractor from time linear with respect to its size to constant time at best and logarithmic time at worst.

IEEE Transactions on Computers | 2004

A Gaussian noise generator for hardware-based simulations

Dong-U Lee; Wayne Luk; John D. Villasenor; Peter Y. K. Cheung

Hardware simulation offers the potential of improving code evaluation speed by orders of magnitude over workstation or PC-based simulation. We describe a hardware-based Gaussian noise generator used as a key component in a hardware simulation system, for exploring channel code behavior at very low bit error rates (BERs) in the range of 10/sup -9/ to 10/sup -10/. The main novelty is the design and use of nonuniform piecewise linear approximations in computing trigonometric and logarithmic functions. The parameters of the approximation are chosen carefully to enable rapid computation of coefficients from the inputs while still retaining high fidelity to the modeled functions. The output of the noise generator accurately models a true Gaussian Probability Density Function (PDF) even at very high /spl sigma/ values. Its properties are explored using: 1) several different statistical tests, including the chi-square test and the Anderson-Darling test, and 2) an application for decoding of low-density parity-check (LDPC) codes. An implementation at 133 MHz on a Xilinx Virtex-II XC2V4000-6 FPGA produces 133 million samples per second, which is seven times faster than a 2.6 GHz Pentium-IV PC; another implementation on a Xilinx Spartan-IIE XC2S300E-7 FPGA at 62 MHz is capable of a three times speedup. The performance can be improved by exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of the noise generator at 105 MHz can run 50 times faster than a 2.6 GHz Pentium-IV PC. We illustrate the deterioration of clock speed with the increase in the number of instances.

field-programmable custom computing machines | 2007

Enhancing Relocatability of Partial Bitstreams for Run-Time Reconfiguration

Tobias Becker; Wayne Luk; Peter Y. K. Cheung

This paper introduces a method that enhances the relocatability of partial bitstreams for FPGA run-time reconfiguration. Reconfigurable applications usually employ partial bitstreams which are specific to one target region on the FPGA. Previously, techniques have been proposed that allow relocation between identical regions on the FPGA. However, as FPGAs are becoming increasingly heterogeneous, this approach is often too restrictive. We introduce a method that circumvents the problem of having to find fully identical regions based on compatible subsets of resources, enabling flexible placement of relocatable modules. In a software defined radio prototype with two reconfigurable regions, the number of partial bitstreams is reduced by 50% and the compile time is shortened by 43%.

field-programmable technology | 2002

Floating-point bitwidth analysis via automatic differentiation

Altaf Abdul Gaffar; Oskar Mencer; Wayne Luk; Peter Y. K. Cheung; Nabeel Shirazi

Automatic bitwidth analysis is a key ingredient for highlevel programming of FPGAs and high-level synthesis of VLSI circuits. The objective is to find the minimal number of bits to represent a value in order to minimise the circuit area and to improve efficiency of the respective arithmetic operations, while satisfying user-defined numerical constraints. We present a novel approach to bitwidth- or precision-analysis for floating-point designs. The approach involves analysing the dataflow graph representation of a design to see how sensitive the output of a node is to changes in the outputs of other nodes: higher sensitivity requires higher precision and hence more output bits. We automate such sensitivity analysis by a mathematical method called automatic differentiation, which involves differentiating variables in a design with respect to other variables. We illustrate our approach by optimising the bitwidth for two examples, a discrete Fourier transform (DFT) implementation and a Finite Impulse Response (FIR) filter implementation.

IEEE Transactions on Very Large Scale Integration Systems | 2005

Customizable elliptic curve cryptosystems

Ray C. C. Cheung; Nicolas Telle; Wayne Luk; Peter Y. K. Cheung

This paper presents a method for producing hardware designs for elliptic curve cryptography (ECC) systems over the finite field GF(2/sup m/), using the optimal normal basis for the representation of numbers. Our field multiplier design is based on a parallel architecture containing multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers can obtain implementations with different tradeoffs in speed, size and level of security. A design generator has been developed which can automatically produce a customised ECC hardware design that meets user-defined requirements. To facilitate performance characterization, we have developed a parametric model for estimating the number of cycles for our generic ECC architecture. The resulting hardware implementations are among the fastest reported: for a key size of 270 bits, a point multiplication in a Xilinx XC2V6000 FPGA at 35 MHz can run over 1000 times faster than a software implementation on a Xeon computer at 2.6 GHz.

IEEE Computer | 2000

Video image processing with the Sonic architecture

Simon D. Haynes; John Stone; Peter Y. K. Cheung; Wayne Luk

Current industrial video-processing systems use a mixture of high-performance workstations and application-specific integrated circuits. However, video image processing in the professional broadcast environment requires more computational power and data throughput than most of todays general-purpose computers can provide. In addition, using ASICs for video image processing is both inflexible and expensive. Configurable computing offers an appropriate alternative for broadcast video image editing and manipulation by combining the flexibility, programmability, and economy of general-purpose processors with the performance of dedicated ASICs. Sonic is a configurable computing system that performs real-time video image processing. The authors describe how it implements algorithms for two-dimensional linear transforms, fractal image generation, filters, and other video effects. Sonics flexible and scalable architecture contains configurable processing elements that accelerate software applications and support the use of plug-in software.

IEEE Transactions on Computers | 2010

Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study

Ben Cope; Peter Y. K. Cheung; Wayne Luk; Lee W. Howes

A systematic approach to the comparison of the graphics processor (GPU) and reconfigurable logic is defined in terms of three throughput drivers. The approach is applied to five case study algorithms, characterized by their arithmetic complexity, memory access requirements, and data dependence, and two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 field programmable gate array (FPGA). Two orders of magnitude speedup, over a general-purpose processor, is observed for each device for arithmetic intensive algorithms. An FPGA is superior, over a GPU, for algorithms requiring large numbers of regular memory accesses, while the GPU is superior for algorithms with variable data reuse. In the presence of data dependence, the implementation of a customized data path in an FPGA exceeds GPU performance by up to eight times. The trends of the analysis to newer and future technologies are analyzed.

Explore More