Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Lee W. Howes is active.

Publication


Featured research published by Lee W. Howes.


Field Programmable Gate Arrays | 2009

A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation

David B. Thomas; Lee W. Howes; Wayne Luk

The future of high-performance computing is likely to rely on the ability to efficiently exploit huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate problems as embarrassingly parallel Monte-Carlo simulations, which allow applications to achieve a linear speedup over multiple computational nodes, without requiring a super-linear increase in inter-node communication. However, such applications are reliant on a cheap supply of high quality random numbers, particularly for the three main maximum entropy distributions: uniform, used as a general source of randomness; Gaussian, for discrete-time simulations; and exponential, for discrete-event simulations. In this paper we look at four different types of platform: conventional multi-core CPUs (Intel Core2); GPUs (NVidia GTX 200); FPGAs (Xilinx Virtex-5); and Massively Parallel Processor Arrays (Ambric AM2000). For each platform we determine the most appropriate algorithm for generating each type of number, then calculate the peak generation rate and estimated power efficiency for each device.
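
As a rough illustration of the three target distributions, the sketch below derives exponential and Gaussian variates from a simple uniform source on a CPU; the xorshift generator, inverse-CDF and Box-Muller transforms used here are illustrative choices only, not the platform-specific algorithms evaluated in the paper.

    // Illustrative only: one uniform source plus two standard transforms.
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    static const double PI = 3.14159265358979323846;
    static uint64_t rng_state = 0x2545F4914F6CDD1DULL;

    // xorshift64: a simple uniform generator in (0,1); not the paper's choice.
    double next_uniform() {
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return ((rng_state >> 11) + 1) * (1.0 / 9007199254740993.0);
    }

    // Exponential via the inverse-CDF transform.
    double next_exponential(double lambda) {
        return -std::log(next_uniform()) / lambda;
    }

    // Gaussian via the Box-Muller transform (one of its two outputs).
    double next_gaussian() {
        double u1 = next_uniform(), u2 = next_uniform();
        return std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * PI * u2);
    }

    int main() {
        std::printf("u=%f exp=%f gauss=%f\n",
                    next_uniform(), next_exponential(1.0), next_gaussian());
        return 0;
    }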


IEEE Transactions on Computers | 2010

Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study

Ben Cope; Peter Y. K. Cheung; Wayne Luk; Lee W. Howes

A systematic approach to comparing the graphics processor (GPU) and reconfigurable logic is defined in terms of three throughput drivers. The approach is applied to five case study algorithms, characterized by their arithmetic complexity, memory access requirements, and data dependence, and to two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 field programmable gate array (FPGA). A two-orders-of-magnitude speedup over a general-purpose processor is observed for each device on arithmetic-intensive algorithms. An FPGA is superior to a GPU for algorithms requiring large numbers of regular memory accesses, while the GPU is superior for algorithms with variable data reuse. In the presence of data dependence, a customized data path implemented in an FPGA exceeds GPU performance by up to eight times. Trends of the analysis towards newer and future technologies are also examined.


Field-Programmable Technology | 2003

Design space exploration with A Stream Compiler

Oskar Mencer; David J. Pearce; Lee W. Howes; Wayne Luk

We consider speeding up general-purpose applications with hardware accelerators. Traditionally, hardware accelerators are tediously hand-crafted to achieve top performance. ASC (A Stream Compiler) simplifies exploration of hardware accelerators by transforming the hardware design task into a software design process, using only gcc and make to obtain a hardware netlist. ASC enables programmers to customize hardware accelerators at three levels of abstraction: the architecture level, the functional block level, and the bit level. All three customizations are based on one uniform representation: a single C++ program with custom types and operators for each level of abstraction. This representation allows ASC users to express and reason about the design space, extract parallelism at each level, and quickly evaluate different design choices. In addition, since the user has full control over each gate-level resource in the entire design, ASC accelerator performance can always be equal to or better than hand-crafted designs, usually with much less effort. We present several ASC benchmarks, including wavelet compression and Kasumi encryption.
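
As a sketch of the "single C++ program with custom types and operators" style the abstract describes, the fragment below uses invented type names (HWint and so on); it is not the real ASC interface, only an illustration of how operator overloading over a hardware value type can express a datapath while keeping bit-level choices visible.

    #include <algorithm>
    #include <vector>

    // Hypothetical hardware integer type; the bit width is a bit-level choice.
    struct HWint {
        int bits;
        long long value;
    };

    // Operator overloading provides the functional-block-level description.
    HWint operator+(HWint a, HWint b) {
        return HWint{ std::max(a.bits, b.bits) + 1, a.value + b.value };
    }

    // Architecture level: an accumulator written as ordinary C++; a stream
    // compiler could unroll or pipeline this loop when generating a netlist.
    HWint accumulate(const std::vector<HWint>& in) {
        HWint acc{ 8, 0 };
        for (const HWint& x : in) acc = acc + x;
        return acc;
    }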


High Performance Embedded Architectures and Compilers | 2008

Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

Lee W. Howes; Anton Lokhmotov; Alastair F. Donaldson; Paul H. J. Kelly

On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the execution schedule of a computation kernel, the compiler or run-time system can derive efficient data movement, even if analysis of kernel code is difficult or impossible. We have developed a framework of C++ classes for decoupled Access/Execute specifications, allowing for automatic communication optimisations such as software pipelining and data reuse. We demonstrate the ease and efficiency of programming the Cell Broadband Engine architecture using these classes by implementing a set of benchmarks, which exhibit data reuse and non-affine access functions, and by comparing these implementations against alternative implementations, which use hand-written DMA transfers and software-based caching.
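
A minimal sketch of the decoupled Access/Execute idea follows, with invented class names rather than the framework's actual interface: the access pattern is declared separately from the kernel, so a runtime could stage data into local store (for example by DMA on the Cell SPEs) without analysing the kernel body.

    #include <cstddef>
    #include <vector>

    // Access: which input element iteration i touches (possibly non-affine).
    struct Access {
        std::size_t count;
        std::size_t (*index)(std::size_t i);
    };

    // Execute: apply the kernel under the declared access pattern. A real
    // implementation would prefetch data[access.index(i)] ahead of use.
    template <typename T, typename Kernel>
    void execute(const Access& access, const std::vector<T>& data, Kernel kernel) {
        for (std::size_t i = 0; i < access.count; ++i)
            kernel(data[access.index(i)]);
    }

    // Usage: a strided read with the computation supplied as a lambda.
    float strided_sum(const std::vector<float>& v) {
        float sum = 0.0f;
        execute(Access{ v.size() / 2, [](std::size_t i) { return 2 * i; } }, v,
                [&sum](float x) { sum += x; });
        return sum;
    }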


Field-Programmable Logic and Applications | 2006

Comparing FPGAs to Graphics Accelerators and the Playstation 2 Using a Unified Source Description

Lee W. Howes; Paul Price; Oskar Mencer; Olav Beckmann; Oliver Pell

Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony's PlayStation 2 vector units offer scope for hardware acceleration of applications. We compare the performance of these architectures using a unified description based on A Stream Compiler (ASC) for FPGAs, which has been extended to target GPUs and PS2 vector units. Programming these architectures from a single description enables us to reason about optimizations for the different architectures. Using the ASC description we implement a Monte Carlo simulation, a fast Fourier transform (FFT) and a weighted sum algorithm. Our results show that without much optimization the GPU is suited to the Monte Carlo simulation, while the weighted sum is better suited to PS2 vector units. FPGA implementations benefit particularly from architecture-specific optimizations, which ASC allows us to implement easily by adding simple annotations to the shared code.


Computing Frontiers | 2009

High-performance SIMT code generation in an active visual effects library

Jay L. T. Cornwall; Lee W. Howes; Paul H. J. Kelly; Phil Parsonage; Bruno Nicoletti

SIMT (Single-Instruction Multiple-Thread) is an emerging programming paradigm for high-performance computational accelerators, pioneered in current and next generation GPUs and hybrid CPUs. We present a domain-specific active-library supported approach to SIMT code generation and optimisation in the field of visual effects. Our approach uses high-level metadata and runtime context to guide and to ensure the correctness of optimisation-driven code transformations and to implement runtime-context-sensitive optimisations. Our advanced optimisations require no analysis of the original C++ kernel code and deliver 1.3x to 6.6x speedups over syntax-directed translation on GeForce 8800 GTX and GTX 260 GPUs with two commercial visual effects.
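
A hypothetical sketch of the kernel-plus-metadata pattern the abstract describes (names invented for illustration): the library never inspects the C++ kernel body and instead relies on declared metadata, such as the pixel footprint each output reads, to drive SIMT code generation and tiling.

    // Declarative metadata the active library can trust without code analysis.
    struct KernelMetadata {
        int read_radius_x;   // horizontal stencil radius around each output pixel
        int read_radius_y;   // vertical stencil radius
        bool pointwise;      // true if each output reads exactly one input pixel
    };

    // A plain C++ visual-effects kernel; only the metadata guides optimisation.
    struct BrightenKernel {
        static constexpr KernelMetadata metadata{ 0, 0, true };
        float gain;
        float operator()(float in) const { return gain * in; }
    };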


International Conference on Parallel Processing | 2009

Towards metaprogramming for parallel systems on a chip

Lee W. Howes; Anton Lokhmotov; Alastair F. Donaldson; Paul H. J. Kelly

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.


Field-Programmable Custom Computing Machines | 2006

FPGAs, GPUs and the PS2 - A Single Programming Methodology

Lee W. Howes; Paul Price; Oskar Mencer; Olav Beckmann

Field programmable gate arrays (FPGAs), graphics processing units (GPUs) and Sony's PlayStation 2 vector units offer scope for hardware acceleration of applications. Implementing algorithms on multiple architectures can be a long and complicated process. We demonstrate an approach to compiling for FPGAs, GPUs and PS2 vector units using a unified description based on A Stream Compiler (ASC) for FPGAs. As an example of its use we implement a Monte Carlo simulation using ASC. The unified description allows us to evaluate optimisations for specific architectures on top of a single base description, saving time and effort.


High Performance Embedded Architectures and Compilers | 2011

A systematic design space exploration approach to customising multi-processor architectures: exemplified using graphics processors

Benjamin Cope; Peter Y. K. Cheung; Wayne Luk; Lee W. Howes

A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The adoption of the approach in exploring pre- and post-fabrication customisation options to optimise an architecture's critical paths is then described. The approach and steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost.


Archive | 2011

Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems

Lee W. Howes; Anton Lokhmotov; Paul H. J. Kelly; Alastair F. Donaldson

Collaboration


Dive into Lee W. Howes's collaborations.

Top Co-Authors

Wayne Luk
Imperial College London

Oliver Pell
Imperial College London

Paul Price
Imperial College London

Ben Cope
Imperial College London