Publication


Featured research published by Lee Howes.


IEEE Computer | 2012

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

Benedict R. Gaster; Lee Howes

With the growth in transistor counts in modern hardware, heterogeneous systems are becoming commonplace, and core counts are reaching deep into the tens of cores for both GPUs and CPUs. For performance reasons, different cores in a heterogeneous platform follow different design choices: GPU cores, built for throughput computing, tend to support wide vectors and substantial register files, while CPU cores are optimized for latency, dedicating logic to caches and out-of-order dependence control. Heterogeneous parallel primitives (HPP) is an object-oriented, C++11-based programming model that addresses two major shortcomings of current GPGPU programming models on both CPUs and massively multithreaded GPUs: it supports full composability by defining abstractions using distributed arrays and barrier objects, and it increases flexibility in execution by introducing braided parallelism. The authors implemented a feature-complete version of HPP, including all syntactic constructs, running on top of a task-parallel runtime executing on the CPU. They continue to develop and improve the model, including reducing the overhead of channel management, and plan to make a public version available in the future.
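
HPP itself is not publicly available, so no code from the paper is reproduced here; the following standard C++11 sketch only illustrates the idea of braided parallelism named in the abstract, with irregular task parallelism enclosing regular data-parallel phases. All names in it are hypothetical, and HPP's real abstractions (distributed arrays, barrier objects, channels) are deliberately not modeled.

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // One "braid": an irregular task that internally executes a regular
    // data-parallel loop over its own chunk of the data.
    void braid(std::vector<float>& data, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)   // data-parallel phase
            data[i] = data[i] * 2.0f + 1.0f;
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        const std::size_t chunk = data.size() / 4;
        std::vector<std::thread> tasks;             // task-parallel phase
        for (std::size_t t = 0; t < 4; ++t)
            tasks.emplace_back(braid, std::ref(data), t * chunk,
                               (t + 1) * chunk);
        for (auto& t : tasks) t.join();
    }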


Heterogeneous Computing with OpenCL | 2013

Introduction to OpenCL

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter introduces OpenCL, the programming fabric that allows one to weave an application so that it executes concurrently. It provides an introduction to the basics of using the OpenCL standard when developing parallel programs, describes the four different abstraction models defined in the standard, and presents examples of OpenCL implementations to place some of the abstractions in context. Because OpenCL describes execution in terms of fine-grained work-items and can dispatch vast numbers of work-items on architectures with hardware support for fine-grained threading, it is easy to have concerns about scalability. The hierarchical concurrency model implemented by OpenCL, however, ensures that scalable execution can be achieved even while supporting a large number of work-items. Work-items within a workgroup have a special relationship with one another: they can perform barrier operations to synchronize, and they have access to a shared memory address space.
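
A minimal OpenCL C kernel, illustrative rather than taken from the book, showing the workgroup relationship the abstract describes: work-items stage data in shared local memory, synchronize with a barrier, and then safely read each other's elements. The kernel name and buffer layout are assumptions; launch it with a local size equal to the tile size.

    // OpenCL C kernel source, held here as a C++ raw string literal.
    static const char* kBlockReverse = R"CLC(
    __kernel void block_reverse(__global const float* in,
                                __global float* out,
                                __local  float* tile) {
        size_t lid  = get_local_id(0);
        size_t lsz  = get_local_size(0);
        size_t base = get_group_id(0) * lsz;

        tile[lid] = in[base + lid];            // cooperative load
        barrier(CLK_LOCAL_MEM_FENCE);          // group-wide synchronization
        out[base + lid] = tile[lsz - 1 - lid]; // safe cross-work-item read
    }
    )CLC";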


Architectural Support for Programming Languages and Operating Systems | 2014

KMA: A Dynamic Memory Manager for OpenCL

Roy Spliet; Lee Howes; Benedict R. Gaster; Ana Lucia Varbanescu

OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. One of the features missing in OpenCL, yet commonly required in irregular parallel applications, is dynamic memory allocation. In this paper, we propose KMA, the first dynamic memory allocator for OpenCL. KMA's design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, trees) that need constant restructuring at runtime. Taking into account both the survey findings and the status quo of OpenCL, we design KMA as a two-layer memory manager that makes smart use of the patterns we identified in our application analysis: the basic layer provides generic malloc() and free() APIs, while the higher layer provides support for building and efficiently managing dynamic data structures. Our experiments measure the performance and usability of KMA, using both microbenchmarks and a real-life case study. Results show that when dynamic allocation is mandatory, KMA is a competitive allocator. We conclude that embedding dynamic memory allocation in OpenCL is feasible, but that it is a complex, delicate task due to the massive parallelism of the platform and the portability issues between different OpenCL implementations.
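
KMA's actual implementation is not reproduced here. As a hint of what the lower malloc()-style layer must do, here is a deliberately minimal bump-pointer allocator in OpenCL C (embedded as a C++ string literal): a single atomically advanced offset into a preallocated global pool. All names are hypothetical, and a real allocator such as KMA must additionally support free(), avoid contention on a single counter, and cope with differences between OpenCL implementations.

    static const char* kHeapSource = R"CLC(
    typedef struct {
        volatile uint next;      // current bump offset into the pool
        uint capacity;           // pool size in bytes
    } heap_t;

    __global void* heap_malloc(__global heap_t* heap,
                               __global uchar* pool,
                               uint bytes) {
        uint aligned = (bytes + 15u) & ~15u;             // 16-byte align
        uint offset  = atomic_add(&heap->next, aligned); // claim a range
        if (offset + aligned > heap->capacity)
            return 0;                                    // out of memory
        return (__global void*)(pool + offset);
    }
    )CLC";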


Computing Frontiers | 2010

Efficient implementation of GPGPU synchronization primitives on CPUs

Jayanth Gummaraju; Ben Sander; Laurent Morichetti; Benedict R. Gaster; Lee Howes

The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically tens to hundreds) needing frequent synchronization. As the GPGPU model evolves to target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs, which have little hardware support for this style of synchronization; it must instead be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives on CPUs while maintaining application debuggability. Performing limit studies using real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.
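
The paper's specific techniques are not reproduced here; the sketch below shows one textbook way to emulate a workgroup barrier on a CPU without hardware support, by running a group's work-items serially on one core and splitting the kernel into phases at the barrier (loop fission). The function name and example kernel body are assumptions.

    #include <cstddef>
    #include <vector>

    // Phase 1 completes for every work-item before phase 2 begins,
    // exactly the ordering a hardware barrier would guarantee.
    void emulated_group(const float* in, float* out,
                        std::vector<float>& tile,
                        std::size_t group_size, std::size_t base) {
        // Phase 1: all code before the barrier, iterated over work-items.
        for (std::size_t lid = 0; lid < group_size; ++lid)
            tile[lid] = in[base + lid];
        // -- barrier() falls here: the loop split alone enforces it.
        // Phase 2: all code after the barrier, iterated over work-items.
        for (std::size_t lid = 0; lid < group_size; ++lid)
            out[base + lid] = tile[group_size - 1 - lid];
    }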


Heterogeneous Computing with OpenCL | 2013

OpenCL Profiling and Debugging

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter discusses OpenCL profiling and debugging. OpenCL is not limited to writing isolated high-performance kernels but can also speed up parallel applications as a whole. The chapter discusses how one can optimize kernels running on OpenCL devices by targeting features of the architecture, and how one can study the interaction between the computational kernels on the device and the host. One needs to measure performance and study an application as a whole to understand bottlenecks, since an OpenCL application can include both kernels and a large amount of input/output (IO) between the host and device. The chapter shows how the OpenCL API provides some basic features for application profiling and how operating system APIs can be used for timing sections of code. Debugging of parallel programs is traditionally more complicated than that of conventional serial code due to subtle bugs, such as race conditions, that are difficult to detect and reproduce.
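
The event-based profiling facility referred to above can be sketched with the standard OpenCL 1.x C API, as below. Error handling is omitted for brevity; `queue` and `kernel` are assumed to exist, and the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE.

    #include <CL/cl.h>
    #include <cstdio>

    // Time one kernel dispatch using OpenCL's event profiling queries.
    // Timestamps are reported by the runtime in nanoseconds.
    void time_kernel(cl_command_queue queue, cl_kernel kernel,
                     size_t global_size) {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                               nullptr, 0, nullptr, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start = 0, end = 0;
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, nullptr);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, nullptr);
        std::printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
        clReleaseEvent(evt);
    }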


Proceedings of SPIE | 2013

Vasculature segmentation using parallel multi-hypothesis template tracking on heterogeneous platforms

Dongping Zhang; Lee Howes

We present a parallel multi-hypothesis template tracking algorithm on heterogeneous platforms using a layered dispatch programming model. The contributions of this work are: an architecture-specific optimised solution for vasculature structure enhancement, an approach to segmenting the vascular lumen network from volumetric CTA images, and a layered dispatch programming model that frees developers from hand-crafting mappings to particularly constrained execution domains on high-throughput architectures. This abstraction is demonstrated through a vasculature segmentation application and can also be applied in other real-world applications. Current GPGPU programming models define a grouping concept which may lead to poorly scoped local/shared memory regions and an inconvenient approach to projecting complicated iteration spaces. To improve on this situation, we propose a simpler and more flexible programming model that leads to easier computation projections and hence a more convenient mapping of the same algorithm to a wide range of architectures. We first present an optimised image enhancement solution step-by-step, then solve a separable nonlinear least squares problem using a parallel Levenberg-Marquardt algorithm for template matching, and finally perform an energy-efficiency analysis and performance comparison on a variety of platforms, including multi-core CPUs, discrete GPUs, and APUs. We propose and discuss the efficiency of a layered-dispatch programming abstraction for mapping algorithms onto heterogeneous architectures.
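
The paper's parallel, per-template solver is not reproduced here; the following scalar sketch only shows the damped normal-equation step at the heart of any Levenberg-Marquardt iteration, (J^T J + lambda*I) d = J^T r, written out for a hypothetical two-parameter model.

    #include <cstddef>

    // One damped Gauss-Newton (Levenberg-Marquardt) step: accumulate
    // J^T J and J^T r, add damping on the diagonal, and solve the
    // resulting 2x2 system by Cramer's rule.
    void lm_step(const double* J,     // m x 2 Jacobian, row-major
                 const double* r,     // m residuals
                 std::size_t m, double lambda, double d[2]) {
        double A00 = 0, A01 = 0, A11 = 0, b0 = 0, b1 = 0;
        for (std::size_t i = 0; i < m; ++i) {
            A00 += J[2*i]   * J[2*i];
            A01 += J[2*i]   * J[2*i+1];
            A11 += J[2*i+1] * J[2*i+1];
            b0  += J[2*i]   * r[i];
            b1  += J[2*i+1] * r[i];
        }
        A00 += lambda;                       // Levenberg damping
        A11 += lambda;
        const double det = A00 * A11 - A01 * A01;
        d[0] = (A11 * b0 - A01 * b1) / det;  // parameter update
        d[1] = (A00 * b1 - A01 * b0) / det;
    }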


Heterogeneous Computing with OpenCL | 2013

OpenCL Case Study

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter discusses a classical computational kernel, convolution, which is used in many machine vision, statistics, and signal processing applications. It presents how to approach optimization of this OpenCL kernel when targeting either AMD or NVIDIA GPUs. It explores the benefits of different memory optimizations and shows that performance is heavily dependent on the underlying memory architecture of the different devices. However, for all devices considered, significant performance improvements were obtained in the computational portions of the algorithm by giving up the generality of the double convolution loops and unrolling for specific kernel sizes. In general, many performance optimizations depend on the specifics of the underlying device hardware architecture; to obtain peak performance, the programmer should be equipped with this information.
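
As an illustration of the unrolling the chapter reports as profitable (not the book's own code), here is a convolution kernel specialized for a 3x3 filter, with the generic double loop over filter rows and columns replaced by nine explicit multiply-adds. Border handling is omitted; the caller is assumed to enqueue interior pixels only.

    // OpenCL C source, held as a C++ raw string literal.
    static const char* kConv3x3 = R"CLC(
    __kernel void conv3x3(__global const float* src,
                          __global float* dst,
                          __constant float* w,   // 9 filter weights
                          int width) {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int i = y * width + x;
        dst[i] = src[i - width - 1] * w[0] + src[i - width] * w[1]
               + src[i - width + 1] * w[2] + src[i - 1]     * w[3]
               + src[i]             * w[4] + src[i + 1]     * w[5]
               + src[i + width - 1] * w[6] + src[i + width] * w[7]
               + src[i + width + 1] * w[8];
    }
    )CLC";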


Heterogeneous Computing with OpenCL | 2013

Dissecting a CPU/GPU OpenCL Implementation

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter shows a very specific mapping of OpenCL to an architectural implementation, demonstrating how OpenCL maps slightly differently to a CPU architecture and to a GPU architecture. The core principles of the chapter apply to competing CPU and GPU architectures, but significant differences in performance can easily arise from variation in vector width, thread context management, and instruction scheduling. The design of OpenCL is such that the model maps capably to a wide range of architectures, allowing for tuning and acceleration of kernel code. The OpenCL CPU runtime creates a thread to execute on each core of the CPU, forming a work pool that processes OpenCL kernels as they are generated. These worker threads are passed work by a management thread, one per queue, whose role is to remove the first entry from the queue and set up work for the workers.
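
A schematic, in standard C++11, of the runtime structure just described: one worker thread per core draining a shared pool of work fed through submit(), which plays the role of the per-queue management thread. This illustrates the pattern only, not the actual runtime's code; all names and the job granularity are hypothetical.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class WorkPool {
        std::queue<std::function<void()>> jobs_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_ = false;
        std::vector<std::thread> workers_;

    public:
        explicit WorkPool(unsigned cores) {
            for (unsigned i = 0; i < cores; ++i)
                workers_.emplace_back([this] {
                    for (;;) {
                        std::unique_lock<std::mutex> lk(m_);
                        cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                        if (jobs_.empty()) return;   // shut down once drained
                        auto job = std::move(jobs_.front());
                        jobs_.pop();
                        lk.unlock();
                        job();                       // run one slice of a kernel
                    }
                });
        }

        void submit(std::function<void()> job) {     // management-thread side
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }

        ~WorkPool() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
    };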


Heterogeneous Computing with OpenCL | 2013

Introduction to Parallel Programming

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter is an introduction to parallel programming. It is organized to address the need for teaching parallel programming on current system architectures using OpenCL as the target language, and it includes examples for CPUs, GPUs, and their integration in the accelerated processing unit (APU). Its other major goal is to provide a guide that helps programmers develop well-designed OpenCL programs targeting parallel systems. It leads the programmer through the various abstractions and features provided by the OpenCL programming environment. The examples range from a simple introduction to more complicated optimizations, and they suggest further development and goals at which to aim.


Heterogeneous Computing with OpenCL | 2013

OpenCL Device Architectures

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter discusses OpenCL device architectures: the types of architecture that OpenCL might run on and the trade-offs in the architectural design space that these architectures embody. OpenCL has been developed by a wide range of industry groups to satisfy the need for standardized programming models that can achieve good or high performance across the range of devices available on the market. OpenCL is designed to be a platform-independent application programming interface (API) at the algorithm level; consequently, at the level of kernel implementation, true platform independence in terms of performance remains a goal rather than a reality. As a developer, one needs to understand the potential advantages of different hardware features, the key runtime characteristics of these devices, and where these devices fit into the different classes of computer architectures.
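
Those runtime characteristics can be queried portably through the standard OpenCL C API; below is a minimal sketch, with error handling omitted and the first platform and device taken arbitrarily.

    #include <CL/cl.h>
    #include <cstdio>

    // Query a few of the device characteristics discussed above.
    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id dev;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &dev, nullptr);

        char name[256];
        cl_uint cus;
        cl_ulong local_mem;
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, nullptr);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, nullptr);
        clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(local_mem), &local_mem, nullptr);
        std::printf("%s: %u compute units, %llu bytes of local memory\n",
                    name, cus, (unsigned long long)local_mem);
        return 0;
    }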

Collaboration


Dive into Lee Howes's collaborations.

Top Co-Authors

Dana Schaa

Northeastern University

Vinod Tipparaju

Oak Ridge National Laboratory

Ben Sander

Advanced Micro Devices
