Publication


Featured research published by Benedict R. Gaster.


IEEE Computer | 2012

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

Benedict R. Gaster; Lee Howes

With the growth in transistor counts in modern hardware, heterogeneous systems are becoming commonplace. Core counts are increasing such that GPU and CPU designs reach deep into the tens of cores. For performance reasons, different cores in a heterogeneous platform follow different design choices: driven by throughput computing goals, GPU cores tend to support wide vectors and substantial register files, while CPU cores are optimized for latency, dedicating logic to caches and out-of-order dependence control. Heterogeneous parallel primitives (HPP) is an object-oriented, C++11-based programming model that addresses two major shortcomings of current GPGPU programming models on both CPUs and massively multithreaded GPUs: it supports full composability by defining abstractions using distributed arrays and barrier objects, and it increases flexibility in execution by introducing braided parallelism. The authors implemented a feature-complete version of HPP, including all syntactic constructs, running on top of a task-parallel runtime executing on the CPU. They continue to develop and improve the model, including reducing the overhead of channel management, and plan to make a public version available in the future.
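
To make the braided-parallelism idea concrete, here is a minimal sketch in plain C++11 of a task tree whose leaves perform data-parallel work. It is illustrative only: the function and variable names are hypothetical, and HPP's actual distributed-array, barrier-object, and channel abstractions are not reproduced.

// Braided parallelism sketch: task parallelism at the top, data-parallel
// leaves underneath. Plain C++11, not HPP's real API.
#include <future>
#include <numeric>
#include <vector>

// Data-parallel leaf: sum a slice of the input (stands in for GPU work).
static long sum_slice(const std::vector<int>& v, size_t lo, size_t hi) {
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0L);
}

int main() {
    std::vector<int> data(1 << 20, 1);
    size_t mid = data.size() / 2;

    // Task parallelism: two independent tasks spawned asynchronously...
    auto left  = std::async(std::launch::async, sum_slice, std::cref(data), 0, mid);
    auto right = std::async(std::launch::async, sum_slice, std::cref(data), mid, data.size());

    // ...and joined here. Each task could itself dispatch data-parallel
    // work, which is the "braiding" of the two styles.
    long total = left.get() + right.get();
    return total == (long)data.size() ? 0 : 1;
}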


Heterogeneous Computing with OpenCL | 2013

Introduction to OpenCL

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter introduces OpenCL, the programming fabric that allows one to weave applications that execute concurrently. It provides an introduction to the basics of using the OpenCL standard when developing parallel programs, describes the four abstraction models defined in the standard, and presents examples of OpenCL implementations to place some of the abstractions in context. Because OpenCL describes execution in terms of fine-grained work-items and can dispatch vast numbers of them on architectures with hardware support for fine-grained threading, it is easy to have concerns about scalability. The hierarchical concurrency model implemented by OpenCL ensures that scalable execution can be achieved even with a large number of work-items. Work-items within a work-group have a special relationship with one another: they can perform barrier operations to synchronize, and they have access to a shared local memory address space.
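
As a concrete illustration of these concepts, the following is a minimal OpenCL C kernel (the kernel and argument names are illustrative, not from the chapter) in which the work-items of a work-group cooperate through shared local memory and synchronize with barriers to produce one partial sum per group.

// Work-items in a work-group cooperate through __local memory and a
// barrier to reduce a block of the input to a single partial sum.
__kernel void block_sum(__global const float* in,
                        __global float* partial,  // one result per work-group
                        __local  float* scratch)  // shared within the group
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);           // all work-items reach this point

    // Tree reduction within the work-group.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}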


International Conference on Computer-Aided Design | 2011

GPU programming for EDA with OpenCL

Rasit Onur Topaloglu; Benedict R. Gaster

Graphics processing unit (GPU) computing has been an active area of research in the last few years. While the initial adopters of the technology came from the image-processing domain, owing to the difficulty of programming GPUs, research on programming languages has made it possible for people without knowledge of low-level graphics APIs such as OpenGL to develop code for GPUs. Two main GPU architectures, from AMD (formerly ATI) and NVIDIA, gained ground. AMD adapted Stanford's Brook language into an architecture-agnostic programming model, while NVIDIA brought its CUDA framework to a wide audience. While the two languages have their pros and cons, such as Brook not scaling as well and CUDA having to account for architecture-level decisions, it has not been possible to compile code written for one architecture on another, or across platforms. Another opportunity came with the idea of combining one or more CPUs and GPUs on the same die; by eliminating some of the interconnect-bandwidth issues, this combination makes it possible to offload highly parallel tasks to the GPU. The technological shift toward multicore CPU-only architectures also requires a change in programming methodology and acts as a catalyst for suitable programming languages. Hence, a unified language that can be used on multicore CPUs, on GPUs, and on their combinations has gained interest. The Open Computing Language (OpenCL), originally developed by Apple and standardized by the Khronos Group with support from both AMD and NVIDIA, is seen as the programming language of choice for parallel programming. In this paper, we provide the motivation for our tutorial talk on the use of OpenCL for GPUs, highlight key features of the language, and suggest research directions for OpenCL in EDA. In our tutorial talk, we use EDA as the application domain to get readers started with programming the rising language of parallelism, OpenCL.
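
The portability feature highlighted above is visible even in the first host-side calls: the same enumeration code discovers whatever AMD, NVIDIA, CPU, or GPU devices happen to be installed. The sketch below uses only standard OpenCL 1.x host API calls; error handling is abbreviated for brevity.

// Enumerate every installed OpenCL platform and its devices, CPU or GPU
// alike, using the vendor-neutral host API.
#include <cstdio>
#include <CL/cl.h>

int main() {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_device_id devs[8];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           8, devs, &ndev) != CL_SUCCESS)
            continue;  // platform present but no usable devices
        for (cl_uint d = 0; d < ndev; ++d) {
            char name[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            std::printf("platform %u, device %u: %s\n",
                        (unsigned)p, (unsigned)d, name);
        }
    }
    return 0;
}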


Architectural Support for Programming Languages and Operating Systems | 2014

KMA: A Dynamic Memory Manager for OpenCL

Roy Spliet; Lee Howes; Benedict R. Gaster; Ana Lucia Varbanescu

OpenCL is becoming a popular choice for the parallel programming of both multi-core CPUs and GPGPUs. One of the features missing from OpenCL, yet commonly required in irregular parallel applications, is dynamic memory allocation. In this paper, we propose KMA, a first dynamic memory allocator for OpenCL. KMA's design is based on a thorough analysis of a set of 11 algorithms, which shows that dynamic memory allocation is a necessary commodity, typically used for implementing complex data structures (arrays, lists, trees) that need constant restructuring at runtime. Taking into account both the survey findings and the status quo of OpenCL, we design KMA as a two-layer memory manager that makes smart use of the patterns we identified in our application analysis: its basic layer provides generic malloc() and free() APIs, while its higher layer provides support for building and efficiently managing dynamic data structures. Our experiments measure the performance and usability of KMA, using both microbenchmarks and a real-life case study. The results show that when dynamic allocation is mandatory, KMA is a competitive allocator. We conclude that embedding dynamic memory allocation in OpenCL is feasible, but that it is a complex, delicate task due to the massive parallelism of the platform and the portability issues between different OpenCL implementations.
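
KMA's two-layer design is not reproduced here, but the basic feasibility point, that work-items can carve blocks out of a pre-allocated buffer with an atomic operation, can be sketched in a few lines of OpenCL C. The kernel below is a hypothetical bump (arena) allocator, the simplest possible scheme, and deliberately omits free().

/* Minimal bump allocator in OpenCL C: a single atomically advanced offset
 * into a pre-allocated heap buffer. This illustrates why GPU-side malloc()
 * is possible at all; KMA's real design layers richer policies on top. */
__kernel void demo(__global uchar* heap,         /* pre-allocated arena    */
                   __global volatile uint* top,  /* next free byte offset  */
                   uint heap_size)
{
    uint nbytes = 64;                      /* per-work-item request size   */
    uint off = atomic_add(top, nbytes);    /* reserve nbytes atomically    */

    if (off + nbytes <= heap_size) {
        __global uchar* block = heap + off;        /* this item's block    */
        block[0] = (uchar)get_global_id(0);        /* use the allocation   */
    }
    /* A pure bump allocator cannot free individual blocks; the whole arena
     * is reset between kernel launches. */
}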


Computing Frontiers | 2010

Efficient implementation of GPGPU synchronization primitives on CPUs

Jayanth Gummaraju; Ben Sander; Laurent Morichetti; Benedict R. Gaster; Lee Howes

The GPGPU model represents a style of execution in which thousands of threads execute in a data-parallel fashion, with large subsets (typically tens to hundreds) needing frequent synchronization. As the GPGPU model evolves to target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs: CPUs have little hardware support for such synchronization, so it must be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives efficiently on CPUs while maintaining application debuggability. Performing limit studies on real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.
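
As a baseline illustration of what "emulated in software" means, the following is a sense-reversing spin barrier written with portable C++11 atomics, one conventional way to provide a GPU-style work-group barrier on a CPU. This is a textbook construction, not the paper's optimized primitive.

// Sense-reversing barrier: reusable across phases, no OS-level support.
#include <atomic>
#include <thread>
#include <vector>

class SpinBarrier {
    std::atomic<int>  count_;
    std::atomic<bool> sense_;
    const int         n_;
public:
    explicit SpinBarrier(int n) : count_(n), sense_(false), n_(n) {}
    void wait() {
        bool my_sense = !sense_.load(std::memory_order_relaxed);
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            count_.store(n_, std::memory_order_relaxed);   // last one resets
            sense_.store(my_sense, std::memory_order_release);  // release all
        } else {
            while (sense_.load(std::memory_order_acquire) != my_sense)
                std::this_thread::yield();                 // spin politely
        }
    }
};

int main() {
    SpinBarrier bar(4);
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i)
        ts.emplace_back([&] { bar.wait(); /* post-barrier phase here */ });
    for (auto& t : ts) t.join();
}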


Heterogeneous Computing with OpenCL | 2013

OpenCL Profiling and Debugging

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter discusses OpenCL profiling and debugging. OpenCL is not limited to writing isolated high-performance kernels; it can also speed up parallel applications as a whole. The chapter discusses how one can optimize kernels running on OpenCL devices by targeting features of the architecture, and how one can study the interaction between the computational kernels on the device and the host. One needs to measure performance and study an application as a whole to understand bottlenecks: an OpenCL application can include kernels together with a large amount of input/output (I/O) between the host and the device. The chapter covers the basic application-profiling features that the OpenCL API provides and shows how operating system APIs can be used to time sections of code. Debugging parallel programs is traditionally more complicated than debugging conventional serial code because of subtle bugs, such as race conditions, that are difficult to detect and reproduce.
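
The basic OpenCL profiling feature referred to above is event profiling. The helper below uses only standard API calls (clEnqueueNDRangeKernel, clWaitForEvents, clGetEventProfilingInfo); it assumes the caller created the command queue with CL_QUEUE_PROFILING_ENABLE, and error checking is omitted for brevity.

#include <CL/cl.h>

/* Returns the device-side execution time of `kernel`, in milliseconds.
 * `queue` must have been created with CL_QUEUE_PROFILING_ENABLE. */
double time_kernel_ms(cl_command_queue queue, cl_kernel kernel, size_t gsize)
{
    cl_event evt;
    cl_ulong t_start = 0, t_end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, &evt);
    clWaitForEvents(1, &evt);                    /* wait for completion */

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(t_end), &t_end, NULL);
    clReleaseEvent(evt);
    return (double)(t_end - t_start) * 1e-6;     /* nanoseconds to ms */
}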


International Journal of Parallel Programming | 2010

Compilation Techniques for High Level Parallel Code

Benedict R. Gaster; Tim Bainbridge; David Lacey; David Gardner

This paper describes methods for adapting existing optimizing compilers for sequential languages to produce code for parallel processors. In particular, it looks at targeting data-parallel processors using SIMD (single instruction, multiple data) or vector processors, where users need features similar to high-level control flow across the data-parallelism. The premise of the paper is that we do not want to write an optimizing compiler from scratch; rather, a method is described that allows a developer to take an existing compiler for a sequential language and modify it to handle SIMD extensions. As well as modifications to the front end, the intermediate representation, and the code generator to handle the parallelism, specific optimizations are described to target the architecture efficiently.
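
One representative transformation such a compiler applies when mapping high-level control flow onto SIMD hardware is if-conversion. The hand-written, hypothetical C example below shows the idea: the branchy loop and the predicated loop are equivalent, but the second has no branch in its body, so every SIMD lane can execute it unconditionally under a mask.

/* Original, branchy form: divergent control flow per element. */
void scale_pos(float* x, int n) {
    for (int i = 0; i < n; ++i) {
        if (x[i] > 0.0f)
            x[i] = x[i] * 2.0f;
        else
            x[i] = 0.0f;
    }
}

/* If-converted form: compute both sides, select by predicate. */
void scale_pos_predicated(float* x, int n) {
    for (int i = 0; i < n; ++i) {
        float taken     = x[i] * 2.0f;
        float not_taken = 0.0f;
        int   pred      = x[i] > 0.0f;
        x[i] = pred ? taken : not_taken;   /* lowers to a SIMD select */
    }
}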


Heterogeneous Computing with OpenCL | 2013

OpenCL Case Study

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter discusses a classical computational kernel, convolution, which is used in many machine-vision, statistics, and signal-processing applications. It presents how to approach optimization of this OpenCL kernel when targeting either AMD or NVIDIA GPUs. It explores the benefits of different memory optimizations and shows that performance is heavily dependent on the underlying memory architecture of the different devices. For all devices considered, however, significant performance improvements were obtained in the computational portions of the algorithm by giving up the generality of the double convolution loops and unrolling for specific kernel sizes. In general, many performance optimizations depend on the specifics of the underlying device hardware architecture; to obtain peak performance, the programmer should be equipped with this information.
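
A sketch of the general starting point the chapter describes, with illustrative names: a 2D convolution kernel in OpenCL C whose double loop runs over a runtime filter width. The optimization discussed above amounts to making filter_w a compile-time constant (say, 3) so the compiler can fully unroll both loops.

/* General 2D convolution; the host sizes the NDRange so that x + filter_w
 * and y + filter_w stay inside the image. */
__kernel void convolve(__global const float* img,
                       __global const float* filt,
                       __global float* out,
                       int width, int filter_w)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    float acc = 0.0f;
    for (int fy = 0; fy < filter_w; ++fy)      /* both loops fully unroll  */
        for (int fx = 0; fx < filter_w; ++fx)  /* once filter_w is a       */
            acc += filt[fy * filter_w + fx] *  /* compile-time constant    */
                   img[(y + fy) * width + (x + fx)];
    out[y * width + x] = acc;
}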


Heterogeneous Computing with OpenCL | 2013

Dissecting a CPU/GPU OpenCL Implementation

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter shows a very specific mapping of OpenCL to an architectural implementation, examining how OpenCL maps slightly differently to a CPU architecture and to a GPU architecture. The core principles of the chapter apply to competing CPU and GPU architectures, but significant differences in performance can easily arise from variation in vector width, thread-context management, and instruction scheduling. OpenCL is designed so that the model maps capably to a wide range of architectures, allowing for tuning and acceleration of kernel code. The OpenCL CPU runtime creates a thread to execute on each core of the CPU as a work pool to process OpenCL kernels as they are generated. These worker threads are passed work by a core management thread for each queue, whose role is to remove the first entry from the queue and set up work for the worker threads.
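
A simplified sketch of the work distribution described above, with hypothetical names: a single CPU worker thread executes all work-items of a work-group as a serial loop. This is valid only for barrier-free kernels; real runtimes split the loop or switch lightweight contexts at barrier calls.

// One core thread runs an entire work-group serially (no-barrier case).
#include <cstddef>
#include <functional>

struct WorkItemCtx { size_t global_id; size_t local_id; };

// `kernel` stands in for the compiled kernel body.
void run_work_group(const std::function<void(const WorkItemCtx&)>& kernel,
                    size_t group_id, size_t local_size)
{
    for (size_t lid = 0; lid < local_size; ++lid) {
        WorkItemCtx ctx{group_id * local_size + lid, lid};
        kernel(ctx);   // work-items of one group serialized on one thread
    }
}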


Heterogeneous Computing with OpenCL | 2013

Introduction to Parallel Programming

Benedict R. Gaster; Lee Howes; David R. Kaeli; Perhaad Mistry; Dana Schaa

This chapter is an introduction to parallel programming. It is organized to address the need for teaching parallel programming on current system architectures using OpenCL as the target language, and it includes examples for CPUs, GPUs, and their integration in the accelerated processing unit (APU). Its other major goal is to guide programmers in developing well-designed OpenCL programs that target parallel systems. It leads the programmer through the various abstractions and features provided by the OpenCL programming environment. The examples range from a simple introduction to more complicated optimizations, and they suggest further development and goals to aim for.
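
For reference, the simple introduction such a chapter typically opens with is element-wise vector addition, where each work-item computes one output element. The kernel below is the canonical form of that example; the names are illustrative.

/* One work-item per element: the "hello world" of OpenCL. The host
 * launches a 1D NDRange equal to the vector length. */
__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c)
{
    size_t i = get_global_id(0);  /* this work-item's element index */
    c[i] = a[i] + b[i];
}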

Collaboration


Dive into Benedict R. Gaster's collaborations.

Top Co-Authors

Lee Howes (Advanced Micro Devices)
Dana Schaa (Northeastern University)
David A. Wood (University of Wisconsin-Madison)