Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Gert-Jan van den Braak is active.

Publication


Featured research published by Gert-Jan van den Braak.


High-Performance Computer Architecture | 2014

A detailed GPU cache model based on reuse distance theory

Cedric Nugteren; Gert-Jan van den Braak; Henk Corporaal; Henri E. Bal

As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate than the GPGPU-Sim simulator.
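
For intuition, a minimal host-side sketch of the underlying idea, classical (sequential) reuse distance, is given below: for each access, count the distinct cache lines touched since the previous access to the same line. This is only the baseline theory, not the paper's GPU-aware model (thread hierarchy, latencies, associativity, MSHRs and divergence are not modelled); the 128-byte line size and the quadratic scan are simplifying assumptions.

    // Classical sequential reuse distance (baseline theory only; the paper's
    // GPU model adds thread hierarchy, latencies, associativity, MSHRs, ...).
    #include <cstdio>
    #include <cstdint>
    #include <vector>
    #include <unordered_map>
    #include <unordered_set>

    static const uint64_t LINE_BYTES = 128;  // assumed cache-line size

    // O(N*D) scan for clarity; production tools use a tree for O(N log N).
    std::vector<long> reuse_distances(const std::vector<uint64_t>& addrs) {
        std::vector<uint64_t> lines;                // cache-line id per access
        std::unordered_map<uint64_t, size_t> last;  // line id -> last position
        std::vector<long> dist;
        for (uint64_t a : addrs) {
            uint64_t line = a / LINE_BYTES;
            auto it = last.find(line);
            if (it == last.end()) {
                dist.push_back(-1);                 // cold (compulsory) miss
            } else {
                std::unordered_set<uint64_t> uniq(lines.begin() + it->second + 1,
                                                  lines.end());
                dist.push_back((long)uniq.size()); // distinct lines in between
            }
            last[line] = lines.size();
            lines.push_back(line);
        }
        return dist;  // a distance >= cache capacity (in lines) predicts a miss
    }

    int main() {
        std::vector<uint64_t> trace = {0, 128, 0, 256, 128};
        for (long d : reuse_distances(trace)) printf("%ld ", d);  // -1 -1 1 -1 2
        printf("\n");
    }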


General Purpose Processing on Graphics Processing Units | 2011

High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs

Cedric Nugteren; Gert-Jan van den Braak; Henk Corporaal; Bart Mesman

Graphics Processing Units (GPUs) are suitable for highly data parallel algorithms such as image processing, due to their massive parallel processing power. Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values taken from an input image. Histogramming has been mapped on a GPU prior to this work. Although significant research effort has been spent in optimizing the mapping, we show that the performance and performance predictability of existing methods can still be improved. In this paper, we present two novel histogramming methods, both achieving a higher performance and predictability than existing methods. We discuss performance limitations for both novel methods by exploring algorithm trade-offs. Both the novel and the existing histogramming methods are evaluated for performance. The first novel method gives an average performance increase of 33% over existing methods for non-synthetic benchmarks. The second novel method gives an average performance increase of 56% over existing methods and guarantees to be fully data independent. While the second method is specifically designed for newer GPU architectures, the first method is also suitable for older architectures.
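
For context, the sketch below shows one common GPU histogramming scheme: per-block shared-memory bins filled with atomics and merged into the global histogram. Its throughput depends on the pixel-value distribution (conflicting atomics), which is exactly the predictability issue the paper addresses; it is not one of the two methods proposed in the paper.

    // Common shared-memory histogram with atomics (data-dependent performance);
    // shown for context only, not one of the paper's two methods.
    #define BINS 256

    __global__ void histogram256(const unsigned char* img, int n, unsigned int* hist) {
        __shared__ unsigned int local[BINS];
        for (int i = threadIdx.x; i < BINS; i += blockDim.x) local[i] = 0;
        __syncthreads();

        // Each thread strides over the image and votes into the block-local bins.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&local[img[i]], 1u);
        __syncthreads();

        // Merge the block-local bins into the global histogram.
        for (int i = threadIdx.x; i < BINS; i += blockDim.x)
            atomicAdd(&hist[i], local[i]);
    }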


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

Adaptive and transparent cache bypassing for GPUs

Ang Li; Gert-Jan van den Braak; Akash Kumar; Henk Corporaal

In the last decade, GPUs have emerged to be widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs have integrated a multilevel cache hierarchy, in an attempt to reduce the amount and latency of the massive and sometimes irregular memory accesses. However, inferior performance is frequently attained due to serious congestion in the caches resulting from the huge number of concurrent threads. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints. We validate the design on seven GPU platforms that cover all existing GPU generations using 16 applications from widely used GPU benchmarks. Experiments show that our design can significantly mitigate the negative impact due to small cache sizes and improve the overall performance. We analyze the performance across different platforms and applications. We also propose some optimization guidelines on how to efficiently use the GPU caches.
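
For illustration, the sketch below shows the bypass mechanism available on NVIDIA GPUs: a global load can be cached at L2 only via the PTX "cg" policy, and a compile-time framework can choose per load (or per thread) whether to emit it. The fixed thread-index threshold used here to set the bypass degree is a hypothetical stand-in for the paper's adaptive policy.

    // L1-bypassing load via inline PTX ("cg" = cache at L2 only).
    __device__ float load_bypass_l1(const float* p) {
        float v;
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

    __global__ void scale_with_bypass(const float* in, float* out, int n,
                                      int threads_using_l1) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Only the first threads_using_l1 threads of each block use L1, so the
        // cached footprint stays close to the L1 capacity; the rest bypass it.
        float v = (threadIdx.x < threads_using_l1) ? in[i] : load_bypass_l1(in + i);
        out[i] = 2.0f * v;
    }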


Advanced Concepts for Intelligent Vision Systems | 2011

Fast Hough transform on GPUs: exploration of algorithm trade-offs

Gert-Jan van den Braak; Cedric Nugteren; Bart Mesman; Henk Corporaal

The Hough transform is a commonly used algorithm to detect lines and other features in images. It is robust to noise and occlusion, but has a large computational cost. This paper introduces two new implementations of the Hough transform for lines on a GPU. One focuses on minimizing processing time, while the other has an input-data independent processing time. Our results show that optimizing the GPU code for speed can achieve a speed-up over naive GPU code of about 10×. The implementation which focuses on processing speed is the faster one for most images, but the implementation which achieves a constant processing time is quicker for about 20% of the images.
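
For reference, the straightforward GPU mapping of Hough voting for lines is sketched below: one thread per edge pixel, voting with atomics into a (theta, rho) accumulator. The two implementations in the paper restructure this voting step; the resolution chosen here (180 angles) is arbitrary.

    // Naive Hough voting for lines: one thread per edge pixel, 180 angle bins.
    #define THETAS 180

    __global__ void hough_lines(const int* xs, const int* ys, int n_edges,
                                unsigned int* acc, int rho_bins, float rho_max) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_edges) return;
        float x = (float)xs[i], y = (float)ys[i];
        for (int t = 0; t < THETAS; ++t) {
            float theta = t * 3.14159265f / THETAS;
            float rho = x * cosf(theta) + y * sinf(theta);  // signed distance to origin
            int r = (int)((rho + rho_max) * 0.5f * (rho_bins - 1) / rho_max);
            atomicAdd(&acc[t * rho_bins + r], 1u);          // one vote per (theta, rho)
        }
    }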


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2010

Compile-time GPU memory access optimizations

Gert-Jan van den Braak; Bart Mesman; Henk Corporaal

In the last three years, GPUs have increasingly been used for general-purpose applications instead of only for computer graphics. Programming these GPUs is a big challenge; in current GPUs the main bottleneck for many applications is not the computing power, but the memory access bandwidth. Two compile-time optimizations are presented in this paper to deal with the two most important memory access issues. To describe these optimizations, a new notation of the parallel execution of GPU programs is introduced. An implementation of the optimizations shows that performance improvements of up to 40 times are possible.
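
As background, the sketch below contrasts a strided with a coalesced global-memory access pattern, the kind of bandwidth issue such compile-time optimizations target; it is a generic illustration, not the paper's actual code transformation.

    // Strided access: consecutive threads touch addresses 'stride' apart, so a
    // warp's loads split into many transactions (in[] holds n*stride floats).
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i * stride];
    }

    // Coalesced access: consecutive threads touch consecutive addresses, so a
    // warp's loads merge into a few wide transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }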


International Conference on Parallel Processing | 2012

GPU-vote: a framework for accelerating voting algorithms on GPU

Gert-Jan van den Braak; Cedric Nugteren; Bart Mesman; Henk Corporaal

Voting algorithms, such as histogram and Hough transforms, are frequently used algorithms in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU is, however, far from trivial due to irregularities and unpredictable memory accesses. Existing GPU implementations therefore target only specific voting algorithms, while we propose in this work a methodology which targets voting algorithms in general. This methodology is used in gpu-vote, a framework to accelerate current and future voting algorithms on a GPU without significant programming effort. We classify voting algorithms into four categories. We describe a transformation to merge categories, which enables gpu-vote to have a single implementation for all voting algorithms. Despite the generality of gpu-vote, being able to handle various voting algorithms, its performance is not compromised. Compared to recently published GPU implementations of the Hough transform and the histogram algorithms, gpu-vote yields an 11% and 38% lower execution time respectively. Additionally, we give an accurate and intuitive performance prediction model for the generalized GPU voting algorithm. Our model can predict the execution time of gpu-vote with an average absolute error of 5%.
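
To make the common structure concrete, the sketch below shows the generic shape of a voting algorithm (map each input element to a bin, then increment that bin atomically) that a framework like gpu-vote abstracts over. The functor-based interface and the intensity-bin example are hypothetical illustrations, not gpu-vote's actual API.

    // Generic voting kernel: input element -> bin index -> atomic increment.
    template <typename BinFunc>
    __global__ void vote_kernel(const float* input, int n,
                                unsigned int* bins, int n_bins, BinFunc to_bin) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int b = to_bin(input[i]);          // e.g. a gray level, or a (rho, theta) cell
        if (b >= 0 && b < n_bins)
            atomicAdd(&bins[b], 1u);       // the vote
    }

    // Hypothetical bin function: 256 gray-level bins for values in [0, 1].
    struct IntensityBin {
        __device__ int operator()(float v) const { return (int)(v * 255.0f); }
    };
    // Launch example: vote_kernel<<<grid, block>>>(d_in, n, d_bins, 256, IntensityBin{});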


International Conference on Supercomputing | 2015

Fine-Grained Synchronizations and Dataflow Programming on GPUs

Ang Li; Gert-Jan van den Braak; Henk Corporaal; Akash Kumar

The last decade has witnessed the blooming emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore little support is offered to enforce thread ordering and fine-grained synchronizations. This becomes an obstacle when migrating algorithms that exploit fine-grained parallelism, such as dataflow algorithms, to GPUs. In this paper, we propose a novel approach for fine-grained inter-thread synchronizations on the shared memory of modern GPUs. We demonstrate its performance and compare it with other fine-grained and medium-grained synchronization approaches. Our method achieves 1.5x speedup over the warp-barrier based approach and 4.0x speedup over the atomic spin-lock based approach on average. To further explore the possibility of realizing fine-grained dataflow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. Our implementation achieves 3.56x speedup over the atomic spin-lock implementation and 1.15x speedup over the conventional data-parallel implementation for a basic sub-grid, which implies that the fine-grained, lock-based programming pattern could be an alternative choice for designing general-purpose GPU (GPGPU) applications.
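
For illustration, the sketch below shows fine-grained producer/consumer signalling through shared-memory flags between warps of one thread block. It conveys the idea of flag- or lock-based ordering on shared memory, but it is not the paper's scheme; spinning between threads of the same warp (rather than across warps, as here) needs extra care on pre-Volta GPUs.

    // Warp-to-warp ordering through shared-memory flags (assumes blockDim.x <= 1024).
    __global__ void warp_chain(float* row, int width) {
        __shared__ volatile int done[32];        // one "produced" flag per warp
        int warp = threadIdx.x / 32, lane = threadIdx.x % 32;
        if (lane == 0) done[warp] = 0;
        __syncthreads();

        if (warp > 0 && lane == 0)
            while (done[warp - 1] == 0) { }      // spin until predecessor warp signals
        __syncwarp();                            // the rest of the warp waits on lane 0

        int i = threadIdx.x;
        if (i < width) row[i] += 1.0f;           // this warp's slice of work
        __threadfence_block();                   // make the results visible first

        __syncwarp();
        if (lane == 0) done[warp] = 1;           // release the next warp
    }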


Proceedings of International Workshop on Adaptive Self-tuning Computing Systems | 2014

Roofline-aware DVFS for GPUs

Cedric Nugteren; Gert-Jan van den Braak; Henk Corporaal

Graphics processing units (GPUs) are becoming increasingly popular for compute workloads, mainly because of their large number of processing elements and high bandwidth to off-chip memory. The roofline model captures the ratio between the two (the compute-memory ratio), an important architectural parameter. This work proposes to change the compute-memory ratio dynamically, scaling the voltage and frequency (DVFS) of 1) memory for compute-intensive workloads and 2) processing elements for memory-intensive workloads. The result is an adaptive roofline-aware GPU that increases energy efficiency (up to 58%) while maintaining performance.
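
For reference, the roofline bound the abstract builds on is: attainable performance = min(peak compute, operational intensity x peak memory bandwidth), with the ridge point at the compute-memory ratio. The host-side sketch below uses made-up peak numbers, not measurements from the paper.

    #include <algorithm>
    #include <cstdio>

    // Roofline bound: the lower of the compute roof and the memory roof.
    double roofline_gflops(double op_intensity,   // FLOP per byte of off-chip traffic
                           double peak_gflops,    // compute roof
                           double peak_gbps) {    // memory roof
        return std::min(peak_gflops, op_intensity * peak_gbps);
    }

    int main() {
        double peak_gflops = 1000.0, peak_gbps = 200.0;      // placeholder peaks
        printf("ridge point: %.1f FLOP/byte\n", peak_gflops / peak_gbps);
        // A kernel at 0.5 FLOP/byte is memory-bound: lowering the core clock
        // (or raising the memory clock) barely costs performance but saves energy.
        printf("bound at OI=0.5: %.0f GFLOP/s\n",
               roofline_gflops(0.5, peak_gflops, peak_gbps));
    }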


Design, Automation, and Test in Europe | 2013

Future of GPGPU micro-architectural parameters

Cedric Nugteren; Gert-Jan van den Braak; Henk Corporaal

As graphics processing units (GPUs) are becoming increasingly popular for general-purpose workloads (GPGPU), the question arises how such processors will evolve architecturally in the near future. In this work, we identify and discuss trade-offs for three GPU architecture parameters: active thread count, compute-memory ratio, and cluster and warp sizing. For each parameter, we propose changes to improve GPU design, keeping in mind trends such as dark silicon and the increasing popularity of GPGPU architectures. A key enabler is dynamism and workload-adaptiveness, enabling, among others, dynamic register file sizing, latency-aware scheduling, roofline-aware DVFS, run-time cluster fusion, and dynamic warp sizing.


Software and Compilers for Embedded Systems | 2013

GPU-CC: a reconfigurable GPU architecture with communicating cores

Gert-Jan van den Braak; Henk Corporaal

GPUs have evolved to programmable, energy-efficient compute accelerators for massively parallel applications. Still, compute power is lost in many applications because of cycles spent on data movement and control instead of computations on actual data. Additional cycles can be lost as well on pipeline stalls due to long-latency operations. To improve performance and energy efficiency, we introduce GPU-CC: a reconfigurable GPU architecture with communicating cores. It is based on a contemporary GPU, which can still be used as such, but also has the ability to reorganize the cores of a GPU in a reconfigurable network. In GPU-CC, data movement and control is implicit in the configuration of the communication network. Additionally, each core executes a fixed instruction, reducing the instruction decode count and increasing energy efficiency. We show a large performance potential for GPU-CC, e.g. 1.9x and 2.4x for a 3x3 and 5x5 convolution application. The hardware cost of GPU-CC is mainly determined by the buffers in the added network, which amounts to 12.4% of extra memory space.
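
For contrast, a conventional 3x3 convolution kernel is sketched below: the nine multiply-accumulates per output pixel are surrounded by index arithmetic, loads and loop control, which is the data-movement and control overhead that GPU-CC makes implicit in the configuration of its core network. The kernel is a generic illustration, not code from the paper.

    // Conventional 3x3 convolution: most instructions are addressing, loads and
    // loop control rather than the nine multiply-accumulates themselves.
    __global__ void conv3x3(const float* in, float* out, int w, int h,
                            const float* k /* 3x3, row-major */) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;  // skip the border
        float acc = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                acc += in[(y + dy) * w + (x + dx)] * k[(dy + 1) * 3 + (dx + 1)];
        out[y * w + x] = acc;
    }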

Collaboration


Dive into Gert-Jan van den Braak's collaborations.

Top Co-Authors

Henk Corporaal, Eindhoven University of Technology
Cedric Nugteren, Eindhoven University of Technology
Bart Mesman, Eindhoven University of Technology
Ang Li, Eindhoven University of Technology
Akash Kumar, Dresden University of Technology
Carlos A. Alba Pinto, Eindhoven University of Technology
Henri E. Bal, VU University Amsterdam
T Drijvers, Eindhoven University of Technology