Ben H. H. Juurlink | Researchain

Publication

Featured research published by Ben H. H. Juurlink.

IEEE Transactions on Circuits and Systems for Video Technology | 2012

Parallel Scalability and Efficiency of HEVC Parallelization Approaches

Chi Ching Chi; Mauricio Alvarez-Mesa; Ben H. H. Juurlink; Gordon Clare; Félix Henry; Stéphane Pateux; Thomas Schierl

Unlike H.264/advanced video coding, where parallelism was an afterthought, High Efficiency Video Coding currently contains several proposals aimed at making it more parallel-friendly. A performance comparison of the different proposals, however, has not yet been performed. In this paper, we will fill this gap by presenting efficient implementations of the most promising parallelization proposals, namely tiles and wavefront parallel processing (WPP). In addition, we present a novel approach called overlapped wavefront (OWF), which achieves higher performance and efficiency than tiles and WPP. Experiments conducted on a 12-core system running at 3.33 GHz show that our implementations achieve average speedups, for 4k sequences, of 8.7, 9.3, and 10.7 for WPP, tiles, and OWF, respectively.
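As a rough illustration of what the reported speedups imply, the sketch below computes parallel efficiency using the common definition of speedup divided by core count. Whether the paper uses exactly this definition is an assumption; the speedups and the core count are taken from the abstract, and the function name is hypothetical.

```python
# Speedups for 4k sequences on a 12-core system, as quoted in the abstract.
CORES = 12
SPEEDUPS = {"WPP": 8.7, "tiles": 9.3, "OWF": 10.7}


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Common textbook definition: fraction of ideal linear speedup achieved."""
    return speedup / cores


# Efficiency per approach, rounded to three decimals for readability.
efficiency = {
    name: round(parallel_efficiency(s, CORES), 3) for name, s in SPEEDUPS.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.1%}")
```

Under this definition the three approaches reach about 72.5%, 77.5%, and 89.2% of ideal linear scaling, respectively.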

Parallel Computing | 2003

The Paderborn University BSP (PUB) library

Olaf Bonorden; Ben H. H. Juurlink; Ingo von Otte; Ingo Rieping

The Paderborn University BSP (PUB) library is a C communication library based on the BSP model. The basic library supports buffered as well as unbuffered non-blocking communication between any pair of processors and a mechanism for synchronizing the processors in a barrier style. In addition, PUB provides non-blocking collective communication operations on arbitrary subsets of processors, the ability to partition the processors into independent groups that execute asynchronously from each other, and a zero-cost synchronization mechanism. Furthermore, some techniques used in the implementation of the PUB library deviate significantly from the techniques used in other BSP libraries.
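The BSP model that the library builds on can be pictured as a sequence of supersteps, each ending in a barrier. The sketch below is a minimal simulation in Python threads, not the PUB API: the function name, the inbox lists, and the use of `threading.Barrier` are all assumptions made for illustration.

```python
import threading


def bsp_total_sum(values):
    """Simulate p BSP processors; each ends up holding the sum of all values."""
    p = len(values)
    barrier = threading.Barrier(p)            # barrier-style synchronization
    inboxes = [[] for _ in range(p)]          # one message buffer per processor
    locks = [threading.Lock() for _ in range(p)]
    results = [None] * p

    def worker(pid):
        # Superstep 1: communication phase, send the local value to everyone.
        for dest in range(p):
            with locks[dest]:
                inboxes[dest].append(values[pid])
        # The barrier ends the superstep; all messages are delivered after it.
        barrier.wait()
        # Superstep 2: purely local computation on the received messages.
        results[pid] = sum(inboxes[pid])

    threads = [threading.Thread(target=worker, args=(pid,)) for pid in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(bsp_total_sum([1, 2, 3, 4]))
```

The point of the sketch is the structure: communication is only guaranteed to be complete once every processor has passed the barrier, which is what makes BSP programs easy to reason about.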

Signal Processing Systems | 2009

Parallel Scalability of Video Decoders

Cor Meenderinck; Arnaldo Azevedo; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Alex Ramirez

An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show which types of parallelism can be exploited. We also show that previously proposed parallelization strategies, such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range, we propose a new parallelization strategy called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
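To see why intra-frame MB-level parallelism alone is limited, the following sketch schedules the macroblocks of one frame in wavefront order. It assumes a simplified dependency pattern (each MB waits for its left, top, and top-right neighbours) and unit decoding time per MB; these assumptions and the function names are illustrative and not taken from the paper.

```python
from collections import Counter


def wavefront_schedule(width_mbs: int, height_mbs: int) -> dict:
    """Assign each macroblock (x, y) a time slot that respects its dependencies.

    With left, top, and top-right dependencies, slot x + 2*y is always later
    than the slots of all three neighbours, so the schedule is valid.
    """
    return {
        (x, y): x + 2 * y
        for y in range(height_mbs)
        for x in range(width_mbs)
    }


def max_parallel_mbs(width_mbs: int, height_mbs: int) -> int:
    """Largest number of macroblocks that share a single time slot."""
    slots = Counter(wavefront_schedule(width_mbs, height_mbs).values())
    return max(slots.values())


if __name__ == "__main__":
    # A 1920x1080 frame has 120 x 68 macroblocks of 16x16 pixels.
    print(max_parallel_mbs(120, 68))
```

For a 1920x1080 frame this simplified model peaks at 60 concurrent macroblocks, far below the 4000 to 7000 reported once macroblocks of consecutive frames are also decoded in parallel.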

International Conference on Acoustics, Speech, and Signal Processing | 2012

Parallel video decoding in the emerging HEVC standard

Mauricio Alvarez-Mesa; Chi Ching Chi; Ben H. H. Juurlink; Valeri George; Thomas Schierl

In this paper we propose and evaluate a parallelization strategy for the emerging HEVC video coding standard. The proposed strategy is based on entropy slices, which allow parallelism to be exploited in the entropy decoding stage while maintaining high coding efficiency. Our approach requires videos to be encoded with one entropy slice per LCU row so that multiple LCU rows can be decoded in a wavefront parallel manner. Evaluations performed on a PC with 12 Intel Xeon cores running at 3.3 GHz show that it is possible to achieve real-time performance for 1920×1080p50 (53.1 fps) and 2560×1600 (29.5 fps) video resolutions, with speedups of 5.2× and 6.3× compared to sequential execution, respectively.

International Parallel Processing Symposium | 1999

The Paderborn University BSP (PUB) library - design, implementation and performance

Olaf Bonorden; Ben H. H. Juurlink; Ingo von Otte; Ingo Rieping

The Paderborn University BSP (PUB) library is a parallel C library based on the BSP model. The basic library supports buffered and unbuffered asynchronous communication between any pair of processors, and a mechanism for synchronizing the processors in a barrier style. In addition, it provides routines for collective communication on arbitrary subsets of processors, partition operations, and a zero-cost synchronization mechanism. Furthermore, some techniques used in its implementation deviate significantly from the techniques used in other BSP libraries.

Computing Frontiers | 2005

Matrix register file and extended subwords: two techniques for embedded media processors

Asadollah Shahbahrami; Ben H. H. Juurlink; Stamatis Vassiliadis

In this paper we employ two techniques suitable for embedded media processors. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids the packing/unpacking conversion overhead caused by the mismatch between storage and computational formats. The second technique, the Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many block-based multimedia kernels such as (I)DCT, 2x2 Haar Transform, and pixel padding. In addition, we propose a few new media instructions. We employ Modified MMX (MMMX), MMX with extended subwords, to evaluate these techniques. Our results show that MMMX combined with an MRF reduces the dynamic number of instructions by up to 80% compared to other multimedia extensions such as MMX.
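The effect of extended subwords can be sketched by comparing lane-wise addition in 8-bit lanes with the same addition in 12-bit lanes (8 data bits plus the four extra bits). This is a toy model in Python, not MMMX itself: the helper name and the sample pixel values are hypothetical, and wrap-around is assumed for overflow.

```python
def add_lanes(a, b, lane_bits):
    """Lane-wise addition that wraps around at the given lane width."""
    mask = (1 << lane_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


# Hypothetical unsigned 8-bit pixel values held in four lanes.
pixels_a = [200, 255, 17, 128]
pixels_b = [100, 255, 3, 128]

wrapped = add_lanes(pixels_a, pixels_b, 8)    # plain byte lanes overflow
extended = add_lanes(pixels_a, pixels_b, 12)  # extended subwords keep the sum

print(wrapped)
print(extended)
```

With 12-bit lanes up to 16 unsigned bytes can be accumulated before a lane can overflow, since 16 × 255 = 4080 still fits below 4096. Plain byte lanes would instead need the operands unpacked to a wider format first.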

High Performance Embedded Architectures and Compilers | 2008

Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez

In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, investigate application scalability on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.

International Symposium on Microarchitecture | 2010

The SARC Architecture

Alex Ramirez; Felipe Cabarcas; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Friman Sánchez; Arnaldo Azevedo; Cor Meenderinck; Catalin Bogdan Ciobanu; Sebastian Isaza; Georgi Gaydadjiev

The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.

International Symposium on Performance Analysis of Systems and Software | 2013

How a single chip causes massive power bills GPUSimPow: A GPGPU power simulator

Jan Lucas; Sohan Lal; Michael Andersch; Mauricio Alvarez-Mesa; Ben H. H. Juurlink

Modern GPUs are true powerhouses in every meaning of the word: while they offer general-purpose (GPGPU) compute performance an order of magnitude higher than that of conventional CPUs, they have also been rapidly approaching the infamous “power wall”, as a single chip sometimes consumes more than 300 W. Thus, the design space of GPGPU microarchitecture has been extended by another dimension: power. While GPU researchers have previously relied on cycle-accurate simulators for estimating performance during design cycles, there are no simulation tools that include power as well. To mitigate this issue, we introduce the GPUSimPow power estimation framework for GPGPUs, consisting of both analytical and empirical models for regular and irregular hardware components. To validate this framework, we build a custom measurement setup to obtain power numbers from real graphics cards. An evaluation on a set of well-known benchmarks reveals an average relative error of 11.7% between simulated and hardware power for GT240 and an average relative error of 10.8% for GTX580. The simulator has been made available to the public [1].

International Conference on Parallel and Distributed Systems | 2009

Scalability of Macroblock-level Parallelism for H.264 Decoding

Mauricio Alvarez Mesa; Alex Ramirez; Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Mateo Valero

This paper investigates the scalability of MacroBlock (MB) level parallelization of the H.264 decoder for High Definition (HD) applications. The study has three parts. First, a formal model predicts the maximum performance that can be obtained, taking into account the variable processing time of tasks and thread synchronization overhead. Second, an implementation on a real multiprocessor architecture is evaluated, including a comparison of different scheduling strategies and a profiling analysis that identifies the performance bottlenecks. Finally, a trace-driven simulation methodology is used to identify acceleration opportunities for removing the main bottlenecks, covering the acceleration potential of the entropy decoding stage and of thread synchronization and scheduling. Our study presents a quantitative analysis of the main bottlenecks of the application and estimates the acceleration levels that are required to make the MB-level parallel decoder scalable.