Cor Meenderinck
Delft University of Technology
Publications
Featured research published by Cor Meenderinck.
international conference on supercomputing | 2010
Chi Ching Chi; Ben H. H. Juurlink; Cor Meenderinck
How to develop efficient and scalable parallel applications is the key challenge for emerging many-core architectures. We investigate this question by implementing and comparing two parallel H.264 decoders on the Cell architecture. It is expected that future many-cores will use a Cell-like local store memory hierarchy, rather than a non-scalable shared memory. The two implemented parallel algorithms, the Task Pool (TP) and the novel Ring-Line (RL) approach, both exploit macroblock-level parallelism. The TP implementation follows the master-slave paradigm and is very dynamic so that in theory perfect load balancing can be achieved. The RL approach is distributed and more predictable in the sense that the mapping of macroblocks to processing elements is fixed. This makes it possible to better exploit data locality, to overlap communication with computation, and to reduce communication and synchronization overhead. While TP is more scalable in theory, the actual scalability favors RL. Using 16 SPEs, RL obtains a scalability of 12x, while TP achieves only 10.3x. More importantly, the absolute performance of RL is much higher. Using 16 SPEs, RL achieves a throughput of 139.6 frames per second (fps) while TP achieves only 76.6 fps. A large part of the additional performance advantage is due to hiding the memory latency. From the results we conclude that in order to fully leverage the performance of future many-cores, a centralized master should be avoided and the mapping of tasks to cores should be predictable in order to be able to hide the memory latency.
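The Ring-Line idea can be illustrated with a small sketch. The following is a simplified, hypothetical C version (not the paper's Cell/SPE code; all names and sizes are ours): macroblock lines are assigned to cores in round-robin order, and before decoding a macroblock each core only waits, via a per-line progress counter, until its ring predecessor has finished the upper-right neighbour of that macroblock.

/* Ring-Line sketch: static round-robin mapping of macroblock (MB) lines
 * to cores; synchronization only with the predecessor in the ring. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MB_WIDTH   120      /* MBs per line (e.g. 1920/16)       */
#define MB_HEIGHT   68      /* MB lines per frame (e.g. 1088/16) */
#define NUM_CORES   16

/* progress[y] = number of MBs of line y that are fully decoded. */
static atomic_int progress[MB_HEIGHT];

static void decode_mb(int x, int y) { (void)x; (void)y; /* kernel omitted */ }

/* Core 'id' decodes all lines y with y % NUM_CORES == id. */
static void ring_line_worker(int id)
{
    for (int y = id; y < MB_HEIGHT; y += NUM_CORES) {
        for (int x = 0; x < MB_WIDTH; x++) {
            /* Intra-frame dependency: MB(x,y) needs MB(x+1,y-1),
             * which is produced by the predecessor core in the ring. */
            if (y > 0) {
                int need = (x + 2 < MB_WIDTH) ? x + 2 : MB_WIDTH;
                while (atomic_load(&progress[y - 1]) < need)
                    ;   /* spin; a real SPE version overlaps DMA here */
            }
            decode_mb(x, y);
            atomic_fetch_add(&progress[y], 1);
        }
    }
}

static void *worker_entry(void *arg)
{
    ring_line_worker((int)(size_t)arg);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_CORES];
    for (int i = 0; i < NUM_CORES; i++)
        pthread_create(&t[i], NULL, worker_entry, (void *)(size_t)i);
    for (int i = 0; i < NUM_CORES; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Because the mapping is fixed, each core knows in advance which lines it will process, which is what makes prefetching, and hence hiding the memory latency, possible.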
international symposium on circuits and systems | 2008
Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Mauricio Alvarez; Alex Ramirez
In this paper an analysis of bi-dimensional video filtering on the Cell Broadband Engine processor is presented. To evaluate the processor, a highly adaptive filtering algorithm was chosen: the deblocking filter of the H.264 video compression standard. The baseline version is a scalar implementation extracted from the FFmpeg H.264 decoder. The scalar version was vectorized using the SIMD instructions of the Cell Synergistic Processing Element (SPE) and with AltiVec instructions for the Power Processor Element. Results show that approximately one third of the processing time of the SPE SIMD version is used for transposition and data packing and unpacking. Despite the required SIMD overhead and the high adaptivity of the kernel, the SIMD version of the kernel is 2.6 times faster than the scalar versions.
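To make the transposition overhead concrete, here is a minimal, hypothetical C sketch (not the paper's SPE SIMD or AltiVec code): the vertical-edge deblocking filter operates on pixel columns, while SIMD loads fetch contiguous rows, so each block must be transposed before filtering and transposed back afterwards. That pack/transpose/unpack work is what accounts for roughly one third of the measured SPE time.

#include <stdint.h>

/* Transpose an 8x8 block in place so that columns become rows. */
static void transpose8x8(uint8_t b[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = r + 1; c < 8; c++) {
            uint8_t tmp = b[r][c];
            b[r][c] = b[c][r];
            b[c][r] = tmp;
        }
}

/* Placeholder for the row-oriented (SIMD-friendly) edge filter. */
static void filter_rows(uint8_t b[8][8]) { (void)b; }

/* Vertical edges: transpose, reuse the row filter, transpose back.
 * The two transposes plus packing/unpacking are pure SIMD overhead. */
void filter_vertical_edge(uint8_t b[8][8])
{
    transpose8x8(b);
    filter_rows(b);
    transpose8x8(b);
}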
Scalable Parallel Programming Applied to H.264/AVC Decoding | 2012
Ben H. H. Juurlink; Mauricio Alvarez-Mesa; Chi Ching Chi; Arnaldo Azevedo; Cor Meenderinck; Alex Ramirez
Existing software applications should be redesigned if programmers want to benefit from the performance offered by multi- and many-core architectures. Performance scalability now depends on the possibility of finding and exploiting enough Thread-Level Parallelism (TLP) in applications for using the increasing numbers of cores on a chip. Video decoding is an example of an application domain with increasing computational requirements every new generation. This is due, on the one hand, to the trend towards high quality video systems (high definition and frame rate, 3D displays, etc) that results in a continuous increase in the amount of data that has to be processed in real-time. On the other hand, there is the requirement to maintain high compression efficiency which is only possible with video codecs like H.264/AVC that use advanced coding techniques. In this book, the parallelization of H.264/AVC decoding is presented as a case study of parallel programming. H.264/AVC decoding is an example of a complex application with many levels of dependencies, different kernels, and irregular data structures. The book presents a detailed methodology for parallelization of this type of applications. It begins with a description of the algorithm, an analysis of the data dependencies, and an evaluation of the different parallelization strategies. Then the design and implementation of a novel parallelization approach is presented that is scalable to many-core architectures. Experimental results on different parallel architectures are discussed in detail. Finally, an outlook is given on parallelization opportunities in the upcoming HEVC standard.
ACM Sigarch Computer Architecture News | 2012
Ben H. H. Juurlink; Cor Meenderinck
Several recent works predict the future of multicore systems or identify scalability bottlenecks based on Amdahl's law. Amdahl's law implicitly assumes, however, that the problem size stays constant, but in most cases more cores are used to solve larger and more complex problems. There is a related law known as Gustafson's law which assumes that the runtime, not the problem size, is constant. In other words, it is assumed that the runtime on p cores is the same as the runtime on 1 core and that the parallel part of an application scales linearly with the number of cores. We apply Gustafson's law to symmetric, asymmetric, and dynamic multicores and show that this leads to fundamentally different results than when Amdahl's law is applied. We also generalize Amdahl's and Gustafson's laws and study how this quantitatively affects the dimensioning of future multicore systems.
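For reference, the textbook formulations assumed in this abstract can be written as follows, with f the parallelizable fraction of the work and p the number of cores (this is the standard form, not the paper's generalized models):

\[
  S_{\mathrm{Amdahl}}(p) \;=\; \frac{1}{(1-f) + f/p},
  \qquad
  S_{\mathrm{Gustafson}}(p) \;=\; (1-f) + f\,p .
\]

Amdahl's speedup is bounded by 1/(1-f) no matter how many cores are added, whereas Gustafson's scaled speedup keeps growing linearly in p, which is why the two laws lead to such different conclusions about multicore dimensioning.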
digital systems design | 2010
Cor Meenderinck; Ben H. H. Juurlink
StarSS is a parallel programming model that eases the task of the programmer. The programmer only has to identify the tasks that can potentially be executed in parallel and the inputs and outputs of these tasks, while the runtime system takes care of the difficult issues of determining inter-task dependencies, synchronization, load balancing, scheduling to optimize data locality, etc. Given these issues, however, the runtime system might become a bottleneck that limits the scalability of the system. The contribution of this paper is two-fold. First, we analyze the scalability of the current software runtime system for several synthetic benchmarks with different dependency patterns and task sizes. We show that for fine-grained tasks the system does not scale beyond five cores. Furthermore, we identify the main scalability bottlenecks of the runtime system. Second, we present the design of Nexus, a hardware support system for StarSS applications, that greatly reduces the task management overhead.
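The kind of bookkeeping that can turn the runtime into a bottleneck is illustrated by the following hedged C sketch. It is not the StarSS runtime itself and every name in it is hypothetical; it only shows the core idea that dependencies can be derived automatically from the declared inputs and outputs of tasks: a task that reads an address depends on the task that last wrote it (only read-after-write is tracked here).

#include <stdio.h>

#define MAX_TASKS 64
#define MAX_ARGS   4

typedef struct {
    int         id;
    const void *in[MAX_ARGS];   /* addresses the task reads  */
    const void *out[MAX_ARGS];  /* addresses the task writes */
    int         n_in, n_out;
    int         deps[MAX_TASKS];
    int         n_deps;
} task_t;

/* Map from tracked address to the task that last wrote it. */
static const void *tracked[MAX_TASKS * MAX_ARGS];
static int         last_writer[MAX_TASKS * MAX_ARGS];
static int         n_tracked;

static int writer_of(const void *addr)
{
    for (int i = 0; i < n_tracked; i++)
        if (tracked[i] == addr) return last_writer[i];
    return -1;                              /* no producer yet */
}

static void record_writer(const void *addr, int task)
{
    for (int i = 0; i < n_tracked; i++)
        if (tracked[i] == addr) { last_writer[i] = task; return; }
    tracked[n_tracked] = addr;
    last_writer[n_tracked++] = task;
}

/* Called when a task is submitted: derive its RAW dependencies. */
static void submit_task(task_t *t)
{
    t->n_deps = 0;
    for (int i = 0; i < t->n_in; i++) {
        int w = writer_of(t->in[i]);
        if (w >= 0) t->deps[t->n_deps++] = w;
    }
    for (int i = 0; i < t->n_out; i++)
        record_writer(t->out[i], t->id);
    printf("task %d has %d dependencies\n", t->id, t->n_deps);
}

int main(void)
{
    int a, b, c;
    task_t t1 = { .id = 1, .out = { &a }, .n_out = 1 };
    task_t t2 = { .id = 2, .in = { &a }, .out = { &b }, .n_in = 1, .n_out = 1 };
    task_t t3 = { .id = 3, .in = { &b }, .out = { &c }, .n_in = 1, .n_out = 1 };
    submit_task(&t1);
    submit_task(&t2);
    submit_task(&t3);
    return 0;
}

For fine-grained tasks this per-task scan and the associated synchronization sit on the critical path in software, which is exactly the overhead that Nexus moves into hardware.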
european conference on parallel processing | 2009
Cor Meenderinck; Ben H. H. Juurlink
The power wall is currently one of the major obstacles computer architecture is facing. In this paper we analyze the impact of the power wall on CMP design. As a case study we model a CMP consisting of Alpha 21264 cores, scaled to future technology nodes according to the ITRS roadmap. When running at the maximum clock frequency, such a CMP would far exceed the power budget. Although power limits performance significantly, technology improvements will still provide performance growth. Amdahl's Law highly threatens this performance growth, but might not be valid for all application domains. In those cases Gustafson's Law could be valid, which is much more optimistic. From our results we derive some principles to prevent CMPs from hitting the power wall.
Archive | 2012
Ben H. H. Juurlink; Mauricio Alvarez-Mesa; Chi Ching Chi; Arnaldo Azevedo; Cor Meenderinck; Alex Ramirez
Before any attempt to parallelize an application can be made, it is necessary to understand the application. Therefore, in this chapter we present a brief overview of the state-of-the-art H.264/AVC video coding standard. The H.264/AVC standard is based on the same hybrid structure as previous standards, but contains several new coding tools that increase the coding efficiency and quality. These new features increase the computational complexity of video encoding as well as decoding, however. Therefore, parallelism is a solution to obtain the performance required for real-time processing. The goal of this chapter is not to provide a detailed overview of H.264/AVC, but to provide sufficient background to be able to understand the remaining chapters.
digital systems design | 2011
Cor Meenderinck; Ben H. H. Juurlink
To improve the programmability of multicores, several task-based programming models have recently been proposed. Inter-task dependencies have to be resolved by either the programmer or a software runtime system, increasing the programming effort or the runtime overhead, respectively. In this paper we therefore propose the Nexus hardware task management support system. Based on the inputs and outputs of tasks, it dynamically detects dependencies between tasks and schedules ready tasks for execution. In addition, it provides fast and scalable synchronization. Experiments show that, compared to a software runtime system, Nexus improves the task throughput by a factor of 54. As a consequence, much finer-grained tasks and/or many more cores can be efficiently employed. For example, for H.264 decoding, which has an average task size of 8.1 µs, Nexus scales up to more than 12 cores, while when using the software approach, the scalability saturates below three cores.
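A back-of-the-envelope model (our simplification, not the paper's analysis) makes the saturation behaviour plausible: if task management is serialized with a per-task cost t_mgmt while tasks take t_task on average, the achievable speedup on p cores is roughly

\[
  S(p) \;\approx\; \min\!\left(p,\ \frac{t_{\mathrm{task}}}{t_{\mathrm{mgmt}}}\right).
\]

Shrinking t_mgmt, as Nexus does in hardware, raises the core count at which the curve flattens, consistent with the reported jump from saturation below three cores to scaling beyond twelve.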
digital systems design | 2012
Cor Meenderinck; Anca Mariana Molnos; Kees Goossens
Systems on a Chip concurrently execute multiple applications that may start and stop at run-time, creating many use-cases. Composability reduces the verification effort by making the functional and temporal behaviours of an application independent of other applications. Existing approaches link applications to static address ranges that cannot be reused between applications that are not simultaneously active, wasting resources. In this paper we propose a composable virtual memory scheme that enables dynamic binding and relocation of applications. Our virtual memory is also predictable, for applications with real-time constraints. We integrated the virtual memory in CompSOC, an existing composable SoC prototyped on FPGA. The implementation indicates that virtual memory is in general expensive, because it incurs a performance loss of around 39% due to address translation latency. On top of this, composability adds an insignificant extra performance penalty to virtual memory, below 1%.
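The mechanism can be illustrated with a small, hypothetical C sketch (not the CompSOC implementation; page size and table size are made up): composable virtual memory gives each application its own page table, so an application can be bound and relocated at run time without its address translation depending on any other application, and the extra lookup on every access is where the reported translation latency comes from.

#include <stdint.h>

#define PAGE_SHIFT    12                 /* 4 KiB pages (assumption)        */
#define PAGES_PER_APP 256                /* private table size (assumption) */

typedef struct {
    uint32_t frame[PAGES_PER_APP];       /* virtual page -> physical frame  */
} app_page_table_t;

/* Translate an application-private virtual address to a physical one.
 * Doing this on every memory access is what makes translation costly. */
uint32_t translate(const app_page_table_t *pt, uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);
    return (pt->frame[vpn % PAGES_PER_APP] << PAGE_SHIFT) | offset;
}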
european conference on parallel processing | 2010
Martijn Briejer; Cor Meenderinck; Ben H. H. Juurlink
Energy-efficient dynamic branch predictors are proposed for the Cell SPE, which normally depends on compiler-inserted hint instructions to predict branches. All designed schemes use a Branch Target Buffer (BTB) to store the branch target address and the prediction, which is computed using a bimodal counter. One prediction scheme predecodes instructions when they are fetched from the local store and accesses the BTB only for branch instructions, thereby saving power compared to conventional dynamic predictors that access the BTB for every instruction. In addition, several ways to leverage the existing hint instructions for the dynamic branch predictor are studied. We also introduce branch warning instructions which initiate branch prediction before the actual branch instruction is fetched. They allow fetching the instructions starting at the branch target and thus completely remove the branch penalty for correctly predicted branches. For a 256-entry BTB, a speedup of up to 18.8% is achieved. The power consumption of the branch prediction schemes is estimated at 1% or less of the total power dissipation of the SPE and the average energy-delay product is reduced by up to 6.2%.
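The basic structure described, a Branch Target Buffer whose entries combine a predicted target address with a 2-bit bimodal counter, can be sketched as follows (a simplified, hypothetical C model, not the SPE hardware designs evaluated in the paper):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BTB_ENTRIES 256                  /* the 256-entry configuration */

typedef struct {
    uint32_t pc;         /* address of the branch (full PC used as tag) */
    uint32_t target;     /* predicted branch target address             */
    uint8_t  counter;    /* 2-bit bimodal counter: >= 2 means taken     */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

/* Lookup: returns true and the target if the branch is predicted taken. */
bool btb_predict(uint32_t pc, uint32_t *target)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->pc == pc && e->counter >= 2) {
        *target = e->target;
        return true;
    }
    return false;                        /* fall through / not taken */
}

/* Update after the branch resolves. */
void btb_update(uint32_t pc, uint32_t target, bool taken)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (!e->valid || e->pc != pc) {      /* miss: allocate on a taken branch */
        if (!taken) return;
        e->valid   = true;
        e->pc      = pc;
        e->target  = target;
        e->counter = 2;                  /* start weakly taken */
        return;
    }
    e->target = target;                  /* keep the latest target */
    if (taken) {
        if (e->counter < 3) e->counter++;
    } else if (e->counter > 0) {
        e->counter--;
    }
}

int main(void)
{
    uint32_t target = 0;
    btb_update(0x1000, 0x2000, true);    /* train on a taken branch */
    if (btb_predict(0x1000, &target))
        printf("predict taken -> 0x%x\n", target);
    return 0;
}

The predecoding and branch-warning ideas in the paper build on this baseline by controlling when the BTB is consulted, rather than changing the lookup/update structure itself.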