Publication


Featured research published by Jan Hoogerbrugge.


Software - Practice and Experience | 1999

A code compression system based on pipelined interpreters

Jan Hoogerbrugge; Lex Augusteijn; Jeroen Trum; Rik van de Wiel

This paper describes a system for compressed code generation. The code of applications is partitioned into time-critical and non-time-critical code. Critical code is compiled to native code, and non-critical code is compiled to a very dense virtual instruction set which is executed on a highly optimized interpreter. The system employs dictionary-based compression by means of superinstructions, which correspond to patterns of frequently used base instructions. The code compression system is designed for the Philips TriMedia VLIW processor. The interpreter is pipelined to achieve a high interpretation speed. The pipeline consists of three stages: fetch, decode, and execute. While one instruction is being executed, the next instruction is decoded, and the next one after that is fetched from memory. On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter: compressed code was found to be about five times more compact than native TriMedia code, at a slowdown of about eight times.
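The dictionary-based compression described above can be sketched as a greedy loop that repeatedly promotes the most frequent adjacent instruction pair to a superinstruction. This is an illustrative simplification, not the paper's actual algorithm; the instruction names and the `max_supers` parameter are invented for the example:

```python
from collections import Counter

def compress(code, max_supers):
    """Greedily replace the most frequent adjacent instruction pair
    with a new superinstruction, up to max_supers dictionary entries."""
    dictionary = {}  # superinstruction -> (a, b) pair it expands to
    for n in range(max_supers):
        pairs = Counter(zip(code, code[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break  # no pair occurs often enough to pay for an entry
        sup = f"sup{n}"
        dictionary[sup] = (a, b)
        out, i = [], 0
        while i < len(code):
            if i + 1 < len(code) and (code[i], code[i + 1]) == (a, b):
                out.append(sup)  # emit the dense superinstruction
                i += 2
            else:
                out.append(code[i])
                i += 1
        code = out
    return code, dictionary
```

The interpreter then carries one handler per superinstruction, so the dictionary trades interpreter size for code density.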


international symposium on vlsi technology systems and applications | 2001

Homogeneous multiprocessing and the future of silicon design paradigms

Paul Stravers; Jan Hoogerbrugge

This paper addresses two challenges of the consumer semiconductor industry: (1) economic and social forces are increasingly reducing the length of product life cycles, and (2) the continuing exponential growth of the on-chip transistor count is pushing design complexity. Together, these two trends represent a formidable challenge for semiconductor companies that aim to benefit from future technological developments in highly competitive markets. The paper derives a relation between on-chip memory real estate and compute logic, suggesting that homogeneous multiprocessors are an unavoidable consequence of the technology curve. A particular approach to homogeneous multiprocessing is then presented that combines scalability with high computational performance and high power efficiency. We also present the implementation of a programming paradigm for homogeneous multiprocessors that focuses on reuse of tested and approved functions at the software level. This enables a shift from today's not-so-successful practice of hardware core reuse to the reuse of functions that have very well defined and uniform interfaces. The time frame for large-scale commercial application of this type of homogeneous multiprocessor architecture is expected to coincide with the arrival of 0.07 micron technology for consumer products, i.e. 2006 and beyond. The paper concludes with a case study of an MPEG2 decoder and how a few simple guidelines can significantly increase the exposed concurrency of the application.


high performance embedded architectures and compilers | 2008

Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez

In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, evaluated application scalability only on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on performance scalability.


high performance embedded architectures and compilers | 2011

A highly scalable parallel implementation of h.264

Arnaldo Azevedo; Ben H. H. Juurlink; Cor Meenderinck; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez; Mateo Valero

Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required the development of a subscription mechanism, in which MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, where the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding.
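The kick-off-list idea above can be modeled as a dependency-driven scheduler: each unfinished MB keeps a list of subscribers, and completing an MB kicks off any subscriber whose dependencies are all satisfied. This is a minimal single-threaded sketch, not the paper's multicore implementation, and it subscribes each MB to every dependency rather than only the last-arriving one:

```python
def schedule_3d_wave(deps):
    """deps: MB -> set of MBs it depends on (intra- or inter-frame).
    Returns a decode order in which every MB starts only after all of
    its dependencies have finished, driven by kick-off lists."""
    remaining = {mb: set(d) for mb, d in deps.items()}
    kickoff = {mb: [] for mb in deps}   # MB -> subscribed MBs to kick off
    for mb, d in remaining.items():
        for dep in d:
            kickoff[dep].append(mb)     # subscribe to each dependency
    ready = [mb for mb, d in remaining.items() if not d]
    order = []
    while ready:
        mb = ready.pop(0)
        order.append(mb)                # "decode" this MB
        for sub in kickoff[mb]:         # kick off the subscribers
            remaining[sub].discard(mb)
            if not remaining[sub]:
                ready.append(sub)
    return order
```

On a real many-core, the `ready` list becomes a work queue shared by the cores, which is where the synchronization overhead analyzed in the paper comes in.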


international conference on parallel architectures and compilation techniques | 2000

Dynamic Branch Prediction for a VLIW Processor

Jan Hoogerbrugge

This paper describes the design of a dynamic branch predictor for a VLIW processor. The developed branch predictor predicts the direction of a branch, i.e., taken or not taken, and in the case of a taken prediction it also predicts the issue-slot that contains the taken branch. This information is used to perform the BTB lookup. We compare this method against a typical superscalar branch predictor and against a branch predictor developed for VLIWs by Intel and HP. For a gshare branch predictor with a 2K-entry BHT and a 512-entry BTB, we obtain a next-PC misprediction rate of 7.83%, while a traditional superscalar-type branch predictor of comparable cost achieves 10.3% and the Intel/HP predictor achieves 9.31%. In addition, we propose to have both predicted and delayed branches in the ISA and let the compiler select which type to apply. Simulations show performance improvements of 2-7% for benchmarks that are well known for their high misprediction rates. This paper also contributes an experiment to determine whether speculative update in the fetch stage and correction of mispredictions is really necessary for VLIWs, instead of updating when branches are resolved. Experiments show that the performance advantage of speculative updating is small.
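The gshare scheme referenced above indexes a table of 2-bit saturating counters with the branch PC XORed with a global history register. A minimal direction-only sketch (the paper's issue-slot prediction and BTB lookup are omitted; sizes are parameters, not the paper's configuration):

```python
class GsharePredictor:
    """Direction predictor: 2-bit saturating counters indexed by
    (PC xor global history), as in the gshare scheme."""
    def __init__(self, bits):
        self.bits = bits
        self.table = [1] * (1 << bits)   # start weakly not-taken
        self.history = 0                 # global branch history register

    def _index(self, pc):
        return (pc ^ self.history) & ((1 << self.bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True means taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the outcome into the global history
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.bits) - 1)
```

The XOR folds path history into the index, which is what lets gshare separate branches that share a PC-only index; the 2-bit counters add hysteresis so one anomalous outcome does not flip the prediction.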


compiler construction | 2000

Pipelined Java Virtual Machine Interpreters

Jan Hoogerbrugge; Lex Augusteijn

The performance of a Java Virtual Machine (JVM) interpreter running on a very long instruction word (VLIW) processor can be improved by means of pipelining. While one bytecode is in its execute stage, the next bytecode is in its decode stage, and the bytecode after that is in its fetch stage. The paper describes how we implemented threading and pipelining by rewriting the interpreter source code and making several modifications to the compiler. Experiments for evaluating the effectiveness of pipelining are described. Pipelining improves the execution speed of a threaded interpreter by 19.4% in terms of instruction count and 14.4% in terms of cycle count. Most of the simple bytecodes, like additions and multiplications, execute in four cycles. This number corresponds to the branch latency of our target VLIW processor, so most of the code of the interpreter is executed in branch delay slots.
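Threaded dispatch means each bytecode handler fetches and decodes the next bytecode itself and jumps straight to its handler, instead of returning to a central switch loop. Python cannot express the branch delay slots the paper exploits, so this sketch models only the dispatch structure, with an invented three-opcode stack machine:

```python
def run_threaded(bytecode):
    """Token-threaded dispatch sketch: every handler ends by
    dispatching directly to the next bytecode's handler."""
    stack = []

    def dispatch(pc):
        return handlers[bytecode[pc]](pc)

    def op_push(pc):                    # push immediate operand
        stack.append(bytecode[pc + 1])
        return dispatch(pc + 2)

    def op_add(pc):                     # pop two, push sum
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return dispatch(pc + 1)

    def op_halt(pc):                    # result is top of stack
        return stack.pop()

    handlers = {"push": op_push, "add": op_add, "halt": op_halt}
    return dispatch(0)
```

In the real interpreter the indirect jump at the end of each handler is exactly the branch whose delay slots hide the fetch and decode of the following bytecode, which is how the pipeline stages overlap.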


Archive | 2005

Cache-Coherent Heterogeneous Multiprocessing as Basis for Streaming Applications

Jos van Eijndhoven; Jan Hoogerbrugge; M.N. Jayram; Paul Stravers; Andrei Terechko

Systems-on-Chip (SoC) of the new generation will be extremely complex devices, composed from complex subsystems and relying on abstraction from implementation details. These chips will support the execution of a mix of concurrent applications that are not known in detail at chip design time. These SoCs require a significant degree of programmability to configure both the set of functions that must execute and the structure of the dataflow between these functions. To ease the programming effort, multiprocessor computers have employed cache-coherent shared memory for decades; it has proven to be an indispensable abstraction from system complexity issues, such as multiple processors and memory hierarchies, for the application programmer. This chapter describes a next-generation SoC for the consumer electronics domain (e.g. audio/video, vision, robotics). It features heterogeneous multiprocessor subsystems with a snooping cache coherence protocol, combined in a system with distributed memory employing a directory coherency protocol. It is explained why and how the coherent memory model is indispensable for implementing both data transport and synchronization for multi-tasking streaming applications in distributed memory systems.


field programmable logic and applications | 2000

Compiling Applications for ConCISe: An Example of Automatic HW/SW Partitioning and Synthesis

Bernardo Kastrup; Jjc Jeroen Trum; Orlando Moreira; Jan Hoogerbrugge; J Jef van Meerbergen

In the ConCISe project, an embedded programmable processor is augmented with a Reconfigurable Functional Unit (RFU) based on Field-Programmable Logic (FPL), in a technique that aims at being cost-effective for high-volume production. The target domain is embedded encryption. In this paper, we focus on ConCISe's programming tool-set. A smart assembler, capable of automatically performing HW/SW partitioning and HW synthesis, generates the custom operations that are implemented in the RFU. Benchmarks carried out with ConCISe's simulators show that the RFU may speed up off-the-shelf encryption applications by as much as 50%, for a modest investment in silicon, and with no changes in the traditional application programming flow.


european conference on parallel processing | 2000

Cost-Efficient Branch Target Buffers

Jan Hoogerbrugge

Branch target buffers (BTBs) are caches in which branch information is stored for use by the branch predictor in the fetch stage of the instruction pipeline. A typical BTB requires a few kilobytes of storage, which makes it rather large and, because it is accessed every cycle, rather power consuming. Partial resolution has in the past been proposed to reduce the size of a BTB: a partial-resolution BTB does not store all the tag bits that would be required for an exact lookup. The result is a smaller BTB at the price of slightly less accurate branch prediction. This paper proposes to make use of branch locality to reduce the size of a BTB. Short-distance branches, which are more frequent, need fewer BTB bits than long-distance branches. Two BTB organisations are presented that use branch locality. Simulation results are given that demonstrate the effectiveness of the described techniques.
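One way to read the branch-locality idea is a BTB with two kinds of entries: cheap ones that store a short signed offset from the branch PC for the common short-distance branches, and full-width ones for the rare long-distance branches. This is an illustrative sketch under that assumption, not one of the paper's two organisations; the 8-bit offset width and direct-mapped layout are invented:

```python
class LocalityBTB:
    """Sketch of a BTB exploiting branch locality: short-distance
    branches are stored as small signed offsets, long-distance
    branches in a separate full-width table."""
    def __init__(self, sets, offset_bits=8):
        self.sets = sets
        self.offset_bits = offset_bits
        self.short = {}   # index -> (tag, offset): few bits per entry
        self.long = {}    # index -> (tag, target): full target address

    def _split(self, pc):
        return pc % self.sets, pc // self.sets

    def insert(self, pc, target):
        index, tag = self._split(pc)
        offset = target - pc
        lo = -(1 << (self.offset_bits - 1))
        hi = (1 << (self.offset_bits - 1)) - 1
        if lo <= offset <= hi:
            self.short[index] = (tag, offset)   # common, cheap entry
        else:
            self.long[index] = (tag, target)    # rare, expensive entry

    def lookup(self, pc):
        index, tag = self._split(pc)
        if index in self.short and self.short[index][0] == tag:
            return pc + self.short[index][1]
        if index in self.long and self.long[index][0] == tag:
            return self.long[index][1]
        return None   # miss: fall back to not-taken / sequential fetch
```

Because short-distance branches dominate, most entries land in the narrow table, which is where the storage saving comes from.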


Archive | 2000

System for executing computer program using a configurable functional unit, included in a processor, for executing configurable instructions having an effect that are redefined at run-time

Bernardo De Oliveira Kastrup Pereira; Adrianus Josephus Bink; Jan Hoogerbrugge
