Arnaldo Azevedo
Delft University of Technology
Publication
Featured research published by Arnaldo Azevedo.
Signal Processing Systems | 2009
Cor Meenderinck; Arnaldo Azevedo; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Alex Ramirez
An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show which types of parallelism it allows to be exploited. We also show that previously proposed parallelization strategies, such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism, are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range, we propose a new parallelization strategy, called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
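As a rough illustration of the dependency structure described in this abstract, the sketch below (my own simplification, not the authors' code) tests whether a macroblock is ready to decode: the intra-frame 2D-Wave dependencies on the left and upper-right neighbours, plus an inter-frame condition that approximates the "limited spatial range" of motion vectors with an assumed worst-case window MAX_MV.

```c
/* Hedged sketch of a 3D-Wave-style readiness check; MAX_MV is an assumed
 * worst-case motion-vector range in macroblocks, not a value from the paper. */
#include <stdbool.h>

#define MAX_MV 2   /* assumed inter-frame dependency window, in MBs */

typedef struct {
    int width_mbs, height_mbs;
    bool *done;                 /* done[y * width_mbs + x] per macroblock */
} frame_status_t;

static bool mb_done(const frame_status_t *f, int x, int y)
{
    if (x < 0 || y < 0) return true;                 /* outside picture: no dependency */
    if (x >= f->width_mbs || y >= f->height_mbs) return true;
    return f->done[y * f->width_mbs + x];
}

/* True when MB (x,y) of the current frame may start decoding. */
bool mb_ready(const frame_status_t *cur, const frame_status_t *ref, int x, int y)
{
    /* intra-frame wavefront dependencies */
    if (!mb_done(cur, x - 1, y) || !mb_done(cur, x + 1, y - 1))
        return false;
    /* inter-frame dependency: the reference area around (x,y) must be decoded */
    if (ref) {
        for (int ry = y - MAX_MV; ry <= y + MAX_MV; ry++)
            for (int rx = x - MAX_MV; rx <= x + MAX_MV; rx++)
                if (!mb_done(ref, rx, ry))
                    return false;
    }
    return true;
}
```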
ACM Sigbed Review | 2013
Kees Goossens; Arnaldo Azevedo; Karthik Chandrasekar; Manil Dev Gomony; Sven Goossens; Martijn Koedam; Yonghui Li; Davit Mirzoyan; Anca Mariana Molnos; Ashkan Beyranvand Nejad; Andrew Nelson; Shubhendu Sinha
Systems on chip (SOC) contain multiple concurrent applications with different time criticality (firm, soft, non-real-time). As a result, they are often developed by different teams or companies, with different models of computation (MOC) such as dataflow, Kahn process networks (KPN), or time-triggered (TT). SOC functionality and (real-time) performance are verified after all applications have been integrated. In this paper we propose the CompSOC platform and design flows, which offer a virtual execution platform per application to allow independent design, verification, and execution. We introduce the composability and predictability concepts, explain why they help, and describe how they are implemented in the different resources of the CompSOC architecture. We define a design flow that allows real-time cyclo-static dataflow (CSDF) applications to be automatically mapped, verified, and executed. Mapping and analysis of KPN and TT applications are not automated, but they do run composably in their allocated virtual platforms. Although most of the techniques used here have been published in isolation, this paper is the first comprehensive overview of the CompSOC approach. Moreover, three new case studies illustrate all claimed benefits: 1) an example firm-real-time CSDF H.263 decoder is automatically mapped and verified; 2) applications with different models of computation (CSDF and TT) run composably; 3) adaptive soft-real-time applications execute composably and can hence be verified independently by simulation.
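For readers unfamiliar with the cyclo-static dataflow model mentioned above, the following tiny sketch (mine, not CompSOC code) shows the CSDF firing rule: an actor cycles through phases, and each phase consumes and produces a fixed number of tokens, which is what makes worst-case timing analyzable before the application runs on the platform.

```c
/* Minimal CSDF actor sketch; phase counts and rates are illustrative only. */
#include <stdbool.h>

#define PHASES 3

typedef struct {
    int phase;
    int consume[PHASES];   /* tokens consumed from the input FIFO per phase */
    int produce[PHASES];   /* tokens produced on the output FIFO per phase  */
} csdf_actor_t;

bool actor_can_fire(const csdf_actor_t *a, int in_tokens, int out_space)
{
    return in_tokens >= a->consume[a->phase] &&
           out_space >= a->produce[a->phase];
}

void actor_fire(csdf_actor_t *a, int *in_tokens, int *out_tokens)
{
    *in_tokens  -= a->consume[a->phase];
    *out_tokens += a->produce[a->phase];
    a->phase = (a->phase + 1) % PHASES;   /* advance to the next phase in the cycle */
}
```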
High Performance Embedded Architectures and Compilers | 2008
Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez
In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, investigate application scalability on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.
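A minimal sketch of the kind of "frames in flight" policy this abstract describes is shown below (the interface and the limit are assumptions of mine, not the paper's implementation): a new frame is only admitted into the decoding pipeline when fewer than a fixed number of frames are still being decoded, which bounds memory use at the cost of some parallelism.

```c
/* Hedged sketch of a frames-in-flight gate; MAX_FRAMES_IN_FLIGHT is assumed. */
#include <pthread.h>

#define MAX_FRAMES_IN_FLIGHT 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  slot = PTHREAD_COND_INITIALIZER;
static int frames_in_flight = 0;

void frame_decode_begin(void)   /* call before starting to decode a new frame */
{
    pthread_mutex_lock(&lock);
    while (frames_in_flight >= MAX_FRAMES_IN_FLIGHT)
        pthread_cond_wait(&slot, &lock);
    frames_in_flight++;
    pthread_mutex_unlock(&lock);
}

void frame_decode_end(void)     /* call when a frame is fully decoded */
{
    pthread_mutex_lock(&lock);
    frames_in_flight--;
    pthread_cond_signal(&slot);
    pthread_mutex_unlock(&lock);
}
```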
International Symposium on Microarchitecture | 2010
Alex Ramirez; Felipe Cabarcas; Ben H. H. Juurlink; Mauricio Alvarez Mesa; Friman Sánchez; Arnaldo Azevedo; Cor Meenderinck; Catalin Bogdan Ciobanu; Sebastian Isaza; Georgi Gaydadjiev
The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.
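The double-buffering loop below illustrates how user-managed DMA lets transfer and computation overlap, as described above. It is only a sketch: dma_get_async() and dma_wait() are hypothetical placeholders for such an engine's interface, not SARC's actual runtime API.

```c
/* Illustrative double buffering: compute on one tile while the DMA engine
 * fills the other.  The DMA calls are assumed placeholders, not a real API. */
#include <stddef.h>

#define TILE 4096

extern void dma_get_async(void *dst, const void *src, size_t n, int tag); /* assumed */
extern void dma_wait(int tag);                                            /* assumed */
extern void compute(const char *tile, size_t n);

void process_stream(const char *src, size_t total)
{
    static char buf[2][TILE];
    int cur = 0;
    size_t done = 0, chunk = total < TILE ? total : TILE;

    dma_get_async(buf[cur], src, chunk, cur);            /* prefetch first tile */
    while (done < total) {
        size_t next_off = done + chunk, next_len = 0;
        if (next_off < total) {                          /* start next transfer early */
            next_len = (total - next_off) < TILE ? (total - next_off) : TILE;
            dma_get_async(buf[cur ^ 1], src + next_off, next_len, cur ^ 1);
        }
        dma_wait(cur);                                   /* wait for the current tile */
        compute(buf[cur], chunk);
        done += chunk;
        chunk = next_len;
        cur ^= 1;
    }
}
```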
International Conference on Parallel and Distributed Systems | 2009
Mauricio Alvarez Mesa; Alex Ramirez; Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Mateo Valero
This paper investigates the scalability of macroblock (MB) level parallelization of the H.264 decoder for High Definition (HD) applications. The study consists of three parts. First, we present a formal model for predicting the maximum performance that can be obtained, taking into account the variable processing time of tasks and thread synchronization overhead. Second, we describe an implementation on a real multiprocessor architecture, including a comparison of different scheduling strategies and a profiling analysis to identify the performance bottlenecks. Finally, a trace-driven simulation methodology is used to identify the opportunities for acceleration that remove the main bottlenecks, including the acceleration potential of the entropy decoding stage and of thread synchronization and scheduling. Our study presents a quantitative analysis of the main bottlenecks of the application and estimates the acceleration levels that are required to make the MB-level parallel decoder scalable.
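To give a feel for the kind of bound such a model refines, the snippet below computes the idealized 2D-Wave numbers (my back-of-the-envelope simplification, assuming identical MB decode times and zero synchronization overhead, which is exactly what the paper's model does not assume): with the left and upper-right dependencies, MB (x,y) can start at slot x + 2y, so a w x h MB frame needs w + 2(h-1) slots and at most min(ceil(w/2), h) MBs are decodable in the same slot.

```c
/* Idealized 2D-Wave bounds for a frame of w x h macroblocks (sketch only). */
#include <stdio.h>

static void wave_bounds(int w, int h)
{
    int slots   = w + 2 * (h - 1);                       /* minimum number of time slots */
    int max_par = (w + 1) / 2 < h ? (w + 1) / 2 : h;     /* peak MB-level parallelism    */
    printf("%dx%d MBs: %d slots, max %d MBs in parallel, ideal speedup %.1f\n",
           w, h, slots, max_par, (double)(w * h) / slots);
}

int main(void)
{
    wave_bounds(120, 68);   /* Full HD (1920x1088) in 16x16 macroblocks */
    return 0;
}
```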
High Performance Embedded Architectures and Compilers | 2011
Arnaldo Azevedo; Ben H. H. Juurlink; Cor Meenderinck; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez; Mateo Valero
Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required the development of a subscription mechanism, in which MBs are subscribed to the kick-off lists associated with their reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, whereas the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented; the results show that these policies combat the memory and latency issues with a negligible effect on performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding.
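The sketch below shows one plausible shape of such a kick-off-list subscription scheme (data layout and function names are mine, not the paper's): a macroblock with unmet dependencies appends itself to the kick-off list of the MB it waits for, and when that MB completes, every subscriber whose pending count drops to zero is pushed onto the ready queue for a worker core.

```c
/* Hedged sketch of a kick-off-list subscription mechanism; locking between
 * the done check and the subscription is omitted for brevity. */
#include <stdbool.h>

#define MAX_SUBS 8

typedef struct mb {
    int  x, y, frame;
    int  pending;                 /* unmet dependencies */
    bool done;
    struct mb *kickoff[MAX_SUBS]; /* MBs waiting on this one */
    int  nsubs;
} mb_t;

extern void ready_queue_push(mb_t *mb);   /* assumed scheduler hook */

/* Called by the decoder of 'waiter' when it finds 'dep' not yet decoded. */
void mb_subscribe(mb_t *dep, mb_t *waiter)
{
    if (dep->done) return;                /* dependency already satisfied */
    waiter->pending++;
    dep->kickoff[dep->nsubs++] = waiter;
}

/* Called when a macroblock finishes decoding. */
void mb_complete(mb_t *mb)
{
    mb->done = true;
    for (int i = 0; i < mb->nsubs; i++) {
        mb_t *w = mb->kickoff[i];
        if (--w->pending == 0)
            ready_queue_push(w);          /* all dependencies met: schedule it */
    }
    mb->nsubs = 0;
}
```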
International Symposium on Circuits and Systems | 2008
Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Mauricio Alvarez; Alex Ramirez
In this paper an analysis of bi-dimensional video filtering on the Cell Broadband Engine processor is presented. To evaluate the processor, a highly adaptive filtering algorithm was chosen: the deblocking filter of the H.264 video compression standard. The baseline version is a scalar implementation extracted from the FFmpeg H.264 decoder. The scalar version was vectorized using the SIMD instructions of the Cell Synergistic Processing Element (SPE) and with AltiVec instructions for the Power Processor Element. Results show that approximately one third of the processing time of the SPE SIMD version is spent on transposition and on data packing and unpacking. Despite the required SIMD overhead and the high adaptivity of the kernel, the SIMD version of the kernel is 2.6 times faster than the scalar versions.
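The scalar sketch below (my illustration, not the SPE intrinsics the paper evaluates) shows why transposition shows up in the SIMD profile: vector instructions filter whole rows of pixels at once, so a vertical edge is handled by transposing the block, running the horizontal-edge filter, and transposing back. On the Cell SPE those transposes are built from shuffle instructions, which is where the overhead mentioned above comes from.

```c
/* Illustrative only: vertical-edge filtering via transpose + horizontal filter. */
#include <stdint.h>

void transpose8x8(uint8_t dst[8][8], const uint8_t src[8][8])
{
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            dst[c][r] = src[r][c];
}

extern void filter_horizontal_edge_simd(uint8_t blk[8][8], int strength); /* assumed */

void filter_vertical_edge(uint8_t blk[8][8], int strength)
{
    uint8_t t[8][8];
    transpose8x8(t, blk);                      /* columns become rows */
    filter_horizontal_edge_simd(t, strength);  /* reuse the row-oriented filter */
    transpose8x8(blk, t);                      /* restore original orientation */
}
```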
International Journal of Embedded and Real-time Communication Systems | 2010
Arnaldo Azevedo; Ben H. H. Juurlink
In many kernels of multimedia applications, the working set is predictable, making it possible to schedule the data transfers before the computation. Many other kernels, however, process data that is known only just before it is needed, or have working sets that do not fit in the scratchpad memory. Furthermore, multimedia kernels often access two- or higher-dimensional data structures, and conventional software caches have difficulty exploiting the data locality exhibited by these kernels. For such kernels, the authors present a Multidimensional Software Cache (MDSC), which stores 1- to 4-dimensional blocks to mimic in cache the organization of the data structure. Furthermore, it indexes the cache using matrix indices rather than linear memory addresses. The MDSC also makes use of the lower overhead of Direct Memory Access (DMA) list transfers and allows exploiting known data access patterns to reduce the number of accesses to the cache. The MDSC is evaluated using GLCM, providing an 8% performance improvement compared to the IBM software cache. For MC, several optimizations are presented that reduce the number of accesses to the MDSC.
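A hedged sketch of a 2-D variant of this idea follows (the layout, sizes, and names are mine, not the article's implementation): the cache is indexed with matrix indices (row, col) rather than linear addresses, each entry holds a small rectangular block of the matrix, and a miss fetches the whole block with a single DMA transfer.

```c
/* Sketch of a direct-mapped 2-D software cache indexed by matrix indices. */
#include <stdint.h>

#define BH 16           /* assumed block height */
#define BW 16           /* assumed block width  */
#define NSETS 64        /* assumed number of cache entries */

typedef struct {
    int      tag_br, tag_bc;    /* block-row / block-column currently cached */
    int      valid;
    uint8_t  data[BH][BW];
} mdsc_entry_t;

static mdsc_entry_t cache[NSETS];

/* Assumed helper that DMAs a BH x BW block starting at (row0, col0). */
extern void dma_get_block(uint8_t dst[BH][BW], const uint8_t *matrix,
                          int stride, int row0, int col0);

uint8_t mdsc_read(const uint8_t *matrix, int stride, int row, int col)
{
    int br = row / BH, bc = col / BW;
    mdsc_entry_t *e = &cache[(br * 31 + bc) % NSETS];    /* simple 2-D hash */

    if (!e->valid || e->tag_br != br || e->tag_bc != bc) {
        dma_get_block(e->data, matrix, stride, br * BH, bc * BW);  /* miss: fetch block */
        e->tag_br = br;
        e->tag_bc = bc;
        e->valid  = 1;
    }
    return e->data[row % BH][col % BW];                  /* hit path */
}
```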
International Symposium on System-on-Chip | 2009
Arnaldo Azevedo; Ben H. H. Juurlink
This paper presents an efficient software cache implementation for H.264 Motion Compensation on scratchpad-memory-based systems. For a wide range of applications, especially multimedia applications, the data set is predictable, making it possible to transfer the necessary data before the computation. Some kernels, however, depend on data that are known only just before they are needed, such as H.264 Motion Compensation (MC). MC has to stall while the data is transferred from the main memory. To overcome this problem and increase the performance, we analyze the data locality of MC. Based on this analysis, we propose a 2D Software Cache (2DSC) implementation. The 2DSC exploits the application characteristics to reduce overheads, providing on average a 65% improvement over hand-programmed DMA transfers.
Rapid System Prototyping | 2007
Vagner S. Rosa; Wagston Tassoni Staehler; Arnaldo Azevedo; Bruno Zatt; Roger Endrigo Carvalho Porto; Luciano Volcan Agostini; Sergio Bampi; Altamiro Amadeu Susin
This paper presents the prototyping strategy used to validate the designed modules of a main-profile H.264/AVC video decoder targeting 1080p HDTV resolution, implemented on an FPGA. All designed modules were completely described in VHDL and further validated through simulations. The post place-and-route synthesis results indicate that the designed architectures are able to reach real time when processing HDTV 1080p frames (1080×1920). The architectures were prototyped using a Digilent XUP V2P board containing a Xilinx Virtex-II Pro XC2VP30 FPGA. The prototyping strategy used an embedded PowerPC and associated logic and buffering to control the modules under prototyping. A host computer, running the reference software, was used to generate the input stimuli and to compare the results through an RS-232 serial interface.