Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Andrei Terechko is active.

Publication


Featured research published by Andrei Terechko.


High Performance Embedded Architectures and Compilers | 2008

Parallel H.264 Decoding on an Embedded Multicore Processor

Arnaldo Azevedo; Cor Meenderinck; Ben H. H. Juurlink; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez

In previous work, the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. Those earlier results, however, were obtained on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that these policies combat the memory and latency issues with a negligible effect on performance scalability.
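The dependency rule behind the 3D-Wave strategy can be illustrated with a few lines of C. The sketch below is not the TM3270 implementation; it assumes a hypothetical mb_done bookkeeping array and a made-up maximum motion-vector reach of MAX_MV_MBS macroblocks, and merely checks that a macroblock's intra-frame neighbours and the reference-frame area it can point into are already decoded.

```c
/* Minimal sketch (not the TM3270 decoder) of the 3D-Wave readiness test:
 * a macroblock (MB) may start once its intra-frame neighbours are decoded
 * and the reference area it can point into -- bounded by a maximum
 * motion-vector range -- is decoded in the reference frame.
 * All names (mb_done, MAX_MV_MBS, ...) are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define MB_W 120            /* macroblocks per row (e.g. 1920/16)    */
#define MB_H 68             /* macroblock rows     (e.g. 1088/16)    */
#define MAX_MV_MBS 2        /* assumed max motion-vector reach, MBs  */

/* mb_done[frame][y][x] == true when that MB is fully decoded */
static bool mb_done[3][MB_H][MB_W];

static bool done(int f, int y, int x)
{
    if (x < 0 || y < 0 || x >= MB_W || y >= MB_H)
        return true;                 /* outside the picture: no dependency */
    return mb_done[f][y][x];
}

/* Can MB (x, y) of frame f start, given reference frame ref? */
static bool mb_ready(int f, int ref, int y, int x)
{
    /* intra-frame wavefront: left, top and top-right neighbours */
    if (!done(f, y, x - 1) || !done(f, y - 1, x) || !done(f, y - 1, x + 1))
        return false;

    /* inter-frame: the reference area reachable by a motion vector */
    for (int dy = -MAX_MV_MBS; dy <= MAX_MV_MBS; dy++)
        for (int dx = -MAX_MV_MBS; dx <= MAX_MV_MBS; dx++)
            if (!done(ref, y + dy, x + dx))
                return false;
    return true;
}

int main(void)
{
    /* pretend the whole reference frame (index 1) is already decoded */
    for (int y = 0; y < MB_H; y++)
        for (int x = 0; x < MB_W; x++)
            mb_done[1][y][x] = true;

    printf("MB (0,0) of frame 0 ready: %d\n", mb_ready(0, 1, 0, 0)); /* 1 */
    printf("MB (1,0) of frame 0 ready: %d\n", mb_ready(0, 1, 0, 1)); /* 0: left MB pending */
    return 0;
}
```

In practice such a test is not polled per macroblock; the journal version of this work (listed below) describes a subscription mechanism with kick-off lists that serves the same purpose.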


High Performance Embedded Architectures and Compilers | 2008

A Hardware Task Scheduler for Embedded Video Processing

Ghiath Al-Kadi; Andrei Terechko

Modern embedded Systems-on-a-Chip deploy multiple programmable cores to meet the increasing performance requirements of video, graphics, and modem applications. However, software implementations of task scheduling and inter-task synchronization often limit the performance improvements of multicores. Remarkably, several demanding video applications (e.g. H.264 video decoding) rely on task dependency graphs that can be constructed from a simple dependency pattern. Based on such a pattern, our novel hardware task scheduler can quickly create, order, synchronize, and map tasks to cores. We found that our hardware task scheduler speeds up Quad HD H.264 video decoding by 1.17 times compared to a chip multiprocessor with state-of-the-art hardware task queues. Moreover, our hardware task scheduler allows the number of cores needed to meet the real-time performance requirements of the H.264 decoder to be decreased and, consequently, reduces the silicon area of the multicore by up to 12.5%.
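A software approximation of the scheduler's core idea is sketched below; the identifiers are hypothetical and the hardware interface is, of course, not reproduced. Because every task (x, y) follows the same fixed neighbour pattern (here "left" and "top-right", similar to macroblock-level H.264 decoding), per-task dependency counters can be derived directly from the pattern rather than from an explicit task graph, and a task becomes ready the moment its counter reaches zero.

```c
/* Minimal sketch (illustrative, not the hardware design) of a
 * pattern-based task scheduler for a W x H grid of tasks. */
#include <stdio.h>

#define W 8
#define H 6

static int deps[H][W];          /* outstanding dependencies per task   */
static int ready[W * H];        /* ready queue (indices y * W + x)     */
static int head, tail;

static void release(int y, int x)       /* one dependency satisfied */
{
    if (--deps[y][x] == 0)
        ready[tail++] = y * W + x;
}

int main(void)
{
    /* derive the counters from the fixed pattern: left and top-right */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            deps[y][x] = (x > 0) + (y > 0 && x + 1 < W);
            if (deps[y][x] == 0)
                ready[tail++] = y * W + x;   /* initially ready */
        }

    while (head < tail) {                    /* "execute" tasks in order */
        int t = ready[head++], y = t / W, x = t % W;
        printf("run task (%d,%d)\n", x, y);
        /* completion of (x, y) satisfies the pattern's successors */
        if (x + 1 < W)          release(y, x + 1);      /* its left dep      */
        if (y + 1 < H && x > 0) release(y + 1, x - 1);  /* its top-right dep */
    }
    return 0;
}
```

Running the sketch prints all 48 tasks in a valid wavefront order without ever building a task graph.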


Digital Systems Design | 2008

A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures

Magnus Själander; Andrei Terechko; Marc Duranton

Efficient utilization of multi-core architectures relies on partitioning applications into tasks and mapping the tasks to cores. In some applications (e.g. H.264 video decoding parallelized at the macroblock level) these tasks have dependencies among each other. Task scheduling, which consists of selecting a task with satisfied dependencies and mapping it to a core, is typically a functionality delegated to the operating system. In this paper we present a hardware Task Management Unit (TMU) that looks ahead in time to find tasks to be executed by a multi-core architecture. The look-ahead functionality is shown to reduce the task management overhead by 40-50% when executing a parallelized version of an H.264 video decoder on an architecture with up to 16 cores. Overall, the TMU-based multi-core architecture reaches a speedup of more than 14 times on 16 cores running H.264 video decoding, assuming CABAC is implemented in a dedicated coprocessor.
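The look-ahead idea can be pictured with the toy dispatcher below (illustrative only, not the TMU hardware; names such as next_ready_task are hypothetical): while a core works on its current task, the task manager already reserves the core's next ready task, so dispatch latency overlaps with useful work instead of adding to it.

```c
/* Minimal sketch of look-ahead task dispatch with one prefetched
 * "next" slot per core. A task here is just an integer id. */
#include <stdio.h>

#define NCORES 4
#define NTASKS 16

typedef int task_t;

static task_t pending = 0;               /* next unscheduled task id      */

static task_t next_ready_task(void)      /* stand-in for dependency checks */
{
    return (pending < NTASKS) ? pending++ : -1;
}

int main(void)
{
    task_t current[NCORES], next[NCORES];

    for (int c = 0; c < NCORES; c++) {   /* prime both slots per core     */
        current[c] = next_ready_task();
        next[c]    = next_ready_task();  /* the look-ahead slot           */
    }

    int busy = NCORES;
    while (busy > 0) {
        busy = 0;
        for (int c = 0; c < NCORES; c++) {
            if (current[c] < 0)
                continue;
            printf("core %d runs task %d\n", c, current[c]);
            current[c] = next[c];        /* zero-latency switch: the task
                                            was fetched while the core worked */
            next[c] = next_ready_task(); /* look ahead for the next one   */
            if (current[c] >= 0)
                busy++;
        }
    }
    return 0;
}
```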


High Performance Embedded Architectures and Compilers | 2011

A highly scalable parallel implementation of H.264

Arnaldo Azevedo; Ben H. H. Juurlink; Cor Meenderinck; Andrei Terechko; Jan Hoogerbrugge; Mauricio Alvarez; Alex Ramirez; Mateo Valero

Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required the development of a subscription mechanism, in which MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, where the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase, since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat the memory and latency issues with a negligible effect on performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding.
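The subscription mechanism mentioned above can be sketched generically as follows (the paper's actual data structures are not reproduced and all identifiers are hypothetical): an MB with an unfinished dependency subscribes itself to that MB's kick-off list, and a finishing MB wakes its subscribers, which become ready once their last missing dependency is satisfied.

```c
/* Generic sketch of a kick-off-list subscription mechanism. */
#include <stdio.h>

#define NMB      16              /* macroblocks, flattened to one index */
#define MAX_SUBS  8              /* max subscribers per kick-off list   */

static int kick_off[NMB][MAX_SUBS];  /* subscriber ids per MB           */
static int nsubs[NMB];               /* subscribers currently on a list */
static int missing[NMB];             /* unmet dependencies per MB       */
static int done_flag[NMB];

static void subscribe(int waiter, int producer)
{
    if (done_flag[producer])
        return;                       /* dependency already satisfied   */
    kick_off[producer][nsubs[producer]++] = waiter;
    missing[waiter]++;
}

static void finish(int mb)            /* called when mb is fully decoded */
{
    done_flag[mb] = 1;
    for (int i = 0; i < nsubs[mb]; i++) {
        int w = kick_off[mb][i];
        if (--missing[w] == 0)
            printf("MB %d is now ready\n", w);
    }
    nsubs[mb] = 0;
}

int main(void)
{
    subscribe(5, 1);                  /* MB 5 waits for MBs 1 and 2 ...  */
    subscribe(5, 2);
    finish(1);                        /* ... so it becomes ready only    */
    finish(2);                        /* after both have finished        */
    return 0;
}
```

Compared with polling a readiness test, the waiting MB is touched exactly once per satisfied dependency, which is what makes the approach attractive when many frames are in flight.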


High Performance Embedded Architectures and Compilers | 2011

A multithreaded multicore system for embedded media processing

Jan Hoogerbrugge; Andrei Terechko

We describe a multicore system targeting media processing applications in which the cores are multithreaded. The multithreaded cores use a new type of multithreading that we call Subset Static Interleaved (SSI) multithreading. SSI multithreading combines the advantages of blocked multithreading and a simple form of interleaved multithreading called static interleaved multithreading. It divides threads into foreground and background threads and performs static interleaving among the foreground threads. A foreground thread is swapped with a runnable background thread whenever the foreground thread stalls. SSI multithreading achieves reduced operation latencies, memory latency tolerance, fast context switching, and, compared to traditional dynamic interleaving, a relatively low design complexity of the register file. We use a task scheduling unit (TSU) to dispatch tasks to the cores. The TSU is aware of the fact that the cores are multithreaded, which makes a more efficient mapping of tasks to cores possible by scheduling tasks on the least loaded cores. We evaluate the system on an optimized Super HD H.264 decoder in which the macroblock decoding and deblocking have been parallelized. The complexity of the H.264 standard and the high resolution make this a challenging and performance-demanding application. We achieve speedups of up to 17.7 times for 16 cores with four threads per core relative to a single-threaded single core. Furthermore, the proposed SSI multithreading achieves a speedup of 1.52 times relative to no multithreading, whereas blocked multithreading achieves only 1.38 times and a restricted form of interleaved multithreading only 1.37 times.
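A minimal sketch of the SSI selection policy described above is given below; register-file organisation and pipeline details are omitted and all identifiers are hypothetical. Foreground threads are issued in a fixed round-robin order, and a stalled foreground thread is swapped with a runnable background thread.

```c
/* Minimal sketch of Subset Static Interleaved (SSI) thread selection. */
#include <stdbool.h>
#include <stdio.h>

#define NFG 2                      /* foreground (interleaved) threads  */
#define NBG 2                      /* background (spare) threads        */

typedef struct {
    int  id;
    bool stalled;                  /* e.g. waiting on a cache miss      */
} thread_t;

static thread_t fg[NFG] = { {0, false}, {1, true } };
static thread_t bg[NBG] = { {2, false}, {3, false} };

/* Pick the thread to issue from this cycle: static interleaving over
 * the foreground set, swapping in a background thread on a stall.     */
static int select_thread(int cycle)
{
    int slot = cycle % NFG;                    /* static interleaving   */
    if (fg[slot].stalled) {
        for (int b = 0; b < NBG; b++) {
            if (!bg[b].stalled) {              /* context switch: swap  */
                thread_t tmp = fg[slot];       /* stalled fg thread out,*/
                fg[slot] = bg[b];              /* runnable bg thread in */
                bg[b] = tmp;
                break;
            }
        }
    }
    return fg[slot].id;            /* may still stall if no bg runnable */
}

int main(void)
{
    for (int cycle = 0; cycle < 4; cycle++)
        printf("cycle %d: issue from thread %d\n", cycle, select_thread(cycle));
    return 0;
}
```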


ACM Transactions on Architecture and Code Optimization | 2007

Inter-cluster communication in VLIW architectures

Andrei Terechko; Henk Corporaal

The traditional VLIW (very long instruction word) architecture with a single register file does not scale up well to address the growing performance demands on embedded media processors. However, splitting a VLIW processor into smaller clusters, each comprised of function units fully connected to local register files, can significantly improve the VLSI implementation characteristics of the processor, such as speed, energy consumption, and area. In this paper we show that achieving the best characteristics of a clustered VLIW requires a thorough selection of the Inter-cluster Communication (ICC) model, that is, the way clustering is exposed in the Instruction Set Architecture. For our study we first define a taxonomy of ICC models, including copy operations, dedicated issue slots, extended operands, extended results, and multicast. Evaluating the execution time of the models requires both the dynamic cycle count and the clock period. We developed an advanced instruction scheduler for all five ICC models in order to quantify the dynamic cycle counts of our multimedia C benchmarks. To assess the clock period of the ICC models, we designed and laid out VLIW datapaths using RTL hardware descriptions derived from a deeply pipelined commercial TriMedia processor. In contrast to prior art, our research shows that fully distributed register file architectures (with eight clusters in our study) often underperform compared to moderately clustered machines with two or four clusters, because of an explosion of the cycle-count overhead in the former. Among the evaluated ICC models, the performance of the copy operation model, popular both in academia and industry, is severely limited by the copy operations hampering the scheduling of regular operations in high-ILP (instruction-level parallelism) code. The dedicated issue slots model combats this limitation by dedicating extra VLIW issue slots purely to ICC, reaching the highest execution time speedup of 1.74 relative to the unicluster. Furthermore, our VLSI experiments show that the lowest area and energy consumption, 42% and 57% of the unicluster, respectively, are achieved by the extended operands model, which nevertheless provides higher performance than the copy operation model.
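The practical difference between the ICC models can be illustrated with a toy bookkeeping example (hypothetical; this is not the paper's scheduler or cost model): for every operand whose producer sits in a different cluster than its consumer, the copy operations model spends a regular issue slot on an explicit copy, the dedicated issue slots model places that copy in an ICC-only slot, and the extended operands model lets the consumer read the remote register directly.

```c
/* Toy illustration of how the ICC model changes what must be scheduled. */
#include <stdio.h>

typedef struct { int producer_cluster, consumer_cluster; } operand_t;

int main(void)
{
    /* a small dataflow fragment; three operands cross a cluster boundary */
    operand_t ops[] = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {0, 1} };
    int n = sizeof ops / sizeof ops[0];

    int cross = 0;
    for (int i = 0; i < n; i++)
        if (ops[i].producer_cluster != ops[i].consumer_cluster)
            cross++;

    printf("cross-cluster operands      : %d\n", cross);
    printf("copy operations model       : %d extra ops in regular issue slots\n", cross);
    printf("dedicated issue slots model : %d copies, but in ICC-only slots\n", cross);
    printf("extended operands model     : 0 extra ops (direct remote read)\n");
    return 0;
}
```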


Microprocessors and Microsystems | 2014

Improving the design flow for parallel and heterogeneous architectures running real-time applications

Hector Posadas; Alejandro Nicolás; Pablo Peñil; Eugenio Villar; Florian Broekaert; Michel Bourdelles; Albert Cohen; Mihai Teodor Lazarescu; Luciano Lavagno; Andrei Terechko; Miguel Glassee; Manuel Prieto

In this article, we present the work in progress of the EU FP7 PHARAON project, started in September 2011. The first objective of the project is the development of new techniques and tools capable of guiding and assisting the designer in the development process, from UML specifications to implementation and debugging on a multicore platform. This tool chain will offer the possibility to propose and implement several parallelization strategies and to guide the designer through the implementation steps. The second objective of the project is to develop monitoring and control techniques in the middleware of the system capable of automatically adapting platform services to application requirements and thereby reducing power consumption in a manner transparent to applications.


International Conference on Consumer Electronics | 2010

Meandering based parallel 3DRS algorithm for the multicore era

Ghiath Al-Kadi; Jan Hoogerbrugge; Surendra Guntur; Andrei Terechko; Marc Duranton; Onno Eerenberg

This paper presents a method to parallelize the meandering-based 3D recursive search (3DRS) motion estimation algorithm used in scan-rate up-conversion. The proposed algorithm is scalable and can easily be mapped to multiple processing units such as multithreaded processors, multicores, and/or co-processors in order to cope with the increasingly hard-to-meet real-time requirements of next-generation video devices. Experiments show that the picture quality of the proposed parallel 3DRS algorithm is as good as that of the original, non-parallelized algorithm for most video sequences.
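For readers unfamiliar with the term, the sketch below only illustrates the meandering (serpentine) block scan that 3DRS relies on for its spatial predictors: even block rows are traversed left to right, odd rows right to left. The paper's actual parallelization of this scan is not reproduced here.

```c
/* Minimal sketch of a meandering (serpentine) block traversal. */
#include <stdio.h>

#define BLOCKS_X 6
#define BLOCKS_Y 4

int main(void)
{
    for (int y = 0; y < BLOCKS_Y; y++) {
        for (int i = 0; i < BLOCKS_X; i++) {
            /* reverse the horizontal direction on every other row */
            int x = (y % 2 == 0) ? i : BLOCKS_X - 1 - i;
            printf("estimate motion for block (%d,%d)\n", x, y);
        }
    }
    return 0;
}
```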


ACM Transactions on Embedded Computing Systems | 2012

Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures

Andrei Terechko; Jan Hoogerbrugge; Ghiath Al-Kadi; Surendra Guntur; Anirban Lahiri; Marc Duranton; Clemens C. Wüst; Phillip Christie; Axel Nackaerts; Aatish Kumar

Multicore architectures provide scalable performance with a lower hardware design effort than single-core processors. Our article presents a design methodology and an embedded multicore architecture, focusing on reducing the software design complexity and boosting the performance density. First, we analyze characteristics of the task-level parallelism in modern multimedia workloads. These characteristics are used to formulate requirements for the programming model. Then we translate the programming model requirements into an architecture specification, including a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design, while reducing performance by less than 18% relative to a complex snooping technique. Compared to a single processor core, multicores have already proven to be more area- and energy-efficient. However, multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost the performance density of multicores: microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. We then present a novel methodology to explore multicore design spaces, including the architectural methods that improve performance density. The methodology is based on a formula computing the performance of heterogeneous multicore systems. Using this design space exploration methodology for HD and Quad HD H.264 video decoding, we estimate that the required areas of the multicores in 45 nm CMOS are 2.5 mm² and 8.6 mm², respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide good programmability support.
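The article's performance formula is not reproduced here. As an illustration of the kind of model such a design-space exploration can rest on, below is a simplified Amdahl-style estimate for a heterogeneous multicore: a serial fraction runs on one big core, a parallel fraction is spread over n small cores of relative speed s, and an accelerated fraction is offloaded to a fixed-function unit with speedup a; performance density divides the resulting speedup by the total area. All numbers in the example are made-up placeholders.

```c
/* Simplified Amdahl-style model for a heterogeneous multicore
 * (illustrative stand-in, not the article's formula).            */
#include <stdio.h>

static double speedup(double f_par, double f_acc,
                      int n_small, double s_small, double a_acc)
{
    double f_ser = 1.0 - f_par - f_acc;
    double t = f_ser                          /* serial part, big core       */
             + f_par / (n_small * s_small)    /* parallel part, small cores  */
             + f_acc / a_acc;                 /* offloaded part, accelerator */
    return 1.0 / t;
}

int main(void)
{
    double area = 1.0 + 8 * 0.3 + 0.5;        /* big core + 8 small + accel  */
    double perf = speedup(0.6, 0.3, 8, 0.5, 20.0);

    printf("speedup            : %.2f\n", perf);
    printf("performance density: %.2f (speedup per unit area)\n", perf / area);
    return 0;
}
```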


International Electron Devices Meeting | 2008

Rapid design flows for advanced technology pathfinding

P. Christie; Axel Nackaerts; A. Kumar; Andrei Terechko; G. Doornbos

Several innovative modifications to standard design flows are described which enable new device technologies to be rapidly assessed at the system level. Cell libraries from these rapid flows are employed by a design flow description language (PSYCHIC) for the exploration of highly speculative "what if" scenarios. These rapid design flows are used to explore the performance of two competing 15 nm technologies in a system L2 cache controller and to perform a PSYCHIC analysis of statistical timing variations in a 45 nm memory concentrator.

Collaboration


Dive into Andrei Terechko's collaborations.

Top Co-Authors

Miguel Glassee, Katholieke Universiteit Leuven

Adrien Guatto, École Normale Supérieure