Ericles Rodrigues Sousa

Publication

Featured research published by Ericles Rodrigues Sousa.

IEEE Transactions on Computers | 2017

Power Density-Aware Resource Management for Heterogeneous Tiled Multicores

Heba Khdr; Santiago Pagani; Ericles Rodrigues Sousa; Vahid Lari; Anuj Pathania; Frank Hannig; Muhammad Shafique; Jürgen Teich; Jörg Henkel

Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system-level constraint in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique adapts the power density constraint at runtime according to the characteristics of the executed applications and reacts to workload changes. Thus, the available thermal headroom is exploited to maximize the overall system performance.
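To make the idea of a per-tile power density constraint concrete, here is a minimal sketch. It is not the algorithm from the paper: the function name `pick_level`, the level table, the tile area, and the density limit are all hypothetical, and the sketch only picks, for one tile, the fastest voltage/frequency level whose power per unit area stays within a given limit.

```python
from typing import List, Optional, Tuple

# (frequency in GHz, tile power in W) -- hypothetical numbers for illustration
Level = Tuple[float, float]

def pick_level(levels: List[Level],
               tile_area_mm2: float,
               max_density_w_per_mm2: float) -> Optional[Level]:
    """Return the highest-frequency level whose power density is within the limit.

    Returns None when no level satisfies the constraint (the tile stays dark).
    """
    feasible = [lv for lv in levels
                if lv[1] / tile_area_mm2 <= max_density_w_per_mm2]
    if not feasible:
        return None
    return max(feasible, key=lambda lv: lv[0])

# Example with assumed values: a 10 mm^2 tile and a 0.6 W/mm^2 limit.
levels = [(1.0, 2.0), (2.0, 5.0), (3.0, 9.0)]
chosen = pick_level(levels, tile_area_mm2=10.0, max_density_w_per_mm2=0.6)
```

Raising or lowering `max_density_w_per_mm2` at runtime changes which levels are feasible, which is roughly the kind of adaptation the abstract describes.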

digital systems design | 2014

Runtime Reconfigurable Bus Arbitration for Concurrent Applications on Heterogeneous MPSoC Architectures

Ericles Rodrigues Sousa; Deepak Gangadharan; Frank Hannig; Jürgen Teich

This paper describes a runtime reconfigurable bus arbitration technique for concurrent applications on heterogeneous MPSoC architectures. Here, a hardware/software approach is introduced as part of a runtime framework that enables selecting and adapting different policies (i.e., fixed-priority, TDMA, and round-robin) such that the performance goals of concurrent applications can be satisfied. To evaluate the hardware cost, we compare our proposed solution with a well-known SPARC V8 architecture supporting fixed-priority arbitration. Notably, even though it provides the flexibility of selecting among up to three different policies, our reconfigurable arbiter needs only 25% more LUTs and 7% more slice registers. The reconfiguration overhead for changing between different policies is 56 cycles, and only 28 cycles are necessary for programming new time slots. To demonstrate the benefits of this reconfiguration framework, we set up a mixed hard/soft real-time scenario by considering four applications with different timeliness requirements. The experimental results show that by reconfiguring the arbiter, fewer processing elements can be used to achieve a specific target frame rate. Moreover, by adjusting the time slots for TDMA, we can speed up a soft real-time algorithm while still satisfying the deadline for hard real-time applications.
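The three policies named in the abstract can be sketched in software as follows. This is a textbook-style illustration under stated assumptions, not the hardware arbiter from the paper: the function names, the convention that index 0 has the highest priority, and the `slots` list of per-master slot lengths in cycles are all hypothetical.

```python
from typing import List, Optional

def fixed_priority(requests: List[bool]) -> Optional[int]:
    """Grant the lowest-indexed requester (index 0 = highest priority)."""
    for master, wants_bus in enumerate(requests):
        if wants_bus:
            return master
    return None

def round_robin(requests: List[bool], last_grant: int) -> Optional[int]:
    """Grant the first requester found after the previously granted master."""
    n = len(requests)
    for offset in range(1, n + 1):
        master = (last_grant + offset) % n
        if requests[master]:
            return master
    return None

def tdma(requests: List[bool], cycle: int, slots: List[int]) -> Optional[int]:
    """Grant the owner of the current time slot, but only if it is requesting.

    slots[i] is the slot length in cycles owned by master i; the schedule
    repeats every sum(slots) cycles, which is assumed to be positive.
    """
    position = cycle % sum(slots)
    for master, length in enumerate(slots):
        if position < length:
            return master if requests[master] else None
        position -= length
    return None
```

In this sketch, switching policies corresponds to calling a different function, and reprogramming time slots corresponds to passing a different `slots` list.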

PARS-Mitteilungen | 2013

Acceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays

Ericles Rodrigues Sousa; Alexandru Tanase; Vahid Lari; Frank Hannig; Jürgen Teich; Johny Paul; Walter Stechele; Manfred Kröhnert; Tamim Asfour

Optical flow is widely used in many applications of portable mobile devices and automotive embedded systems to determine the motion of objects in a visual scene. In robotics, it is also used for motion detection, object segmentation, time-to-contact information, focus of expansion calculations, robot navigation, and automatic parking for vehicles. Similar to many other image processing algorithms, optical flow applies pixel operations repeatedly over whole image frames. Thus, it provides a high degree of fine-grained parallelism which can be efficiently exploited on massively parallel processor arrays. In this context, we propose to accelerate the computation of complex motion estimation vectors on programmable tightly-coupled processor arrays, which offer high flexibility enabled by coarse-grained reconfiguration capabilities. A further novelty is that the degree of parallelism may be adapted to the number of processors that are available to the application. Finally, we present an implementation that (a) is 18 times faster than an FPGA-based soft processor implementation, and (b) may be adapted to different QoS requirements, hence being more flexible than a dedicated hardware implementation.

Information Technology | 2016

Dark silicon management: an integrated and coordinated cross-layer approach

Santiago Pagani; Lars Bauer; Qingqing Chen; Elisabeth Glocker; Frank Hannig; Andreas Herkersdorf; Heba Khdr; Anuj Pathania; Ulf Schlichtmann; Doris Schmitt-Landsiedel; Mark Sagi; Ericles Rodrigues Sousa; Philipp Wagner; Volker Wenzel; Thomas Wild; Jörg Henkel

This paper presents an integrated and coordinated cross-layer sensing and optimization flow for distributed dark silicon management for tiled heterogeneous manycores under a critical temperature constraint. We target some of the key challenges in dark silicon for manycores, such as: directly focusing on power density/temperature instead of considering simple per-chip power constraints, considering tiled heterogeneous architectures with different types of cores and accelerators, handling the large volumes of raw sensor information, and maintaining scalability. Our solution is separated into three abstraction layers: a sensing layer (involving hardware monitors and pre-processing), a dark silicon layer (that derives thermally-safe mappings and voltage/frequency settings), and an agent layer (used for selecting the parallelism of applications and thread-to-core mapping based on alternatives/constraints from the dark silicon layer).

software and compilers for embedded systems | 2015

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Ericles Rodrigues Sousa; Frank Hannig; Jürgen Teich; Qingqing Chen; Ulf Schlichtmann

Massively Parallel Processor Arrays (MPPAs) are well suited to portable devices such as tablets and smartphones. However, applications running on mobile platforms require a certain performance level or quality (e.g., high-resolution image processing) that needs to be satisfied while adhering to a certain power budget and temperature threshold. As a solution to these challenges, we consider a resource-aware computing paradigm to exploit runtime adaptation without violating any thermal and/or power constraint in a programmable MPPA. For estimating the power consumption, we developed a mathematical model based on the post-synthesis implementation of an MPPA in different CMOS technologies, while the temperature variation was emulated. We showcase our hardware/software mechanism to load new, on-the-fly configurations into the accelerator, considering quality/throughput tradeoffs for image processing applications. The results show that the average power consumption of Sobel and Laplace operators using different numbers of processing elements amounts to 1.24 mW and 10.35 mW, respectively. Furthermore, only 1.64 μs are necessary for configuring a class of MPPA running at 550 MHz.
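For a sense of scale, the reported configuration time can be converted into clock cycles. The short calculation below assumes that configuration is clocked at the same 550 MHz as the array, which the abstract does not state explicitly; the function name `config_cycles` is hypothetical.

```python
def config_cycles(config_time_s: float, clock_hz: float) -> int:
    """Convert a configuration time into a whole number of clock cycles."""
    return round(config_time_s * clock_hz)

# 1.64 microseconds at 550 MHz, both figures taken from the abstract.
cycles = config_cycles(1.64e-6, 550e6)  # about 902 cycles
```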

international embedded systems symposium | 2015

Reconfigurable Buffer Structures for Coarse-Grained Reconfigurable Arrays

Ericles Rodrigues Sousa; Frank Hannig; Jürgen Teich

Coarse-Grained Reconfigurable Arrays (CGRAs) have emerged as a powerful solution to speed up computationally intensive applications. Heterogeneous MPSoC architectures containing such reconfigurable accelerators have the advantage of providing high flexibility, power efficiency, and high performance. However, CGRAs may suffer from a data access bottleneck. To mitigate this problem, we present a reconfigurable buffer architecture for CGRAs. Here, the buffers can be configured at runtime to select between different schemes for memory access, i.e., addressable RAMs or pixel buffers. We showcase the benefits of our approach by prototyping a heterogeneous MPSoC architecture containing a RISC processor and a class of CGRA called Tightly Coupled Processor Arrays (TCPAs). The architecture is prototyped in FPGA technology. For basic image processing algorithms, we demonstrate that our proposed buffer structures for system integration increase the memory bandwidth utilization and enable a performance improvement of up to 7% in comparison to state-of-the-art solutions for image processing.

Journal of Systems Architecture | 2015

Resource-awareness on heterogeneous MPSoCs for image processing

Johny Paul; Walter Stechele; Benjamin Oechslein; Christoph Erhardt; Jens Schedel; Daniel Lohmann; Manfred Kröhnert; Tamim Asfour; Ericles Rodrigues Sousa; Vahid Lari; Frank Hannig; Jürgen Teich; Artjom Grudnitsky; Lars Bauer; Jörg Henkel

Explore the benefits of using heterogeneous MPSoC for image processing. Different types of processing elements lead to programming issues on MPSoC. The conventional scheme (static mapping) does not achieve the best results. A case study based on a resource-aware programming model named Invasive Computing. The resource-aware model leads to better quality and lower latency on MPSoC.

Multiprocessor system-on-chip (MPSoC) designs offer a lot of computational power assembled in a compact design. The computing power of MPSoCs can be further augmented by adding massively parallel processor arrays (MPPA) and specialized hardware with instruction-set extensions. On-chip MPPAs can be used to accelerate low-level image-processing algorithms with massive inherent parallelism. However, the presence of multiple processing elements (PEs) with different characteristics raises issues related to programming and application mapping, among others. The conventional approach used for programming heterogeneous MPSoCs results in a static mapping of various parts of the application to different PE types, based on the nature of the algorithm and the structure of the PEs. Yet, such a mapping scheme, independent of the instantaneous load on the PEs, may lead to under-utilization of some types of PEs while overloading others. In this work, we investigate the benefits of using a heterogeneous MPSoC for accelerating various stages within a real-world image-processing algorithm for object recognition. A case study demonstrates that a resource-aware programming model called Invasive Computing helps to improve the throughput and worst observed latency of the application program by dynamically mapping applications to different types of PEs available on a heterogeneous MPSoC.

conference on design and architectures for signal and image processing | 2014

Self-adaptive Harris corner detector on heterogeneous many-core processor

Johny Paul; Walter Stechele; Ericles Rodrigues Sousa; Vahid Lari; Frank Hannig; Jürgen Teich; Manfred Kröhnert; Tamim Asfour

Recent years have seen the emergence of heterogeneous system architecture (HSA), which offers massive computational power assembled into a compact design. Computer vision applications with massive inherent parallelism benefit greatly from such heterogeneous processors with on-chip CPU and GPU units. The highly parallel and compute-intensive parts of the application program can be mapped to the GPU, while the control flow and high-level tasks may run on the CPU. However, such processors pose a considerable challenge to software development due to their hybrid architecture. Sharing of resources (GPU or CPU) among applications running concurrently leads to variations in the processing interval, and prolonged processing intervals lead to low-quality results (frame drops) for computer vision algorithms. In this work, we propose resource-awareness and self-organisation within the application layer to adapt to available resources on the heterogeneous processor. The benefits of the new model are demonstrated using a widely used computer vision algorithm called the Harris corner detector. A resource-aware runtime system and a heterogeneous processor were used for evaluation, and the results indicate a well-constrained processing interval and reduced frame drops. Our evaluations demonstrate up to 20% improvements in processing rate and accuracy of the detected corner points for Harris corner detection.

asilomar conference on signals, systems and computers | 2014

Application-driven reconfiguration of shared resources for timing predictability of MPSoC platforms

Deepak Gangadharan; Ericles Rodrigues Sousa; Vahid Lari; Frank Hannig; Jürgen Teich

The growing demand for computationally intensive algorithms/applications has resulted in the widespread acceptance of heterogeneous MPSoC platforms. The primary reason for this trend is the better performance and power efficiency exhibited by heterogeneous architectures consisting of standard processor cores and hardware accelerators. However, multiple processors accessing shared resources such as cache/memory and buses may lead to significant contention on them, thereby decreasing not only the performance, but also timing predictability. Moreover, the effect of shared resource contention worsens in the presence of multiple application scenarios with different execution and communication bandwidth requirements. To mitigate this problem, we first propose a Dynamic Bus Reconfiguration Policy (DBRP) that decides when to reconfigure a shared bus between Non-Preemptive Fixed Priority (NP-FP) and Time-Division Multiple Access (TDMA) scheduling. The required TDMA slot sizes are computed on-the-fly before NP-FP to TDMA reconfiguration such that deadlines of all Hard Real-Time (HRT) applications are satisfied and all Soft Real-Time (SRT) applications are serviced evenly. Our proposed DBRP has been implemented on a real MPSoC platform consisting of cores connected by the AMBA AHB. The case studies demonstrate that reconfiguration of bus arbitration ensures that communication deadline constraints of HRT applications are maximally satisfied with low hardware and reconfiguration overhead.

conference on design and architectures for signal and image processing | 2013