
Vahid Lari

University of Erlangen-Nuremberg

Publication

Featured research published by Vahid Lari.

Asia and South Pacific Design Automation Conference | 2012

Invasive manycore architectures

Jörg Henkel; Andreas Herkersdorf; Lars Bauer; Thomas Wild; Michael Hübner; Ravi Kumar Pujari; Artjom Grudnitsky; Jan Heisswolf; Aurang Zaib; Benjamin Vogel; Vahid Lari; Sebastian Kobbe

This paper introduces a scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasive-specific hardware assist functions is analytically shown and their integration into the overall manycore environment is described.

Application-Specific Systems, Architectures, and Processors | 2011

Decentralized dynamic resource management support for massively parallel processor arrays

Vahid Lari; Andriy Narovlyanskyy; Frank Hannig; Jürgen Teich

This paper presents a hardware-supported resource management methodology for massively parallel processor arrays. It enables processing elements to autonomously explore resource availability in their neighborhood. To support resource exploration, we introduce specialized controllers, which can be attached to each of the processing elements. We propose different types of architectures for the exploration controller: fast FSM-based designs as well as flexible programmable controllers. These controllers make it possible to implement different distributed resource exploration strategies, so that parallel programs can explore and reserve available resources according to their application requirements. Hardware cost evaluations show that the cost of the simplest implementation of our programmable controller is comparable to that of our FSM-based implementations, while offering the flexibility to implement different exploration strategies. We show that the proposed distributed approach can achieve a significant speedup in comparison with centralized resource exploration methods.

ACM Transactions on Design Automation of Electronic Systems | 2013

Hierarchical power management for adaptive tightly-coupled processor arrays

Vahid Lari; Shravan Muddasani; Srinivas Boppu; Frank Hannig; Moritz Schmid; Jürgen Teich

We present a self-adaptive hierarchical power management technique for massively parallel processor architectures, supporting a new resource-aware parallel computing paradigm called invasive computing. Here, an application can dynamically claim, execute on, and release resources in three phases: resource acquisition (invade), program loading/configuration and execution (infect), and release (retreat). Resource invasion is governed by dedicated decentralized hardware controllers, called invasion controllers (ictrls), which are integrated into each processing element (PE). Several invasion strategies for claiming linearly connected or rectangular regions of processing resources are implemented. The key idea is to exploit the decentralized resource management inherent to invasive computing for power savings by enabling applications themselves to control the power for processing resources and invasion controllers using a hierarchical power-gating approach. We propose analytical models for estimating various components of energy consumption for faster design space exploration and compare them with the results obtained from a cycle-accurate C++ simulator of the processor array. In order to find optimal design trade-offs, (a) energy consumption, (b) hardware cost, and (c) timing overheads are compared for different sizes of power domains. Experimental results show significant energy savings (up to 73%) for selected characteristic algorithms and different resource utilizations. In addition, we demonstrate the accuracy of our proposed analytical model, with estimation errors below 3.6%.
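The three-phase lifecycle described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the `PE` and `Array` classes, the 8-element array size, and the flat per-PE `powered` flag (standing in for the hierarchical power domains) are all hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PE:
    """One processing element; every field is an illustrative assumption."""
    owner: Optional[str] = None                      # application holding the PE
    powered: bool = False                            # power-gated while unclaimed
    program: Optional[Callable[[int], int]] = None   # loaded configuration


@dataclass
class Array:
    """A linear array of PEs (size 8 is an arbitrary choice)."""
    pes: List[PE] = field(default_factory=lambda: [PE() for _ in range(8)])

    def invade(self, app: str, start: int, count: int) -> List[int]:
        """Claim up to `count` free, linearly connected PEs beginning at `start`."""
        claimed: List[int] = []
        for i in range(start, len(self.pes)):
            if len(claimed) == count or self.pes[i].owner is not None:
                break                      # enough PEs, or the chain is blocked
            self.pes[i].owner = app
            self.pes[i].powered = True     # wake the PE when it is invaded
            claimed.append(i)
        return claimed

    def infect(self, claimed: List[int], program: Callable[[int], int]) -> List[int]:
        """Load `program` onto the claimed PEs and run it once on each."""
        for i in claimed:
            self.pes[i].program = program
        return [program(i) for i in claimed]

    def retreat(self, claimed: List[int]) -> None:
        """Release the PEs; resetting them also powers them down again."""
        for i in claimed:
            self.pes[i] = PE()
```

In this model, an application that asks for three PEs with `invade("app_a", 0, 3)` receives the indices it actually obtained, runs its kernel through `infect`, and frees the region with `retreat`, after which the PEs are unpowered again.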

Computing Frontiers | 2013

System integration of tightly-coupled processor arrays using reconfigurable buffer structures

Frank Hannig; Moritz Schmid; Vahid Lari; Srinivas Boppu; Jürgen Teich

As data locality is a key factor for the acceleration of loop programs on processor arrays, we propose a buffer architecture that can be configured at run-time to select between different schemes for memory access. In addition to traditional address-based memory banks, the buffer architecture can deliver data in a streaming manner to the processing elements of the array, which supports dense and sparse stencil operations. Moreover, to minimize data transfers to the buffers, the design contains an interlinked mode, which is especially targeted at 2-D kernel computations. The buffers can be used individually to achieve high data throughput by utilizing a maximum number of I/O channels to the array, or concatenated to provide higher storage capacity with a reduced number of I/O channels.

IEEE Transactions on Computers | 2017

Power Density-Aware Resource Management for Heterogeneous Tiled Multicores

Heba Khdr; Santiago Pagani; Ericles Rodrigues Sousa; Vahid Lari; Anuj Pathania; Frank Hannig; Muhammad Shafique; Jürgen Teich; Jörg Henkel

Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system-level constraint in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique adapts the power density constraint at runtime according to the characteristics of the executed applications and reacts to workload changes. Thus, the available thermal headroom is exploited to maximize the overall system performance.
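The idea of treating power density as a per-tile constraint can be sketched as follows. This is a deliberately simplified illustration and not the technique from the paper: it uses the textbook dynamic power approximation (power proportional to effective capacitance times voltage squared times frequency), ignores leakage and temperature, and all names (`dynamic_power`, `pick_level`, `c_eff`, `max_density`) and units are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Level = Tuple[float, float]  # (voltage, frequency); units are illustrative


def dynamic_power(c_eff: float, v: float, f: float, active_cores: int) -> float:
    """Textbook dynamic power estimate: cores * C_eff * V^2 * f (leakage ignored)."""
    return active_cores * c_eff * v * v * f


def pick_level(levels: List[Level], c_eff: float, active_cores: int,
               area: float, max_density: float) -> Optional[Level]:
    """Return the fastest shared V/f level whose power density fits the limit.

    All cores in a tile share one level, so the whole tile is checked at once.
    Returns None when even the slowest level violates the constraint.
    """
    best: Optional[Level] = None
    for v, f in levels:
        density = dynamic_power(c_eff, v, f, active_cores) / area
        if density <= max_density and (best is None or f > best[1]):
            best = (v, f)
    return best
```

With this model, lowering the number of active cores in a tile frees headroom for a higher voltage/frequency level, which mirrors the trade-off between degree of parallelism and V/f level mentioned in the abstract.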

IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Distributed Resource Reservation in Massively Parallel Processor Arrays

Vahid Lari; Frank Hannig; Jürgen Teich

This paper proposes a methodology for applications to automatically claim linear arrays of processing elements within massively parallel processor arrays at run-time, depending on the available degree of parallelism or dynamic computing requirements. Using this methodology, parallel programs running on individual processing elements gain the capability of autonomously managing the available processing resources in their neighborhood. We present different protocols and architectural support for gathering and transporting the result of a resource exploration in order to inform a configuration loader about the number and location of the claimed resources. The timing and data overhead costs of four different approaches are evaluated mathematically. To verify these decentralized algorithms, a simulation platform has been developed that compares the data overhead and scalability of each approach for different sizes of processor arrays.
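A linear claim that relies only on neighborhood information can be modeled in a few lines. The sketch below is a sequential software simulation, not one of the four protocols evaluated in the paper: the function name `invade_linear`, the set-based representation of free PEs, and the fixed neighbor preference order are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

Coord = Tuple[int, int]  # (row, column) of a processing element

# Neighbor preference order (east, south, west, north) is an assumption.
DIRECTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def invade_linear(free: Set[Coord], start: Coord, want: int) -> List[Coord]:
    """Claim up to `want` linearly connected PEs, beginning at `start`.

    Each step uses only local knowledge: the most recently claimed PE
    inspects its four neighbors and passes the request to the first free
    one. The returned path gives the number and location of claimed PEs,
    which is the information a configuration loader would need.
    `free` is updated in place.
    """
    if want <= 0 or start not in free:
        return []
    free.discard(start)
    path = [start]
    while len(path) < want:
        r, c = path[-1]
        nxt = next(((r + dr, c + dc) for dr, dc in DIRECTIONS
                    if (r + dr, c + dc) in free), None)
        if nxt is None:
            break  # dead end: report what has been claimed so far
        free.discard(nxt)
        path.append(nxt)
    return path
```

Because the request may hit a dead end, the caller has to compare the length of the returned path with the number of PEs it asked for and adapt its degree of parallelism accordingly.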

Application-Specific Systems, Architectures, and Processors | 2015

On-demand fault-tolerant loop processing on massively parallel processor arrays

Alexandru Tanase; Michael Witterauf; Jürgen Teich; Frank Hannig; Vahid Lari

We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.
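The replication-plus-voting transformation can be illustrated at a very small scale. This is a minimal sketch, not the compiler transformation from the paper: it votes after every iteration (only one possible placement, chosen here for simplicity), and the helper names `replicated_loop`, `dmr_compare`, and `tmr_vote` as well as the bit-flip fault injection are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def replicated_loop(body: Callable[[int], int], xs: List[int], replicas: int,
                    inject: Optional[Tuple[int, int]] = None) -> List[List[int]]:
    """Run `body` over `xs` in `replicas` independent copies.

    `inject=(replica, index)` flips the lowest bit of one result to
    emulate a soft error in that replica.
    """
    outputs = []
    for r in range(replicas):
        out = [body(x) for x in xs]
        if inject is not None and inject[0] == r:
            out[inject[1]] ^= 1
        outputs.append(out)
    return outputs


def dmr_compare(a: int, b: int) -> Tuple[int, bool]:
    """DMR: return the first replica's value and whether a mismatch was detected."""
    return a, a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: return the majority value, masking a single faulty replica."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        raise ValueError("no majority: more than one replica is faulty")
    return value
```

Applying `tmr_vote` element-wise to three replica outputs masks a single injected error, whereas `dmr_compare` on two replicas can only flag the affected iteration. This is the throughput-versus-reliability trade the abstract refers to, since TMR occupies three regions of the array instead of two.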

Adaptive Hardware and Systems | 2015

A co-design approach for fault-tolerant loop execution on Coarse-Grained Reconfigurable Arrays

Vahid Lari; Alexandru Tanase; Jürgen Teich; Michael Witterauf; Faramarz Khosravi; Frank Hannig; Brett H. Meyer

We present a co-design approach for applying redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) to a whole region of a processor array for a class of Coarse-Grained Reconfigurable Arrays (CGRAs). The approach targets applications with mixed-criticality properties that experience varying Soft Error Rates (SERs) due to environmental factors, e.g., changing altitude. The core idea is to adapt the degree of fault protection for loop programs executing in parallel on a CGRA to the required level of reliability as well as to SER profiles. This is realized by claiming neighboring regions of processing elements for the execution of replicated loop nests. First, at the source code level, a compiler transformation is proposed that realizes these replication schemes in two steps: (1) replicate a given parallel loop program two or three times for DMR or TMR, respectively, and (2) add appropriate error handling functions (comparison or voting) in order to detect or correct, respectively, any single error. Then, using the opportunities of hardware/software co-design, we propose optimized implementations of the error handling functions in software as well as in hardware. Finally, experimental results are given that analyze the reliability gains of each proposed array replication scheme as a function of different SERs.

PARS-Mitteilungen | 2013

Acceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays

Ericles Rodrigues Sousa; Alexandru Tanase; Vahid Lari; Frank Hannig; Jürgen Teich; Johny Paul; Walter Stechele; Manfred Kröhnert; Tamim Asfour

Optical flow is widely used in many applications of portable mobile devices and automotive embedded systems to determine the motion of objects in a visual scene. In robotics, it is also used for motion detection, object segmentation, time-to-contact information, focus of expansion calculations, robot navigation, and automatic parking for vehicles. Similar to many other image processing algorithms, optical flow applies pixel operations repeatedly over whole image frames. Thus, it provides a high degree of fine-grained parallelism, which can be efficiently exploited on massively parallel processor arrays. In this context, we propose to accelerate the computation of complex motion estimation vectors on programmable tightly-coupled processor arrays, which offer high flexibility enabled by coarse-grained reconfiguration capabilities. A further novelty is that the degree of parallelism may be adapted to the number of processors that are available to the application. Finally, we present an implementation that (a) is 18 times faster than an FPGA-based soft processor implementation, and (b) may be adapted to different QoS requirements, hence being more flexible than a dedicated hardware implementation.

Archive | 2016

Invasive Tightly Coupled Processor Arrays

Vahid Lari

In this chapter, after introducing the principles of invasive computing and a considered multi-processor system-on-a-chip (MPSoC) architecture, we go into more detail by introducing Tightly Coupled Processor Arrays (TCPAs), a class of coarse-grained reconfigurable processor arrays. After briefly explaining our loop mapping methodology on such architectures, we make the following contributions for realising invasive computing concepts on TCPAs: (a) development of ultra-fast, distributed, hardware-based resource invasion strategies to acquire regions of Processing Elements (PEs) of different shapes and sizes; (b) proposal of two different design variants for realising invasion strategies at the hardware level, with an evaluation of their timing overheads and hardware costs; (c) investigation of different signalling concepts and data structures to collect information about the number and the location of invaded PEs; (d) development of the hardware/software interfaces for integrating TCPAs into a tiled architecture; and (e) evaluation of the hardware costs and timing overheads based on prototype implementations on FPGA hardware.
