Alexandru Tanase
University of Erlangen-Nuremberg
Publication
Featured research published by Alexandru Tanase.
application-specific systems, architectures, and processors | 2013
Jürgen Teich; Alexandru Tanase; Frank Hannig
In this paper, we present a first solution to the previously unsolved problem of jointly tiling and scheduling a given loop nest with uniform data dependencies symbolically. This problem arises for loop programs whose iterations shall be optimally scheduled on a processor array of unknown size at compile time. Still, we show that it is possible to derive parameterized latency-optimal schedules statically by proposing two new program transformations: In the first step, the iteration space is tiled symbolically into orthotopes of parametrized extensions. The resulting tiled program is subsequently scheduled symbolically. Here, we show that the maximal number of potentially optimal schedules is upper bounded by 2^n · n!, where n is the dimension of the loop nest; the actual number of optimal schedule candidates, however, is typically much smaller than this bound. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions steer which of these schedules will be dynamically activated and which program configuration executed on the resulting processor array, so as to avoid any further run-time optimization or expensive recompilation.
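The 2^n · n! bound has a simple combinatorial reading: a schedule candidate corresponds to a scanning order of the loop nest, i.e., a permutation of the n tiled dimensions combined with a traversal direction (forward or backward) per dimension. A minimal sketch (illustrative only; the paper's actual candidate construction is more involved):

```python
from itertools import permutations, product

def schedule_candidates(n):
    """Enumerate all candidate scanning orders for an n-deep loop nest:
    a permutation of the n dimensions combined with a traversal
    direction (+1 or -1) per dimension -- hence at most 2^n * n!
    candidates, matching the stated upper bound."""
    return [(perm, dirs)
            for perm in permutations(range(n))
            for dirs in product((1, -1), repeat=n)]

print(len(schedule_candidates(3)))  # 48 == 2**3 * 3!
```

In practice, symmetry and the structure of the data dependencies prune most of these candidates, which is why the real number of optimal schedules is far smaller.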
signal processing systems | 2014
Jürgen Teich; Alexandru Tanase; Frank Hannig
In this paper, we present a solution to the problem of joint tiling and scheduling a given loop nest with uniform data dependencies symbolically. This challenge arises when the size and number of available processors for parallel loop execution is not known at compile time. But still, in order to avoid any overhead of dynamic (run-time) recompilation, a schedule of loop iterations shall be computed and optimized statically. In this paper, it will be shown that it is possible to derive parameterized latency-optimal schedules statically by proposing a two step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parametrized extensions. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be dynamically selected and the corresponding program configuration executed on the resulting processor array so to avoid any further run-time optimization or expensive recompilation. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate our proposed methodology for a massively parallel processor array architecture called tightly coupled processor array (TCPA) on which applications may dynamically claim regions of processors in the context of invasive computing.
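The run-time step reduces to evaluating each candidate's symbolic latency expression for the now-known array size and activating the minimum. A hedged sketch, with made-up latency formulas (not taken from the paper) for a 2D loop nest of extent N × N on a p1 × p2 array:

```python
def select_schedule(candidates, p1, p2):
    """Once the processor array size (p1 x p2) becomes known at run
    time, evaluate each candidate's symbolic latency expression and
    pick the one with the smallest latency -- no recompilation needed.
    `candidates` maps a schedule name to a latency function of (p1, p2)."""
    return min(candidates, key=lambda name: candidates[name](p1, p2))

# Hypothetical latency expressions, illustrative only:
N = 1024
candidates = {
    "row-major": lambda p1, p2: (N // p1) * N + p2,
    "col-major": lambda p1, p2: (N // p2) * N + p1,
}
print(select_schedule(candidates, 4, 16))  # col-major (65540 < 262160 cycles)
```

This is exactly why the approach needs no run-time optimizer: the hard work (deriving the candidate set and the latency expressions) happens once at compile time, and the run-time decision is a handful of integer comparisons.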
application-specific systems, architectures, and processors | 2014
Moritz Schmid; Alexandru Tanase; Frank Hannig; Jürgen Teich; Vivek Singh Bhadouria; Dibyendu Ghoshal
High-Level Synthesis (HLS) has become a very popular instrument to facilitate rapid development of production-ready implementations for FPGAs. The ever-increasing flexibility of the frameworks, however, demands a very high level of domain-specific knowledge from the designer. Examples of such knowledge in window-based image processing are median computation and border handling. Depending on the size of the considered window, writing the code to perform such operations may become overwhelming even at very high abstraction levels. To increase productivity and to make the underlying architecture accessible to non-experts, we propose to combine HLS with domain-specific augmentations. Specifically, we propose a new language extension in the form of a reduction for sorting and median computation. Furthermore, we introduce a new high-level transformation to perform multiple kinds of border treatment automatically. Both augmentations may reduce the required number of lines of code considerably. The increase in productivity is analyzed by comparing the lines of code necessary to specify a median filter for HLS in PAULA for synthesis using PARO and in C++ for synthesis using a commercial HLS tool.
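To illustrate what the two augmentations abstract away, here is what a designer otherwise writes by hand: a median over a k × k window plus explicit border handling (clamping shown here, one of several strategies such a transformation could generate). This is a plain Python sketch, not PAULA code:

```python
def median_window(window):
    """Median of a flattened k x k pixel window; in the proposed PAULA
    extension this would be a one-line reduction."""
    s = sorted(window)
    return s[len(s) // 2]

def clamp(v, lo, hi):
    """Border treatment by clamping coordinates to the image bounds."""
    return max(lo, min(hi, v))

def median_filter(img, k=3):
    """Hand-written window-based median filter with border handling --
    the boilerplate the paper's augmentations aim to eliminate."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            out[y][x] = median_window(window)
    return out

print(median_window([7, 1, 9]))  # 7
```

Even this small example shows why lines of code explode with window size: the sorting logic and every border case must be spelled out per design.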
application-specific systems, architectures, and processors | 2015
Alexandru Tanase; Michael Witterauf; Jürgen Teich; Frank Hannig; Vahid Lari
We present a compilation-based technique for providing on-demand structural redundancy for massively parallel processor arrays. Thereby, application programmers gain the capability to trade throughput for reliability according to application requirements. To protect parallel loop computations against errors, we propose to apply the well-known fault tolerance schemes dual modular redundancy (DMR) and triple modular redundancy (TMR) to a whole region of the processor array rather than individual processing elements. At the source code level, the compiler realizes these replication schemes with a program transformation that: (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or the expected soft error rates. Finally, we explore the different tradeoffs of these variants in terms of performance overheads and error detection latency.
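The replication-plus-voting transformation can be pictured with a small sketch: each loop iteration runs on three replicas, and a majority voter masks any single faulty result. The loop body and the injected fault below are hypothetical, purely for illustration:

```python
def majority_vote(r1, r2, r3):
    """TMR voter: any two agreeing replicas mask a single faulty one."""
    if r1 == r2 or r1 == r3:
        return r1
    if r2 == r3:
        return r2
    raise RuntimeError("no majority: more than one replica disagrees")

def body(i):
    """Hypothetical loop body of the parallel loop program."""
    return i * i

# Replicate each iteration three times; inject a soft error into the
# second replica at iteration 3 to show that the voter masks it:
results = []
for i in range(6):
    r1, r2, r3 = body(i), body(i), body(i)
    if i == 3:
        r2 ^= 0xFF  # bit flip modelling a soft error
    results.append(majority_vote(r1, r2, r3))
print(results)  # [0, 1, 4, 9, 16, 25] -- the error at i == 3 is masked
```

Voting after every iteration, as here, gives the shortest error detection latency; the paper's other variants vote less frequently to lower the performance overhead.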
adaptive hardware and systems | 2015
Vahid Lari; Alexandru Tanase; Jürgen Teich; Michael Witterauf; Faramarz Khosravi; Frank Hannig; Brett H. Meyer
We present a co-design approach to apply redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) to a whole region of a processor array for a class of Coarse-Grained Reconfigurable Arrays (CGRAs). The approach targets applications with mixed-criticality properties that experience varying Soft Error Rates (SERs) due to environmental factors, e.g., changing altitude. The core idea is to adapt the degree of fault protection for loop programs executing in parallel on a CGRA to the required level of reliability as well as to SER profiles. This is realized by claiming neighboring regions of processing elements for the execution of replicated loop nests. First, at the source code level, a compiler transformation is proposed that realizes these replication schemes in two steps: (1) replicate a given parallel loop program two or three times for DMR or TMR, respectively, and (2) add appropriate error handling functions (comparison or voting, respectively) in order to detect or correct any single errors. Then, exploiting the opportunities of hardware/software co-design, we propose optimized implementations of the error handling functions in software as well as in hardware. Finally, experimental results are given for the analysis of reliability gains for each proposed scheme of array replication as a function of different SERs.
international conference on formal methods and models for co-design | 2014
Alexandru Tanase; Michael Witterauf; Jürgen Teich; Frank Hannig
This paper presents a first solution to the unsolved problem of symbolically scheduling a given loop nest with uniform data dependences using inner loop parallelization, in particular, the locally parallel, globally sequential (LPGS) mapping technique. This technique is needed in the case of loop program specifications for which the iterations shall be scheduled on a processor array of unknown size at compile time while keeping the local memory consumption independent of the problem size of the mapped loop nest. We show that it is possible to derive such parameterized LPGS schedules statically by proposing a mixed compile-/runtime approach: At compile time, we first determine the set of all schedule candidates, each latency-optimal for a different scanning order of the loop nest. Then we devise an exact parameterized formula for determining the latency of the resulting symbolic schedules, thus making each schedule fully predictable. At runtime, once the size of the processor array becomes known, a simple prolog selects the overall latency-optimal schedule that is then dynamically activated and executed on the processor array. Hence, our approach avoids any further runtime optimization and expensive re-compilations while achieving the same results as computing an optimal static schedule for each possible combination of array and problem size.
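The defining property of an LPGS mapping is that iterations within one array-sized tile execute in parallel (one per processing element), while the tiles themselves are stepped through sequentially, so per-PE memory stays constant regardless of problem size. A one-dimensional sketch of this assignment (illustrative; the paper handles full n-dimensional loop nests):

```python
def lpgs_mapping(n, p):
    """Locally parallel, globally sequential (LPGS) mapping of a 1-D
    iteration space of size n onto p processing elements: iterations
    inside a tile of size p run in parallel (one per PE), and the
    ceil(n/p) tiles execute one after another.  Each PE only ever
    buffers data for one iteration at a time, which keeps its local
    memory independent of the problem size n."""
    schedule = {}
    for i in range(n):
        pe = i % p   # locally parallel: position inside the tile
        t = i // p   # globally sequential: tile (time step) index
        schedule[i] = (pe, t)
    return schedule

s = lpgs_mapping(10, 4)
print(s[0], s[5], s[9])  # (0, 0) (1, 1) (1, 2)
```

Growing n only adds more sequential time steps, never more local storage, which is exactly the property motivating LPGS over locally sequential, globally parallel schemes.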
PARS-Mitteilungen | 2013
Ericles Rodrigues Sousa; Alexandru Tanase; Vahid Lari; Frank Hannig; Jürgen Teich; Johny Paul; Walter Stechele; Manfred Kröhnert; Tamim Asfour
Optical flow is widely used in many applications of portable mobile devices and automotive embedded systems for determining the motion of objects in a visual scene. Also in robotics, it is used for motion detection, object segmentation, time-to-contact information, focus-of-expansion calculations, robot navigation, and automatic parking for vehicles. Like many other image processing algorithms, optical flow applies pixel operations repeatedly over whole image frames. Thus, it provides a high degree of fine-grained parallelism which can be efficiently exploited on massively parallel processor arrays. In this context, we propose to accelerate the computation of complex motion estimation vectors on programmable tightly-coupled processor arrays, which offer a high flexibility enabled by coarse-grained reconfiguration capabilities. A further novelty is that the degree of parallelism may be adapted to the number of processors that are available to the application. Finally, we present an implementation that (a) is 18 times faster than an FPGA-based soft processor implementation and (b) may be adapted to different QoS requirements, hence being more flexible than a dedicated hardware implementation.
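The fine-grained parallelism comes from the fact that the motion vector of every block (or pixel) can be computed independently, one per processing element. As a rough illustration of such a per-block kernel, here is a block-matching motion estimator using the sum of absolute differences (a deliberately simplified stand-in; the paper's algorithm computes dense optical flow, not this exact scheme):

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def block(img, y, x, k):
    """Extract the k x k block whose top-left corner is (y, x)."""
    return [row[x:x + k] for row in img[y:y + k]]

def motion_vector(prev, curr, y, x, k=2, search=1):
    """Best (dy, dx) displacement of the k x k block at (y, x) from
    `prev` to `curr` inside a small search window -- the independent
    kernel that each processing element would run in parallel."""
    ref = block(prev, y, x, k)
    in_bounds = [(dy, dx)
                 for dy in range(-search, search + 1)
                 for dx in range(-search, search + 1)
                 if 0 <= y + dy <= len(curr) - k
                 and 0 <= x + dx <= len(curr[0]) - k]
    return min(in_bounds,
               key=lambda d: sad(ref, block(curr, y + d[0], x + d[1], k)))

prev = [[0, 0, 0, 0], [0, 9, 9, 0], [0, 9, 9, 0], [0, 0, 0, 0]]
curr = [[0, 0, 0, 0], [0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0]]
print(motion_vector(prev, curr, 1, 1))  # (0, 1): the block moved right
```

Because each block's search is independent, the number of blocks processed concurrently scales directly with the number of PEs claimed, which is the adaptivity the paper exploits.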
Information Technology | 2016
Vahid Lari; Andreas Weichslgartner; Alexandru Tanase; Michael Witterauf; Faramarz Khosravi; Jürgen Teich; Jan Heißwolf; Stephanie Friederich; Jürgen Becker
As a consequence of technology scaling, today's complex multi-processor systems have become more and more susceptible to errors. In order to satisfy reliability requirements, such systems require methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, to provide fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on demand, according to application or environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations to automatically adopt well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel arrays of processors called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis guiding the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
adaptive hardware and systems | 2015
Michael Witterauf; Alexandru Tanase; Jürgen Teich; Vahid Lari; Andreas Zwinkau; Gregor Snelting
Fault tolerance is a basic necessity to make today's complex systems reliable. Adequate fault tolerance, however, demands a high degree of redundancy, possibly wasting resources when the fault probability is low or when some applications do not require fault tolerance. Under the term adaptive fault tolerance, we investigate means to instead provide on-demand fault tolerance on multi-core systems dynamically, according to application and environmental needs. Such means are provided on a per-application basis by invasive computing, a recent paradigm for resource-aware programming and design of parallel systems: applications request resources in an invade phase, infect the acquired resources with code and data, and finally release them in a retreat phase. We show how to use these simple but powerful constructs to adaptively tolerate faults and that invasive computing harmonizes well with many existing fault tolerance approaches. Finally, a case study on adaptively providing fault tolerance for loops demonstrates how effective invasive computing is at adapting to a varying soft error rate and at handling faults.
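The invade/infect/retreat protocol described above can be caricatured in a few lines. This is a hypothetical toy API, not the real invasive-computing interface, which is far richer and resource-aware; it only shows the shape of the three phases:

```python
class InvasiveApplication:
    """Toy sketch of the invade/infect/retreat phases (hypothetical API)."""

    def __init__(self, free_pes):
        self.free_pes = list(free_pes)  # processing elements still available
        self.claimed = []

    def invade(self, n):
        """Request n processing elements; the system may grant fewer
        under contention, and the application must adapt."""
        granted, self.free_pes = self.free_pes[:n], self.free_pes[n:]
        self.claimed = granted
        return len(granted)

    def infect(self, kernel, data):
        """Load code onto the claimed PEs and distribute the data."""
        k = len(self.claimed)
        chunks = [data[i::k] for i in range(k)]
        return [kernel(chunk) for chunk in chunks]  # conceptually in parallel

    def retreat(self):
        """Release the claimed PEs back to the system."""
        self.free_pes += self.claimed
        self.claimed = []

app = InvasiveApplication(free_pes=range(8))
n = app.invade(4)                       # for DMR/TMR, an application would
partial = app.infect(sum, range(100))   # simply invade 2x or 3x the PEs
app.retreat()
print(n, sum(partial))  # 4 4950
```

Adaptive fault tolerance falls out naturally: to raise the protection level when the soft error rate climbs, an application invades two or three times as many PEs and infects them with replicated code; when the rate drops, it retreats from the extra resources.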
signal processing systems | 2017
Vivek Singh Bhadouria; Alexandru Tanase; Moritz Schmid; Frank Hannig; Jürgen Teich; Dibyendu Ghoshal
Images are often corrupted with noise during the image acquisition and transmission stages. Here, we propose a novel approach for the reduction of random-valued impulse noise in images and its hardware implementation on various state-of-the-art FPGAs. The presented algorithm consists of two stages: the first stage detects whether pixels have been corrupted by impulse noise, and the second stage performs a filtering operation on the detected noisy pixels. Since the human visual system is sensitive to the presence of edges in an image, the filtering stage consists of an edge-preserving median filter which performs the filtering operation while preserving the underlying fine image features. Experimentally, it has been found that the proposed scheme yields a better Peak Signal-to-Noise Ratio (PSNR) than other existing median-based impulse noise filtering schemes. The algorithm is implemented using the high-level synthesis tool PARO as a highly parallel and deeply pipelined hardware design that simultaneously exploits loop-level as well as instruction-level parallelism, with a very short latency of only a few milliseconds for 16-bit images of size 512 × 512 pixels.
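The detect-then-filter structure can be sketched compactly. The detector and threshold below are illustrative placeholders, not the paper's tuned scheme; the point is that only flagged pixels are replaced, which is what preserves edges:

```python
def median(vals):
    """Median of an odd-length list of pixel values."""
    s = sorted(vals)
    return s[len(s) // 2]

def denoise(img, threshold=60):
    """Two-stage random-valued impulse noise reduction, following the
    structure described above.  Stage 1 flags a pixel as noisy if it
    deviates strongly from the median of its 3 x 3 neighbourhood;
    stage 2 replaces only flagged pixels by that median, leaving
    uncorrupted pixels (and thus edges) untouched.  Threshold and
    window size are illustrative, not the paper's values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            m = median(window)
            if abs(img[y][x] - m) > threshold:  # stage 1: impulse detector
                out[y][x] = m                   # stage 2: median replacement
    return out

img = [[10] * 5 for _ in range(5)]
img[2][2] = 255  # a random-valued impulse in a flat region
print(denoise(img)[2][2])  # 10: the impulse is removed
```

A plain median filter, by contrast, rewrites every pixel and blurs fine detail; restricting the replacement to detected impulses is what yields the better PSNR reported above.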