Grigoris Dimitroulakos

Publication

Featured research published by Grigoris Dimitroulakos.

field-programmable custom computing machines | 2005

Accelerating applications by mapping critical kernels on coarse-grain reconfigurable hardware in hybrid systems

Michalis D. Galanis; Grigoris Dimitroulakos; Costas E. Goutis

In this paper, we propose a method for speeding up applications by partitioning them between reconfigurable hardware blocks of different granularity and mapping their critical parts onto the coarse-grain reconfigurable hardware. The partitioning method consists of four steps: intermediate representation creation, kernel identification, mapping onto the coarse-grain reconfigurable blocks, and mapping onto the FPGA hardware. The method is validated using five real-world applications, where the speedup relative to an all-FPGA solution ranges from 1.4 to 3.1.

The Journal of Supercomputing | 2006

Partitioning Methodology for Heterogeneous Reconfigurable Functional Units

Michalis D. Galanis; Grigoris Dimitroulakos; Costas E. Goutis

A partitioning methodology between reconfigurable hardware blocks of different granularity, which are embedded in a generic heterogeneous architecture, is presented. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain reconfigurable hardware is realized by a two-dimensional array of processing elements. Critical parts, called kernels, are mapped onto the coarse-grain reconfigurable logic to improve performance. The partitioning method consists of three main steps: the analysis of the input code, the mapping onto the coarse-grain reconfigurable array, and the mapping onto the FPGA. The partitioning flow is implemented by a prototype software framework. Analytical partitioning experiments, using five real-world applications, show that the execution time speedup relative to an all-FPGA solution ranges from 1.4 to 5.0.

Integration | 2005

A high-throughput, memory efficient architecture for computing the tile-based 2D discrete wavelet transform for the JPEG2000

Grigoris Dimitroulakos; Michalis D. Galanis; Athanasios Milidonis; Constantinos E. Goutis

In this paper, the design and implementation of a hardware architecture optimized for speed and memory requirements, which computes the tile-based 2D forward discrete wavelet transform for the JPEG2000 image compression standard, are described. The proposed architecture is based on a well-known architecture template for calculating the 2D forward discrete wavelet transform. This architecture is derived by replacing the filtering units with our previously published throughput-optimized ones and by developing a scheduling algorithm suited to the special features of our filtering units. The architecture exhibits high-performance characteristics due to the throughput-optimized filters. Also, the extra clock cycles required by the tile-based version of the discrete wavelet transform are partially compensated for by the proper scheduling of the filters. The developed scheduling algorithm results in reduced memory requirements compared with existing architectures.

international conference on digital signal processing | 2002

An efficient VLSI implementation for forward and inverse wavelet transform for JPEG2000

Grigoris Dimitroulakos; N.D. Zervas; Nicolas Sklavos; Costas E. Goutis

In this paper, a new methodology for building filters for the JPEG2000 standard is presented. The proposed filter architectures can be applied efficiently to both 1D forward and inverse wavelet transform implementations. They are characterized by reduced memory accesses, area, and power, and by the capability of progressive computation. The VLSI synthesis results show that the introduced architectures can be embedded in systems used in real-time applications.

Microprocessors and Microsystems | 2013

Compiling Scilab to high performance embedded multicore systems

Timo Stripf; Oliver Oey; Thomas Bruckschloegl; Juergen Becker; Gerard K. Rauwerda; Kim Sunesen; George Goulas; Panayiotis Alefragis; Nikolaos S. Voros; Steven Derrien; Olivier Sentieys; Nikolaos Kavvadias; Grigoris Dimitroulakos; Kostas Masselos; Dimitrios Kritharidis; Nikolaos Mitas; Thomas Perschke

The mapping process of high performance embedded applications to today's multiprocessor system-on-chip devices suffers from a complex toolchain and programming process. The problem is the expression of parallelism with a purely imperative programming language, which is commonly C. This traditional approach limits the mapping, partitioning and the generation of optimized parallel code, and consequently the achievable performance and power consumption of applications from different domains. The Architecture oriented paraLlelization for high performance embedded Multicore systems using scilAb (ALMA) European project aims to overcome these hurdles through the introduction and exploitation of a Scilab-based toolchain which enables the efficient mapping of applications on multiprocessor platforms from a high level of abstraction. The holistic solution of the ALMA toolchain allows the complexity of both the application and the architecture to be hidden, which leads to better acceptance, reduced development cost, and shorter time-to-market. Driven by the technology restrictions in chip design, the end of exponential growth of clock speeds and an unavoidably increasing demand for computing performance, ALMA is a fundamental step forward in the necessary introduction of novel computing paradigms and methodologies.

IEEE Transactions on Very Large Scale Integration Systems | 2007

Speedups and Energy Reductions From Mapping DSP Applications on an Embedded Reconfigurable System

Michalis D. Galanis; Grigoris Dimitroulakos; Costas E. Goutis

This paper presents performance improvements and energy savings from mapping real-world benchmarks on an embedded single-chip platform that couples coarse-grained reconfigurable logic with a microprocessor. The reconfigurable hardware is a 2-D array of processing elements connected by a mesh-like network. Analytical results derived from mapping seven real-life digital signal processing applications, with the aid of an automated design flow, on six different instances of the system architecture are presented. Significant overall application speedups relative to an all-software solution, ranging from 1.81 to 3.99, are reported; these are close to the theoretical speedup bounds. Additionally, the energy savings range from 43% to 71%. Finally, a comparison with a system coupling a microprocessor with a very long instruction word core shows that the microprocessor/coarse-grained reconfigurable array platform is more efficient in terms of performance and energy consumption.

computing frontiers | 2007

A unified evaluation framework for coarse grained reconfigurable array architectures

Grigoris Dimitroulakos; Michalis D. Galanis; Nikos Kostaras; Costas E. Goutis

The efficiency of a coarse-grained reconfigurable array architecture in terms of performance and hardware cost is hard to determine. The large number of parameters that define an architecture instance, together with the mapping complexity, makes the evaluation extremely difficult to accomplish without tool assistance. This paper investigates the four factors that are directly related to the efficiency of these architectures, namely the area, the clock frequency, the scheduling efficiency, and the performance. A unified exploration framework has been built for estimating the values of these four factors for different architecture alternatives. The exploration framework consists of two parts: (a) an existing retargetable compiler, from which the mapping efficiency is estimated, and (b) a parametric realization of the coarse-grained reconfigurable array in a hardware description language (VHDL). The latter is used to estimate the area and clock frequency of each architecture instance, with the system realized in a 0.13 µm ASIC process. The experiments cover architecture instances that differ in the processing elements' interconnection network, the register files' size, their number of input/output ports, and the available bandwidth. In total, 72 architecture scenarios have been studied, revealing how each characteristic influences performance and area so that design decisions can be made efficiently.

signal processing systems | 2005

Speedups from partitioning software kernels to FPGA hardware in embedded SoCs

Michalis D. Galanis; Grigoris Dimitroulakos; Athanasios P. Kakarountas; Costas E. Goutis

This paper presents a hardware/software partitioning methodology for improving performance in single-chip systems composed of a processor and reconfigurable logic. The reconfigurable logic is realized by field programmable gate array technology. Critical software parts are selected for acceleration on the reconfigurable logic. A generic hybrid system-on-chip platform, which can model the majority of existing processor-FPGA systems, is considered by the method. The partitioning method uses an automated kernel identification process at the basic-block level for detecting critical software portions. Three different instances of the generic platform and two sets of benchmarks are used in the experiments. The analysis of five real-life applications showed that these applications spend, on average, 69% of their instruction count in 11% of their code. The extensive experimentation illustrates that, for the systems built around 32-bit processors, the speedup of five applications ranges from 1.3 to 3.7 relative to an all-software solution. For a platform built around an 8-bit processor, the performance gains of eight DSP algorithms are considerably greater, since the average speedup equals 28.
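
The abstract does not spell out how kernels are selected, so the following is only a minimal sketch of one plausible reading of "kernel identification at the basic-block level": rank profiled basic blocks by dynamic instruction count and keep the hottest ones until a coverage target is reached. The names `BasicBlock`, `identify_kernels`, `code_fraction`, and the 0.69 default (borrowed from the reported average) are assumptions made for illustration, not the authors' algorithm or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicBlock:
    name: str        # hypothetical label for the block
    size: int        # static instructions in the block
    exec_count: int  # times the block ran during profiling

    @property
    def dynamic_instructions(self) -> int:
        # Contribution of this block to the total executed instruction count.
        return self.size * self.exec_count


def identify_kernels(blocks, coverage=0.69):
    """Greedily pick the hottest blocks until they account for `coverage`
    of the dynamic instruction count (assumed selection rule)."""
    total = sum(b.dynamic_instructions for b in blocks)
    if total == 0:
        return []
    kernels, covered = [], 0
    ranked = sorted(blocks, key=lambda b: b.dynamic_instructions, reverse=True)
    for block in ranked:
        if covered >= coverage * total:
            break
        kernels.append(block)
        covered += block.dynamic_instructions
    return kernels


def code_fraction(kernels, blocks):
    """Share of the static code size occupied by the selected kernels."""
    return sum(b.size for b in kernels) / sum(b.size for b in blocks)


if __name__ == "__main__":
    # Made-up profile data, for illustration only.
    profile = [
        BasicBlock("loop_body", 10, 1000),
        BasicBlock("init", 50, 10),
        BasicBlock("filter", 20, 100),
        BasicBlock("cleanup", 200, 1),
    ]
    hot = identify_kernels(profile)
    print([b.name for b in hot], code_fraction(hot, profile))
```

A greedy ranking is the simplest rule that reproduces the kind of statistic quoted in the abstract (a large share of executed instructions concentrated in a small share of the code); a real flow would also have to weigh whether each candidate block fits the reconfigurable logic.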

international parallel and distributed processing symposium | 2007

Speedups and Energy Savings of Microprocessor Platforms with a Coarse-Grained Reconfigurable Data-Path

Michalis D. Galanis; Grigoris Dimitroulakos; Costas E. Goutis

This paper presents the performance improvements and the energy reductions obtained by coupling a high-performance coarse-grained reconfigurable data-path with a microprocessor in a generic platform. The data-path has been previously introduced by the authors. It is composed of computational units able to realize complex operations, which help improve the performance of time-critical application parts, called kernels. A design flow is proposed for mapping high-level software descriptions to the microprocessor system. Eight real-life applications are mapped on three different instances of the system. Significant overall application speedups relative to a software-only solution, ranging from 1.74 to 3.94, are reported; these are close to the theoretical speedup bounds. Average energy savings of 59% are achieved, while the reduction in the system energy-delay product ranges from 66% to 92%.
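
The three metrics reported above are linked by a simple relation, written out here under the usual definition of the energy-delay product as energy multiplied by execution time (the abstract itself does not define it). The symbols $S$ for speedup and $e$ for fractional energy saving are introduced only for this illustration.

```latex
\begin{align*}
\mathrm{EDP} &= E \cdot T, \qquad
S = \frac{T_{\mathrm{sw}}}{T_{\mathrm{hw}}}, \qquad
e = 1 - \frac{E_{\mathrm{hw}}}{E_{\mathrm{sw}}} \\[4pt]
\frac{\mathrm{EDP}_{\mathrm{hw}}}{\mathrm{EDP}_{\mathrm{sw}}}
  &= \frac{E_{\mathrm{hw}}}{E_{\mathrm{sw}}} \cdot \frac{T_{\mathrm{hw}}}{T_{\mathrm{sw}}}
   = \frac{1 - e}{S} \\[4pt]
\text{EDP reduction} &= 1 - \frac{1 - e}{S}
\end{align*}
```

As an illustrative case with made-up values, $S = 2$ and $e = 0.5$ give an EDP reduction of $1 - 0.5/2 = 75\%$. The reported averages and ranges are per-application figures, so they cannot be combined through this formula directly.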

The Journal of Supercomputing | 2006

Performance improvements from partitioning applications to FPGA hardware in embedded SoCs

Michalis D. Galanis; Grigoris Dimitroulakos; Costas E. Goutis

A hardware/software partitioning methodology for improving performance in single-chip systems composed of a processor and Field Programmable Gate Array reconfigurable logic is presented. Speedups are achieved by executing critical software parts on the reconfigurable logic. A hybrid System-on-Chip platform, which can model the majority of existing processor-FPGA systems, is considered by the methodology. The partitioning method uses an automated kernel identification process at the basic-block level for detecting critical kernels in applications. Three different instances of the generic platform and two sets of benchmarks are used in the experimentation. The analysis of five real-life applications showed that these applications spend, on average, 69% of their instruction count in 11% of their code. The extensive experiments illustrate that, for the systems built around 32-bit processors, the improvements of five applications range from 1.3 to 3.7 relative to an all-software solution. For a platform built around an 8-bit processor, the performance gains of eight DSP algorithms are considerably greater, as the average speedup equals 28.
