Vasilios I. Kelefouras
University of Patras
Publications
Featured research published by Vasilios I. Kelefouras.
The Journal of Supercomputing | 2015
Vasilios I. Kelefouras; Angeliki Kritikakou; Elissavet Papadima; Constantinos E. Goutis
In this paper, a new methodology for computing Dense Matrix–Vector Multiplication is presented, targeting both embedded processors (without a SIMD unit) and general-purpose processors (single- and multi-core, with a SIMD unit). This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedup from 1.2 up to 1.45). This is achieved by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of (i) the number of levels of data cache; (ii) the data cache sizes; (iii) the data cache associativities; (iv) the data cache and main memory latencies; (v) the data array layout of the matrix; and (vi) the number of cores.
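As a rough illustration of the cache-aware blocking this abstract alludes to, the following C sketch tiles the columns of A so that a slice of x stays resident in an assumed 32 KiB L1 data cache and is reused across every row; it is a minimal sketch of the idea, not the schedule the methodology generates, and the cache size and element type are assumptions.

```c
/* Illustrative only: block the columns of A so a slice of x stays in the
 * L1 data cache and is reused across every row of A.  Cache size and
 * element type are assumed; real schedules also depend on associativity,
 * latencies, layout and core count, as the abstract lists. */
#include <stddef.h>

#define L1_BYTES   (32 * 1024)          /* assumed L1 data cache size */
#define ELEM_BYTES sizeof(double)

void mvm_blocked(size_t n, size_t m,
                 const double *A,       /* n x m, row-major */
                 const double *x,       /* length m         */
                 double *y)             /* length n         */
{
    size_t jb = (L1_BYTES / 2) / ELEM_BYTES;   /* half the cache for x */
    if (jb == 0 || jb > m) jb = m;

    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t jj = 0; jj < m; jj += jb) {    /* one reusable slice of x */
        size_t jend = (jj + jb < m) ? jj + jb : m;
        for (size_t i = 0; i < n; ++i) {
            double acc = y[i];
            for (size_t j = jj; j < jend; ++j)
                acc += A[i * m + j] * x[j];
            y[i] = acc;
        }
    }
}
```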
ACM Computing Surveys | 2013
Angeliki Kritikakou; Francky Catthoor; Vasilios I. Kelefouras; Costas E. Goutis
The scheduling problem is an important, partially solved topic relevant to a wide range of scientific fields. As it applies to design-time mapping on multiprocessing platforms, with an emphasis on ordering in time and assignment in place, significant improvements can be achieved. To support this improvement, this article presents a complete systematic classification of the existing scheduling techniques that solve this problem in a (near-)optimal way. We show that the proposed approach covers any global scheduling technique, including future ones. In our systematic classification a technique may belong to one primitive class or to a hybrid combination of such classes. In the latter case the technique is decomposed into more primitive components, each belonging to a specific class. The systematic classification assists in the in-depth understanding of the diverse classes of techniques, which is essential for their further improvement. Their main characteristics and structure, their similarities and differences, and the interrelationships of the classes are made clear. In this way, our classification provides guidance for contributing in novel ways to the broad domain of global scheduling techniques.
The Journal of Supercomputing | 2014
Vasilios I. Kelefouras; Angeliki Kritikakou; Costas E. Goutis
In this paper, a new methodology for speeding up Matrix–Matrix Multiplication using the Single Instruction Multiple Data (SIMD) unit, on one or more cores sharing a cache, is presented. This methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedup from 1.08 up to 3.5) by decreasing the number of instructions (load/store and arithmetic) and the number of data cache accesses and misses in the memory hierarchy. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware parameters (e.g., data cache sizes and associativities) as one problem and not separately, giving high-quality solutions and a smaller search space.
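A minimal C sketch of the kind of blocking involved, assuming a 1 MiB shared cache and square row-major matrices; the tile size is chosen so three tiles fit in the cache and the contiguous inner loop is the one a compiler can map to the SIMD unit. It illustrates the idea only and is not the paper's generated code.

```c
/* Illustrative only: cache-blocked C += A*B with the tile size chosen so
 * that three T x T double tiles fit in an assumed 1 MiB shared cache. */
#include <stddef.h>

#define SHARED_CACHE_BYTES (1024 * 1024)   /* assumed shared cache size */

void mmm_blocked(size_t n, const double *A, const double *B, double *C)
{
    size_t T = 8;
    while (3 * (2 * T) * (2 * T) * sizeof(double) <= SHARED_CACHE_BYTES)
        T *= 2;                            /* largest power of two that fits */

    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                for (size_t i = ii; i < n && i < ii + T; ++i)
                    for (size_t k = kk; k < n && k < kk + T; ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < n && j < jj + T; ++j)
                            C[i * n + j] += a * B[k * n + j];  /* SIMD-friendly */
                    }
}
```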
The Journal of Supercomputing | 2016
Vasilios I. Kelefouras; Angeliki Kritikakou; Iosif Mporas; Vasileios Kolonias
Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are usually found by searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by reducing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately, according to the hardware architecture parameters and the input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
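To illustrate what deriving a parameter analytically (rather than searching for it) can look like, the C sketch below computes a tile size from assumed cache parameters. The constraint used, confining each array's tile to one cache way minus one line of slack so tiles of different arrays do not conflict, is a simplified stand-in for the paper's analysis, and all numbers are assumptions.

```c
/* Illustrative only: tile size derived from cache size, associativity and
 * line size instead of found empirically. */
#include <stdio.h>

static unsigned tile_elems(unsigned cache_bytes, unsigned assoc,
                           unsigned line_bytes, unsigned elem_bytes)
{
    unsigned way_bytes  = cache_bytes / assoc;     /* capacity of one way     */
    unsigned tile_bytes = way_bytes - line_bytes;  /* leave one line of slack */
    return tile_bytes / elem_bytes;                /* tile size in elements   */
}

int main(void)
{
    /* assumed 32 KiB, 8-way, 64-byte-line L1 data cache, double elements */
    printf("tile elements per array: %u\n",
           tile_elems(32 * 1024, 8, 64, (unsigned)sizeof(double)));
    return 0;
}
```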
The Journal of Supercomputing | 2014
Vasilios I. Kelefouras; Angeliki Kritikakou; Costas E. Goutis
In this paper, a new methodology for speeding up edge and line detection algorithms is presented. It achieves improved performance over the state-of-the-art OpenCV software library (speedup from 1.35 up to 2.22) and other conventional implementations, on both general-purpose and embedded processors, by reducing the number of load/store and arithmetic instructions, the number of data cache accesses and data cache misses in the memory hierarchy, and the algorithm's memory size. This is achieved by fully exploiting the combination of software and hardware parameters, which are considered simultaneously as one problem and not separately. Furthermore, the edge and line detection algorithms have been simplified for a computer vision application (detection and measurement of flow fronts in a microfluidic device) on a Virtex-5 Xilinx FPGA using the MicroBlaze soft processor, achieving a speedup of up to 660 times in comparison with conventional software implementations.
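One way load/store instructions and cache traffic can be reduced in such kernels is by fusing passes so each pixel neighbourhood is loaded once. The C sketch below fuses the horizontal and vertical Sobel gradients into a single pass; it shows the flavour of the optimization only and is not the paper's implementation.

```c
/* Illustrative only: a fused Sobel pass that loads each 3x3 neighbourhood
 * once and produces both gradients, roughly halving the loads compared with
 * running the two kernels separately. */
#include <stdint.h>

void sobel_fused(int w, int h, const uint8_t *in, int16_t *gx, int16_t *gy)
{
    for (int y = 1; y + 1 < h; ++y) {
        for (int x = 1; x + 1 < w; ++x) {
            const uint8_t *p = in + (long)y * w + x;
            /* the neighbourhood is loaded once and reused by both kernels */
            int tl = p[-w - 1], tc = p[-w], tr = p[-w + 1];
            int ml = p[-1],                 mr = p[1];
            int bl = p[ w - 1], bc = p[ w], br = p[ w + 1];
            gx[(long)y * w + x] = (int16_t)((tr + 2 * mr + br) - (tl + 2 * ml + bl));
            gy[(long)y * w + x] = (int16_t)((bl + 2 * bc + br) - (tl + 2 * tc + tr));
        }
    }
}
```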
ACM Transactions on Architecture and Code Optimization | 2014
Angeliki Kritikakou; Francky Catthoor; Vasilios I. Kelefouras; Costas E. Goutis
Memory management searches for the resources required to store the concurrently alive elements. The solution quality is affected by the representation of the element accesses: a sub-optimal representation leads to overestimation, and a non-scalable representation increases the exploration time. We propose a methodology for a near-optimal and scalable representation of regular and irregular accesses. The representation consists of a set of pattern entries, which compactly describe the behavior of the memory accesses, and of pattern operations, which consistently combine the pattern entries. The result is a final sequence of pattern entries that represents the global access scheme without unnecessary overestimation.
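As a purely hypothetical sketch of what a compact, pattern-based access representation can look like (the field names and structure here are illustrative and not the paper's notation), one entry can encode a whole strided run of accesses, so a trace is described by a short sequence of entries instead of a list of every index:

```c
/* Hypothetical sketch of a pattern-entry style representation. */
#include <stdio.h>

typedef struct {
    long base;    /* first accessed index          */
    long stride;  /* distance between accesses     */
    long count;   /* number of accesses in the run */
} pattern_entry;

static void expand(const pattern_entry *e)   /* e.g. for validation */
{
    for (long i = 0; i < e->count; ++i)
        printf("%ld ", e->base + i * e->stride);
    printf("\n");
}

int main(void)
{
    /* accesses A[2], A[5], A[8], A[11] collapse into a single entry */
    pattern_entry e = { 2, 3, 4 };
    expand(&e);
    return 0;
}
```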
ACM Transactions on Design Automation of Electronic Systems | 2013
Angeliki Kritikakou; Francky Catthoor; Vasilios I. Kelefouras; Costas E. Goutis
Storage-size management techniques aim to reduce the resources required to store elements while providing efficient addressing during element accesses. Existing techniques are less appropriate for large iteration spaces with increased numbers of irregularly spread holes: they either have to approximate the accessed regions, leading to overestimation of the final resources, or they require prohibitive exploration time to find the storage size. In this work, we present a near-optimal and scalable methodology for storage-size, intra-signal, in-place optimization, that is, for computing the minimum amount of resources required to store the elements of a group (array), for irregular and complex access schemes in the target domain of non-overlapping store and load accesses.
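The following C sketch illustrates what intra-signal in-place storage buys once the minimum number of simultaneously alive elements is known: a small buffer with modulo addressing replaces the full logical array. The value of LIVE is simply assumed here; determining it for irregular access schemes is what the paper addresses.

```c
/* Illustrative only: in-place storage via modulo addressing. */
#include <stdio.h>

#define N    1000   /* logical array size                               */
#define LIVE 4      /* assumed maximum of simultaneously alive elements */

int main(void)
{
    double buf[LIVE];   /* physical storage instead of a full double[N] */
    double sum = 0.0;

    for (int i = 0; i < N; ++i) {
        buf[i % LIVE] = 0.5 * i;                 /* produce A[i]            */
        if (i >= LIVE - 1)
            sum += buf[(i - LIVE + 1) % LIVE];   /* consume A[i-LIVE+1] before
                                                    its slot is overwritten  */
    }
    printf("sum = %f\n", sum);
    return 0;
}
```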
Journal of Signal Processing Systems | 2014
Vasilios I. Kelefouras; Angeliki Kritikakou; Konstantinos Siourounis; Costas E. Goutis
The Matrix–Vector Multiplication (MVM) algorithm is an important kernel in a wide variety of domains and application areas, and the performance of its implementations depends heavily on memory utilization and data locality. In this paper, a new methodology for MVM covering different types of matrices, i.e., regular, Toeplitz, and bisymmetric Toeplitz, is presented in detail. This methodology achieves higher execution speed than the state-of-the-art ATLAS software library (speedup from 1.2 up to 4.4) and other conventional software implementations, for both general-purpose processors (using the SIMD unit) and embedded processors. This is achieved by fully and simultaneously exploiting the combination of software and hardware parameters as one problem and not separately.
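To see why the matrix type matters for memory utilization, the plain C sketch below exploits the Toeplitz structure: A[i][j] depends only on i-j, so the kernel loads a vector of 2n-1 values instead of an n x n array, which changes the footprint and data reuse entirely. This is an illustration, not the paper's tuned implementation.

```c
/* Illustrative only: Toeplitz MVM using the 2n-1 defining values. */
#include <stddef.h>

/* t has length 2n-1 and t[n-1+k] holds the diagonal value for i-j == k */
void toeplitz_mvm(size_t n, const double *t, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < n; ++j)
            acc += t[n - 1 + i - j] * x[j];   /* A[i][j] == t[n-1 + (i-j)] */
        y[i] = acc;
    }
}
```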
IEEE Computer Society Annual Symposium on VLSI | 2017
Vasilios I. Kelefouras; Georgios Keramidas; Nikolaos S. Voros
In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework for fine-tuning these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by one order of magnitude while keeping the number of arithmetic/addressing instructions at a minimal level. We also present a search space exploration analysis in which our proposal offers a vast reduction in the required search space.
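A minimal C sketch of the interplay studied here, under the assumption of a way-partitioned shared cache: when a kernel is assigned only some of the ways, its tile size should be derived from that partition rather than from the whole cache. The partitioning mechanism itself is outside the sketch, and all numbers are assumptions.

```c
/* Illustrative only: tile size derived from an assumed cache partition. */
#include <stdio.h>

static unsigned tile_elems_for_partition(unsigned cache_bytes, unsigned total_ways,
                                         unsigned ways_assigned, unsigned elem_bytes)
{
    unsigned partition_bytes = (cache_bytes / total_ways) * ways_assigned;
    return partition_bytes / (3 * elem_bytes);   /* three arrays share the partition */
}

int main(void)
{
    /* assumed 2 MiB, 16-way shared cache, 6 ways assigned to this kernel */
    printf("tile elements: %u\n",
           tile_elems_for_partition(2 * 1024 * 1024, 16, 6, (unsigned)sizeof(double)));
    return 0;
}
```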
Computing | 2017
Vasilios I. Kelefouras
Today’s compilers have a plethora of optimizations/transformations to choose from, and the correct choice, order, and parameters of those transformations have a significant impact on performance; choosing the correct order and parameters of optimizations has been a long-standing problem in compilation research that remains unsolved. Optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, as refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques, but the search space is so large that it cannot be searched even by modern supercomputers. Moreover, compiler transformations do not take the hardware architecture details and data reuse into account in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution can now be found. The transformations are the following: loop tiling (including the number of levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange, and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, and (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated against iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time.
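A toy C sketch of the search-space pruning idea: instead of benchmarking every (tile size, unroll factor) pair, only candidates consistent with assumed hardware parameters are kept. The two constraints below are deliberately simplistic placeholders for the paper's analysis; cache size and register count are assumptions.

```c
/* Illustrative only: pruning an iterative-compilation search space with
 * hardware-derived constraints. */
#include <stdio.h>

#define L1_BYTES (32 * 1024)   /* assumed L1 data cache size      */
#define NUM_REGS 16            /* assumed architectural registers */

int main(void)
{
    int kept = 0, total = 0;
    for (int tile = 8; tile <= 1024; tile *= 2) {
        for (int unroll = 1; unroll <= 16; unroll *= 2) {
            ++total;
            /* constraint 1: two double tiles must fit in the L1 data cache */
            if (2 * tile * tile * (int)sizeof(double) > L1_BYTES) continue;
            /* constraint 2: the unrolled body must fit in the register file */
            if (2 * unroll + 2 > NUM_REGS) continue;
            printf("keep tile=%d unroll=%d\n", tile, unroll);
            ++kept;
        }
    }
    printf("%d of %d candidates survive\n", kept, total);
    return 0;
}
```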