Nikolaos Drosinos

National Technical University of Athens

Publication

Featured research published by Nikolaos Drosinos.

International Parallel and Distributed Processing Symposium | 2004

Performance comparison of pure MPI vs hybrid MPI-OpenMP parallelization models on SMP clusters

Nikolaos Drosinos; Nectarios Koziris

We compare the performance of three programming paradigms for the parallelization of nested loop algorithms onto SMP clusters. More specifically, we propose three alternative models for tiled nested loop algorithms: a pure message passing paradigm and two hybrid ones that implement communication through both message passing and shared memory access. The hybrid models adopt an advanced hyperplane scheduling scheme that allows both for minimal thread synchronization and for pipelined execution with overlapping of computation and communication phases. We focus on the experimental evaluation of all three models, and test their performance against several iteration spaces and parallelization grains with the aid of a typical microkernel benchmark. We conclude that the hybrid models can in some cases be more beneficial than the monolithic pure message passing model, as they better exploit the configuration characteristics of a hierarchical parallel platform, such as an SMP cluster.
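The hyperplane scheduling idea mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes unit dependencies between neighbouring tiles and the simple linear schedule t = sum of the tile coordinates; the function name `hyperplane_schedule` and the example grid are hypothetical and are not taken from the paper, whose actual scheme may differ.

```python
from itertools import product

def hyperplane_schedule(grid_shape):
    """Group tiles by time step, assuming step = sum of tile coordinates.

    Tiles that share a step lie on the same hyperplane and have no
    dependencies on each other under the unit-dependence assumption,
    so they could execute concurrently.
    """
    steps = {}
    for tile in product(*(range(n) for n in grid_shape)):
        steps.setdefault(sum(tile), []).append(tile)
    return [steps[t] for t in sorted(steps)]

# A 3 x 2 grid of tiles needs 3 + 2 - 1 = 4 wavefront steps.
schedule = hyperplane_schedule((3, 2))
for step, tiles in enumerate(schedule):
    print(step, tiles)
```

Under these assumptions every tile's predecessors sit on an earlier hyperplane, which is what allows the pipelined execution the abstract refers to.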

International Conference on Cluster Computing | 2002

Compiling tiled iteration spaces for clusters

G. Goumas; Nikolaos Drosinos; Maria Athanasaki; Nectarios Koziris

We present a complete end-to-end framework to automatically generate message-passing code for tiled iteration spaces. We consider general parallelepiped tiling transformations and general convex iteration spaces. We aim to address all problems concerning data parallel code generation efficiently by transforming the initial non-rectangular tile to a rectangular one. In this way, data distribution and communication become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code, and ran several experiments on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and confirm previous theoretical work on scheduling-optimal tile shapes.

ACM Symposium on Applied Computing | 2004

Automatic parallel code generation for tiled nested loops

Georgios I. Goumas; Nikolaos Drosinos; Maria Athanasaki; Nectarios Koziris

This paper presents an overview of our work on a complete end-to-end framework for automatically generating message passing parallel code for tiled nested for-loops. It considers general parallelepiped tiling transformations and general convex iteration spaces. We address all problems regarding both the generation of sequential tiled code and its parallelization. We have implemented our techniques in a tool which automatically generates MPI parallel code, and conducted several series of experiments concerning the compilation time of our tool, the efficiency of the generated code and the speedup attained on a cluster of PCs. Apart from confirming the value of our techniques, our experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal tile shapes.

IEEE Transactions on Parallel and Distributed Systems | 2009

Communication-Aware Supernode Shape

Georgios I. Goumas; Nikolaos Drosinos; Nectarios Koziris

In this paper we revisit the supernode-shape selection problem, which has been widely discussed in the literature. In general, the selection of the supernode transformation greatly affects the parallel execution time of the transformed algorithm. Since the minimization of the overall parallel execution time via an appropriate supernode transformation is very difficult to accomplish, researchers have focused on scheduling-aware supernode transformations that maximize parallelism during the execution. In this paper we argue that the communication volume of the transformed algorithm is an important criterion, and its minimization should be given high priority. For this reason we define the metric of the per-process communication volume and propose a method to minimize this metric by selecting a communication-aware supernode shape. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, which means that it can be incorporated in applications in a straightforward manner. Our experimental results illustrate that by selecting the tile shape with the proposed method, the total parallel execution time is significantly reduced due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.
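To make the link between communication volume and the Cartesian process grid concrete, the following Python sketch enumerates every grid shape for a fixed number of processes and picks the one with the smallest per-process communication volume. The cost model is an assumption for illustration only (each process owns a rectangular block and exchanges one full block face with its neighbour along every dimension that is actually split), and the names `factorizations`, `comm_volume` and `best_grid` are hypothetical; the paper's own metric and selection method are not reproduced here.

```python
from math import prod

def factorizations(nprocs, ndims):
    """Yield every ordered tuple of ndims positive integers whose product is nprocs."""
    if ndims == 1:
        yield (nprocs,)
        return
    for d in range(1, nprocs + 1):
        if nprocs % d == 0:
            for rest in factorizations(nprocs // d, ndims - 1):
                yield (d,) + rest

def comm_volume(space, grid):
    """Assumed model: sum of block-face sizes over the dimensions that are split."""
    block = [n / g for n, g in zip(space, grid)]
    total = 0.0
    for k, g in enumerate(grid):
        if g > 1:  # a neighbour exists along dimension k
            total += prod(block) / block[k]
    return total

def best_grid(space, nprocs):
    """Return the process grid with minimal assumed per-process communication volume."""
    return min(factorizations(nprocs, len(space)),
               key=lambda g: comm_volume(space, g))

space = (1024, 256)
print(best_grid(space, 64))          # grid chosen by the model
print(comm_volume(space, (8, 8)))    # square grid, for comparison
print(comm_volume(space, (16, 4)))
```

For this elongated iteration space the model prefers a 16 x 4 grid over the square 8 x 8 grid, because the resulting blocks are square and expose less surface. The chosen shape could then be passed as the `dims` argument when creating the Cartesian communicator.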

Parallel Computing | 2006

Message-passing code generation for non-rectangular tiling transformations

Georgios I. Goumas; Nikolaos Drosinos; Maria Athanasaki; Nectarios Koziris

Tiling is a well-known loop transformation used to reduce communication overhead in distributed memory machines. Although a lot of theoretical research has been done concerning the selection of proper tile shapes that reduce processor idle times, there is no complete approach to automatically parallelize non-rectangularly tiled iteration spaces, and consequently there are no actual experimental results to verify previous theoretical work on the effect of the tile shape on the overall completion time of a tiled algorithm. This paper presents a complete end-to-end framework to automatically generate message-passing code for tiled iteration spaces. It considers general parallelepiped tiling transformations and convex iteration spaces. We aim to address all problems concerning data parallel code generation efficiently by transforming the initial non-rectangular tile to a rectangular one. In this way, data distribution and the respective communication pattern become simple and straightforward. We have implemented our parallelizing techniques in a tool which automatically generates MPI code, and ran several benchmarks on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations, and verify previous theoretical work on scheduling-optimal, non-rectangular tile shapes.
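A parallelepiped tiling is commonly described by a matrix H whose rows are normal to the tile boundaries, so that an iteration point j belongs to the tile with coordinates floor(H j). The short Python sketch below applies that standard definition to a rectangular and a non-rectangular H. The matrices, the sample point and the helper name `tile_of` are illustrative assumptions, and the sketch does not reproduce the code generation framework of the paper.

```python
from fractions import Fraction as F
from math import floor

def tile_of(H, point):
    """Return the tile coordinates floor(H * point) for tiling matrix H."""
    return tuple(floor(sum(h * x for h, x in zip(row, point))) for row in H)

# Rectangular 4 x 4 tiles.
H_rect = [[F(1, 4), F(0)],
          [F(0),    F(1, 4)]]

# Non-rectangular (skewed) tiles: the second boundary family is tilted.
H_skew = [[F(1, 4),  F(0)],
          [F(-1, 4), F(1, 4)]]

point = (5, 6)
print(tile_of(H_rect, point))
print(tile_of(H_skew, point))
```

Exact fractions are used so that points lying on a tile boundary are not misplaced by floating-point rounding.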

International Parallel and Distributed Processing Symposium | 2007

Coarse-grain Parallel Execution for 2-dimensional PDE Problems

Georgios I. Goumas; Nikolaos Drosinos; Vasileios Karakasis; Nectarios Koziris

This paper presents a new approach for the execution of coarse-grain (tiled) parallel SPMD code for applications derived from the explicit discretization of 2-dimensional PDE problems with finite-differencing schemes. Tiling transformation is an efficient loop transformation to achieve coarse-grain parallelism in such algorithms, while rectangular tile shapes are the only feasible shapes that can be manually applied by program developers. However, rectangular tiling transformations are not always valid due to data dependencies, thus requiring the application of an appropriate skewing transformation prior to tiling in order to enable rectangular tile shapes. We employ cyclic mapping of tiles to processes and propose a method to determine an efficient rectangular tiling transformation for a fixed number of processes for 2-dimensional, skewed PDE problems. Our experimental results confirm the merit of coarse-grain execution in this family of applications and indicate that the proposed method leads to the selection of highly efficient tiling transformations.
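The reason skewing is needed before rectangular tiling can be shown with a small Python sketch. Rectangular tiling is legal when no dependence vector has a negative component, which is the standard validity condition for axis-aligned tiles. The dependence set below (a five-point stencil over a time dimension and two space dimensions) and the skewing matrix are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def is_rect_tileable(deps):
    """Rectangular tiling is valid if every dependence component is non-negative."""
    return all(c >= 0 for d in deps for c in d)

def transform(T, d):
    """Apply the integer matrix T to the dependence vector d."""
    return tuple(sum(T[r][c] * d[c] for c in range(len(d))) for r in range(len(T)))

# Assumed dependencies in (t, i, j) order for an explicit 2-D stencil.
deps = [(1, 0, 0), (1, -1, 0), (1, 1, 0), (1, 0, -1), (1, 0, 1)]

# Skew both space dimensions by the time dimension: (t, i, j) -> (t, i + t, j + t).
T = [[1, 0, 0],
     [1, 1, 0],
     [1, 0, 1]]

skewed = [transform(T, d) for d in deps]
print(is_rect_tileable(deps))    # False: negative components are present
print(skewed)
print(is_rect_tileable(skewed))  # True after skewing
```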

The Journal of Supercomputing | 2006

The Effect of Process Topology and Load Balancing on Parallel Programming Models for SMP Clusters and Iterative Algorithms

Nikolaos Drosinos; Nectarios Koziris

This article focuses on the effect of both process topology and load balancing on various programming models for SMP clusters and iterative algorithms. More specifically, we consider nested loop algorithms with constant flow dependencies, which can be parallelized on SMP clusters with the aid of the tiling transformation. We investigate three parallel programming models, namely a popular message passing monolithic parallel implementation, as well as two hybrid ones that employ both message passing and multi-threading. We conclude that the selection of an appropriate mapping topology for the mesh of processes has a significant effect on the overall performance, and provide an algorithm for the specification of such an efficient topology according to the iteration space and data dependencies of the algorithm. We also propose static load balancing techniques for the computation distribution between threads, which diminish the disadvantage of the master thread assuming all inter-process communication due to limitations often imposed by the message passing library. Both improvements are implemented as compile-time optimizations and are further experimentally evaluated. An overall comparison of the above parallel programming styles on SMP clusters, based on micro-kernel experimental evaluation, is also provided.

International Conference on Parallel Processing | 2005

Load balancing hybrid programming models for SMP clusters and fully permutable loops

Nikolaos Drosinos; Nectarios Koziris

This paper focuses on load balancing issues associated with hybrid programming models for the parallelization of fully permutable nested loops onto SMP clusters. Hybrid parallel programming models usually suffer from intrinsic load imbalance between threads, mainly because most existing message passing libraries provide limited multi-threading support, allowing only the master thread to perform internode message passing communication. In order to mitigate this effect, the authors proposed a generic method for the application of static load balancing on the coarse-grain hybrid model for the appropriate distribution of the computational load to the working threads. The efficiency of the proposed scheme was experimentally evaluated against a micro-kernel benchmark, and the results demonstrated the potential of such load balancing schemes for the extraction of maximum performance out of hybrid parallel programs.
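The idea of static load balancing between a communicating master thread and the remaining worker threads can be sketched in Python as follows. The cost model is a deliberately simple assumption (a fixed communication cost per tile paid only by the master, and a uniform cost per row of computation), and `balanced_split` is a hypothetical helper, not the method proposed in the paper.

```python
def balanced_split(rows, nthreads, comm_cost, row_cost):
    """Return rows per thread; thread 0 is the master and also pays comm_cost.

    Balancing condition under the assumed model:
        master_rows * row_cost + comm_cost == worker_rows * row_cost
    """
    if nthreads == 1:
        return [rows]
    comm_rows = comm_cost / row_cost  # communication expressed in row units
    master = round((rows - (nthreads - 1) * comm_rows) / nthreads)
    master = max(0, min(rows, master))
    # Spread the remaining rows as evenly as possible over the workers.
    base, extra = divmod(rows - master, nthreads - 1)
    workers = [base + (1 if i < extra else 0) for i in range(nthreads - 1)]
    return [master] + workers

split = balanced_split(rows=100, nthreads=4, comm_cost=20, row_cost=1)
print(split)
# Estimated busy time per thread under the assumed model.
print([split[0] * 1 + 20] + [r * 1 for r in split[1:]])
```

With these example numbers the master receives fewer rows, so that its computation plus communication time matches the computation time of each worker.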

International Parallel and Distributed Processing Symposium | 2006

Selecting the tile shape to reduce the total communication volume

Nikolaos Drosinos; Georgios I. Goumas; Nectarios Koziris

In this paper, we revisit the tile-shape selection problem, which has been extensively discussed in the literature. An efficient approach is proposed for the selection of a suitable tile shape, based on the minimization of the process communication volume. We consider the large family of applications that arise from the discretization of partial differential equations (PDEs). Practical experience has shown that for such applications and distributed memory architectures, minimizing the total communication volume is more important than minimizing the total number of parallel execution steps. We formulate a new method to determine an appropriate communication-aware tile shape, i.e. the one that reduces the communication volume for a fixed number of processes. Our approach is equivalent to defining a proper Cartesian process grid with MPI_Cart_create, which means that it can be incorporated in applications in a straightforward manner. Our experimental results illustrate that by selecting the tile shape with the proposed method, the total parallel execution time is significantly reduced due to the minimization of the communication volume, despite the fact that a few more parallel execution steps are required.

Lecture Notes in Computer Science | 2005

Computing frequent itemsets in parallel using partial support trees

Dora Souliou; Aris Pagourtzis; Nikolaos Drosinos

A key process in association rules mining, which has attracted a lot of interest during the last decade, is the discovery of frequent sets of items in a database of transactions. A number of sequential algorithms have been proposed that accomplish this task. In this paper we study the parallelization of the partial-support-tree approach (Goulbourne, Coenen, Leng, 2000). Results show that this method achieves a generally satisfactory speedup, and that it is particularly well suited to certain types of datasets.
