Nikos Anastopoulos
National Technical University of Athens
Publications
Featured research published by Nikos Anastopoulos.
The Journal of Supercomputing | 2009
Georgios I. Goumas; Kornilios Kourtis; Nikos Anastopoulos; Vasileios Karakasis; Nectarios Koziris
In this paper, we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, a fact that may lead to misguided, and thus unsuccessful, attempts at optimization. In order to gain insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. In addition, we investigate the parallel version of the kernel and report on the corresponding performance results and their relation to each architecture’s specific multithreaded configuration. Based on our experiments, we extract useful conclusions that can serve as guidelines for the optimization process of both single- and multithreaded versions of the kernel.
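Both the serial and the multithreaded kernels discussed here operate on a compressed sparse representation such as the widely used CSR format. As a point of reference, a minimal sketch of a row-parallel CSR SpMxV kernel is shown below (an illustrative OpenMP version with assumed array names, not the paper's code):

```c
/* Row-parallel CSR SpMxV: y = A * x. Rows are independent, so each thread
 * handles a contiguous block of rows and no synchronization on y is needed. */
#include <omp.h>

void spmv_csr_parallel(int n_rows,
                       const int *row_ptr,    /* n_rows + 1 entries            */
                       const int *col_idx,    /* column index of each nonzero  */
                       const double *values,  /* nonzero values                */
                       const double *x,       /* dense input vector            */
                       double *y)             /* dense output vector           */
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += values[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

Removing the pragma yields the serial baseline; the irregular accesses to x through col_idx are the main architectural stress point the experiments analyze.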
Parallel, Distributed and Network-Based Processing | 2008
Georgios I. Goumas; Kornilios Kourtis; Nikos Anastopoulos; Vasileios Karakasis; Nectarios Koziris
In this paper we revisit the performance issues of the widely used sparse matrix-vector multiplication (SpMxV) kernel on modern microarchitectures. Previous scientific work reports a number of different factors that may significantly reduce performance. However, the interaction of these factors with the underlying architectural characteristics is not clearly understood, a fact that may lead to misguided, and thus unsuccessful, attempts at optimization. In order to gain insight into the details of SpMxV performance, we conduct a suite of experiments on a rich set of matrices for three different commodity hardware platforms. Based on our experiments, we extract useful conclusions that can serve as guidelines for the subsequent optimization process of the kernel.
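For readers unfamiliar with the kernel, the small example below illustrates the compressed sparse row layout that the discussion above assumes; the concrete matrix and array names are illustrative, not taken from the paper:

```c
/* CSR (compressed sparse row) storage for the 3x3 matrix
 *   [ 5 0 0 ]
 *   [ 0 8 3 ]
 *   [ 0 0 6 ]
 * The kernel streams through values/col_idx once per multiplication and
 * accesses x irregularly through col_idx, which is the main source of the
 * performance issues analyzed above. */
static const double values[]  = { 5.0, 8.0, 3.0, 6.0 };
static const int    col_idx[] = { 0, 1, 2, 2 };
static const int    row_ptr[] = { 0, 1, 3, 4 };   /* n_rows + 1 entries */
```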
The Journal of Supercomputing | 2008
Evangelia Athanasaki; Nikos Anastopoulos; Kornilios Kourtis; Nectarios Koziris
Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that diversity of simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, as these separate threads tend to put pressure on the same architectural resources, no significant speedup can be observed. In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By obtaining knowledge of how such streams interact when executed simultaneously on the processor, and quantifying their presence within each application’s threads, we try to interpret the observed performance for each application when parallelized according to the aforementioned techniques. To strengthen this evaluation, we also present results gathered from the performance monitoring hardware of the processor.
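As an illustration of the speculative precomputation technique contrasted with TLP above, the minimal sketch below (assumed names, simplified synchronization; not the paper's implementation) shows a helper thread that runs a stripped-down "prefetch slice" of the worker's loop on the sibling hardware context, touching upcoming data so that the worker finds it in cache:

```c
/* The helper thread is meant to be created with pthread_create() and bound to
 * the sibling hardware context of the worker; it executes only the
 * address-generating slice of the loop and prefetches, never computing results. */
#include <pthread.h>
#include <stddef.h>

struct spr_ctx {
    const double *data;         /* array traversed by the worker        */
    const int    *index;        /* irregular index stream               */
    size_t        n;
    volatile size_t progress;   /* worker's current loop position       */
};

#define RUN_AHEAD 256           /* how far ahead the helper may run     */

static void *prefetch_slice(void *arg)     /* helper thread body */
{
    struct spr_ctx *c = arg;
    for (size_t i = 0; i < c->n; i++) {
        while (i > c->progress + RUN_AHEAD)
            ;                   /* throttle: do not run too far ahead   */
        __builtin_prefetch(&c->data[c->index[i]], 0 /* read */, 1);
    }
    return NULL;
}

static double worker(struct spr_ctx *c)    /* main computation thread */
{
    double sum = 0.0;
    for (size_t i = 0; i < c->n; i++) {
        sum += c->data[c->index[i]];       /* mostly hits in cache now */
        c->progress = i;
    }
    return sum;
}
```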
International Parallel and Distributed Processing Symposium | 2009
Nikos Anastopoulos; Konstantinos Nikas; Georgios I. Goumas; Nectarios Koziris
In this paper we use Dijkstra's algorithm as a challenging, hard-to-parallelize paradigm to test the efficacy of several parallelization techniques in a multicore architecture. We consider the application of Transactional Memory (TM) as a means of handling concurrent accesses to shared data and compare its performance with straightforward parallel versions of the algorithm based on traditional synchronization primitives. To increase the granularity of parallelism and avoid excessive synchronization, we combine TM with Helper Threading (HT). Our simulation results demonstrate that the straightforward parallelization of Dijkstra's algorithm with traditional locks and barriers has, as expected, disappointing performance. On the other hand, TM by itself is able to provide some performance improvement in several cases, while the version based on TM and HT exhibits a significant performance improvement, reaching a speedup of up to 1.46.
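To make the TM-protected step concrete, the sketch below (a minimal illustration using GCC's transactional memory extension, not the paper's code or the TM system it simulates) wraps an edge relaxation on the shared distance array in a transaction; both the main thread and the helper threads would issue such relaxations concurrently, with conflicting updates detected and resolved by the TM runtime:

```c
/* Compile with: gcc -fgnu-tm ...  (GCC's transactional memory support).
 * dist[] and the relax() signature are illustrative assumptions. */
#include <limits.h>

extern int dist[];                  /* shared tentative distances */

static void relax(int u, int v, int w)      /* edge (u, v) with weight w */
{
    __transaction_atomic {
        if (dist[u] != INT_MAX && dist[u] + w < dist[v])
            dist[v] = dist[u] + w;
    }
}
```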
International Conference on Parallel Architectures and Compilation Techniques | 2014
Alexandros-Herodotos Haritatos; Georgios I. Goumas; Nikos Anastopoulos; Konstantinos Nikas; Kornilios Kourtis; Nectarios Koziris
This paper presents LCA, a memory Link and Cache-Aware co-scheduling approach for CMPs. It is based on a novel application classification scheme that monitors resource utilization across the entire memory hierarchy from main memory down to CPU cores. This enables us to predict application interference accurately and support a co-scheduling algorithm that outperforms state-of-the-art scheduling policies both in terms of throughput and fairness. As LCA depends on information collected at runtime by existing monitoring mechanisms of modern processors, it can be easily incorporated in real-life co-scheduling scenarios with various application features and platform configurations.
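The sketch below illustrates the flavor of such a link- and cache-aware classification step; the class names, thresholds, and counter fields are assumptions for illustration rather than the model actually used by LCA (on Linux the counters would typically be obtained via perf_event_open()):

```c
/* Classify an application by its dominant memory-hierarchy resource and
 * prefer co-scheduling applications whose dominant resources differ. */
enum app_class { CLASS_CORE, CLASS_CACHE, CLASS_MEMORY_LINK };

struct counters {
    double llc_misses_per_kinstr;   /* last-level cache misses per 1000 instructions */
    double dram_gb_per_sec;         /* memory-link bandwidth consumed                */
};

static enum app_class classify(const struct counters *c)
{
    if (c->dram_gb_per_sec > 4.0)       /* saturates the memory link            */
        return CLASS_MEMORY_LINK;
    if (c->llc_misses_per_kinstr > 1.0) /* lives in the shared cache            */
        return CLASS_CACHE;
    return CLASS_CORE;                  /* core-bound, little memory pressure   */
}

static int good_pair(enum app_class a, enum app_class b)
{
    /* e.g. a memory-link-bound application next to a core-bound one */
    return a != b || a == CLASS_CORE;
}
```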
International Conference on Cluster Computing | 2009
Georgios I. Goumas; Nikos Anastopoulos; Nectarios Koziris; Nikolas Ioannou
In this paper we focus on optimizing performance in a cluster of Simultaneous Multithreading (SMT) processors connected with a commodity interconnect (e.g. Gbit Ethernet) by overlapping computation with communication. As a test case we consider the parallelized advection equation and discuss the steps that need to be followed to semantically allow overlapping to occur. We propose an implementation based on the concept of Helper Threading that distributes computation and communication across the two sibling threads of an SMT processor, thus creating an asymmetric pair of execution patterns in each hardware context. Our experimental results on an 8-node cluster interconnected with commodity Gbit Ethernet demonstrate that the proposed implementation is able to achieve substantial performance improvements that can exceed 20% in some cases, by efficiently utilizing the available resources of the SMT processors.
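A minimal sketch of the computation/communication split is shown below (illustrative names and structure, not the paper's implementation): a sibling thread drives the non-blocking halo exchange for the stencil while the main thread updates interior grid points, and boundary points are updated once the exchange has completed. It assumes an MPI library initialized with thread support, and update_interior/update_boundary are hypothetical helpers:

```c
#include <mpi.h>
#include <pthread.h>

struct halo_args {
    double *send_lo, *send_hi, *recv_lo, *recv_hi;
    int count, lo_rank, hi_rank;
};

static void *halo_exchange(void *p)     /* runs on the sibling SMT context */
{
    struct halo_args *a = p;
    MPI_Request req[4];
    MPI_Irecv(a->recv_lo, a->count, MPI_DOUBLE, a->lo_rank, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(a->recv_hi, a->count, MPI_DOUBLE, a->hi_rank, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(a->send_lo, a->count, MPI_DOUBLE, a->lo_rank, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(a->send_hi, a->count, MPI_DOUBLE, a->hi_rank, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    return NULL;
}

/* Per time step:
 *   pthread_create(&comm_thread, NULL, halo_exchange, &args);
 *   update_interior(grid);     // touches no halo cells, overlaps with comm
 *   pthread_join(comm_thread, NULL);
 *   update_boundary(grid);     // halos are now valid
 */
```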
International Parallel and Distributed Processing Symposium | 2012
Anastasios Katsigiannis; Nikos Anastopoulos; Konstantinos Nikas; Nectarios Koziris
In this paper we present a Helper Threading scheme used to efficiently parallelize Kruskal's Minimum Spanning Forest algorithm. This algorithm is known for exhibiting inherently sequential characteristics. More specifically, the strict order in which the algorithm checks the edges of a given graph is the main reason behind the lack of explicit parallelism. Our proposed scheme attempts to overcome the imposed restrictions and improve the performance of the algorithm. The results show that, for a wide range of graphs of varying structure, size and density, the parallelization of Kruskal's algorithm is feasible. Observed speedups reach up to 5.5 for 8 running threads, revealing the potential of our approach.
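For reference, the sequential core that such a scheme builds on can be sketched as follows (illustrative code with a simple union-find, not the paper's implementation); the comments indicate where helper threads would speculatively flag edges whose endpoints are already connected, so the main thread can discard them cheaply:

```c
#define MAX_VERTICES 4096            /* illustrative bound */

struct edge { int u, v; double w; };

static int parent[MAX_VERTICES];

static int find_root(int x)          /* find with path halving */
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

static void kruskal(struct edge *edges, int n_edges, int n_vertices)
{
    for (int v = 0; v < n_vertices; v++)
        parent[v] = v;
    /* edges[] is assumed to be sorted by increasing weight */
    for (int i = 0; i < n_edges; i++) {
        /* edges already flagged as cycle-forming by helper threads
         * would be skipped here without touching the union-find */
        int ru = find_root(edges[i].u);
        int rv = find_root(edges[i].v);
        if (ru != rv) {
            parent[ru] = rv;         /* accept the edge into the forest */
            /* record edges[i] as part of the minimum spanning forest */
        }
    }
}
```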
Panhellenic Conference on Informatics | 2005
Evangelia Athanasaki; Kornilios Kourtis; Nikos Anastopoulos; Nectarios Koziris
Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, has been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost cache performance. The stability of the achieved performance improvements is heavily dependent on the appropriate selection of tile sizes. In this paper, we investigate the memory performance of blocked data layouts, and provide a theoretical analysis for the multiple levels of the memory hierarchy when they are organized in a set-associative fashion. According to this analysis, the optimal tile size that maximizes L1 cache utilization should completely fit in the L1 cache, even for loop bodies that access more than just one array. Increased self- and/or cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the CPU cycles lost to them. Results are validated through actual benchmarks on an SMT platform.
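The sketch below shows the kind of tiled loop nest this analysis applies to; the tile size is an illustrative value chosen so the tiles touched in the inner loops fit a typical 32 KB L1 data cache, not the paper's derived optimum:

```c
/* Tiled matrix multiplication C = A * B on square N x N matrices. */
#define N 1024
#define T 32      /* 3 tiles of 32*32 doubles = 24 KB, fits a 32 KB L1 */

static double A[N][N], B[N][N], C[N][N];

static void matmul_tiled(void)
{
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                /* work on one T x T tile of C at a time */
                for (int i = ii; i < ii + T; i++)
                    for (int k = kk; k < kk + T; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + T; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```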
International Conference on Parallel Processing | 2006
Evangelia Athanasaki; Nikos Anastopoulos; Kornilios Kourtis; Nectarios Koziris
Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. The speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. In this paper, we evaluate and contrast software prefetching and thread-level parallelism (TLP) techniques for a series of scientific codes executed on an SMT processor. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By obtaining knowledge of how such streams interact when executed simultaneously on the processor, and quantifying their presence within each application's threads, we try to interpret the observed performance for each application when parallelized according to the aforementioned techniques. To strengthen this evaluation, we also present results gathered from the performance monitoring hardware of the processor.
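As an illustration of the software prefetching evaluated here, the sketch below (assumed names and prefetch distance, not the paper's code) issues a prefetch for data needed a fixed number of iterations ahead, using GCC's __builtin_prefetch to hide part of the memory latency of an irregular access stream:

```c
#define DIST 16   /* prefetch distance in iterations; an assumed tuning knob */

static double dot_indexed(const double *x, const int *col_idx,
                          const double *values, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        if (j + DIST < n)
            __builtin_prefetch(&x[col_idx[j + DIST]], 0 /* read */, 1);
        sum += values[j] * x[col_idx[j]];
    }
    return sum;
}
```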
High Performance Computing and Communications | 2006
Evangelia Athanasaki; Nikos Anastopoulos; Kornilios Kourtis; Nectarios Koziris
Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that heterogeneity of simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. In this paper, we explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. We evaluate and contrast speculative precomputation (SPR) and thread-level parallelism (TLP) techniques for a series of scientific codes executed on an SMT processor. We also examine the effect of thread synchronization mechanisms on multithreaded parallel applications that are executed on a single SMT processor. To strengthen this evaluation, we also present results gathered from the performance monitoring hardware of the processor.
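To make the synchronization aspect concrete, the sketch below (an illustrative assumption, not the paper's mechanism) shows a lightweight spin-wait handshake between two SMT sibling threads; the pause hint releases shared pipeline resources to the sibling context while waiting, instead of burning issue slots in a tight loop:

```c
#include <stdatomic.h>
#include <xmmintrin.h>      /* _mm_pause() */

static atomic_int ready;    /* 0 = not ready, 1 = data published */

static void signal_sibling(void)
{
    atomic_store_explicit(&ready, 1, memory_order_release);
}

static void wait_for_sibling(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        _mm_pause();        /* yield shared resources to the sibling context */
}
```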