Konstantinos Nikas

National Technical University of Athens

Publication

Featured research published by Konstantinos Nikas.

international conference on embedded computer systems: architectures, modeling, and simulation | 2008

An adaptive bloom filter cache partitioning scheme for multicore architectures

Konstantinos Nikas; Matthew Horsnell; Jim D. Garside

This paper investigates the problem of partitioning the last-level shared cache of multicore architectures. Contention for such a shared resource has been shown to severely degrade performance when running multiple applications. As architectures incorporate more cores, multiple application workloads become increasingly attractive, further exacerbating contention at the last-level cache. Today, cache replacement policies, extensively studied for uniprocessor systems, are being employed within new multicore architectures with little, if any, adaptation. However, the parameters in these new systems are likely to be different. The least recently used (LRU) policy, for example, which is widely accepted as the best replacement policy in uniprocessor caches, often results in poor resource sharing in a multicore system, signalling the importance of reevaluating the effectiveness of these policies in the new architectures. This paper proposes adaptive bloom filter cache partitioning (ABFCP), a low-cost, dynamic cache partitioning mechanism capable of better resource sharing at the last-level cache than LRU, improving the performance of an eight-core system on average by 5.92% over the LRU policy. Moreover, the proposed scheme provides the equivalent performance benefits that could be gained from almost a 50% increase in the last-level cache and shows increasing benefit as the number of cores rises.

international parallel and distributed processing symposium | 2009

Early experiences on accelerating Dijkstra's algorithm using transactional memory

Nikos Anastopoulos; Konstantinos Nikas; Georgios I. Goumas; Nectarios Koziris

In this paper we use Dijkstra's algorithm as a challenging, hard-to-parallelize paradigm to test the efficacy of several parallelization techniques in a multicore architecture. We consider the application of Transactional Memory (TM) as a means of concurrent accesses to shared data and compare its performance with straightforward parallel versions of the algorithm based on traditional synchronization primitives. To increase the granularity of parallelism and avoid excessive synchronization, we combine TM with Helper Threading (HT). Our simulation results demonstrate that the straightforward parallelization of Dijkstra's algorithm with traditional locks and barriers has, as expected, disappointing performance. On the other hand, TM by itself is able to provide some performance improvement in several cases, while the version based on TM and HT exhibits a significant performance improvement that can reach up to a speedup of 1.46.

international conference on parallel architectures and compilation techniques | 2014

LCA: a memory link and cache-aware co-scheduling approach for CMPs

Alexandros-Herodotos Haritatos; Georgios I. Goumas; Nikos Anastopoulos; Konstantinos Nikas; Kornilios Kourtis; Nectarios Koziris

This paper presents LCA, a memory Link and Cache-Aware co-scheduling approach for CMPs. It is based on a novel application classification scheme that monitors resource utilization across the entire memory hierarchy from main memory down to CPU cores. This enables us to predict application interference accurately and support a co-scheduling algorithm that outperforms state-of-the-art scheduling policies both in terms of throughput and fairness. As LCA depends on information collected at runtime by existing monitoring mechanisms of modern processors, it can be easily incorporated in real-life co-scheduling scenarios with various application features and platform configurations.

International Journal of Numerical Methods for Heat & Fluid Flow | 2006

The computation of flow and heat transfer through an orthogonally rotating square-ended U-bend, using low-Reynolds-number models

Konstantinos Nikas; Hector Iacovides

Purpose – To assess how effectively two‐layer and low‐Reynolds‐number models of turbulence, at effective viscosity and second‐moment closure level, can predict the flow and thermal development through orthogonally rotating U‐bends. Design/methodology/approach – Heat and fluid flow computations through a square‐ended U‐bend that rotates about an axis normal to both the main flow direction and also the axis of curvature have been carried out. Two‐layer and low‐Reynolds‐number mathematical models of turbulence are used at effective viscosity (EVM) level and also at second‐moment‐closure (DSM) level. In the two‐layer models the dissipation rate of turbulence in the near‐wall regions is obtained from the wall distance, while in the low‐Re models the transport equation for the dissipation rate is extended right up to the walls. Moreover, two length‐scale correction terms to the dissipation rate of turbulence are used with the low‐Re models, the original Yap term and a differential form that does not require the w...

international parallel and distributed processing symposium | 2012

An Approach to Parallelize Kruskal's Algorithm Using Helper Threads

Anastasios Katsigiannis; Nikos Anastopoulos; Konstantinos Nikas; Nectarios Koziris

In this paper we present a Helper Threading scheme used to efficiently parallelize Kruskal's Minimum Spanning Forest algorithm. This algorithm is known for exhibiting inherently sequential characteristics. More specifically, the strict order by which the algorithm checks the edges of a given graph is the main reason behind the lack of explicit parallelism. Our proposed scheme attempts to overcome the imposed restrictions and improve the performance of the algorithm. The results show that for a wide range of graphs of varying structure, size and density the parallelization of Kruskal's algorithm is feasible. Observed speedups reach up to 5.5 for 8 running threads, revealing the potential of our approach.
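
For context, the strict edge ordering mentioned in this abstract is easy to see in a plain sequential version of Kruskal's algorithm. The sketch below is the textbook baseline with a union-find structure, not the helper-threading scheme from the paper. The function names and the `(weight, u, v)` edge format are assumptions made for illustration.

```python
def find(parent, v):
    """Return the representative of v's component, halving paths on the way."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def kruskal_msf(num_vertices, edges):
    """Sequential Kruskal; edges are (weight, u, v) tuples (assumed format)."""
    parent = list(range(num_vertices))
    forest = []
    # Edges are examined in nondecreasing weight order. Whether an edge is
    # accepted depends on every earlier decision, which is the sequential
    # dependency the abstract describes.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:          # edge joins two different components
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest


edges = [(3, 0, 2), (1, 0, 1), (4, 2, 3), (2, 1, 2)]
forest = kruskal_msf(4, edges)
total_weight = sum(w for w, _, _ in forest)
```

In this small graph the edge `(3, 0, 2)` is rejected because vertices 0 and 2 are already connected by cheaper edges, a decision that can only be made after the earlier edges have been processed.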

parallel, distributed and network-based processing | 2016

Massively Concurrent Red-Black Trees with Hardware Transactional Memory

Dimitris Siakavaras; Konstantinos Nikas; Georgios I. Goumas; Nectarios Koziris

Hardware Transactional Memory (HTM) is nowadays available in several commercial and HPC targeted processors and in the future it will likely be available on systems that can accommodate a very large number of threads. Thus, it is essential for the research community to focus on evaluating HTM on as many cores as possible in order to understand the virtues and limitations that come with it. In this paper we utilize HTM to parallelize accesses on a classic data structure, a red-black tree. With minimal programming effort, we implement a red-black tree by enclosing each operation in a single HTM transaction and evaluate it on two servers equipped with Intel Haswell-EP and IBM Power8 processors, supporting a large number of hardware threads, namely 56 and 160 respectively. Our evaluation reveals that applying HTM in such a simplistic manner scales only up to a limited number of hardware threads. To fully utilize the underlying hardware we apply different optimizations on each platform.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study

Jan Christian Meyer; Juan M. Cebrian; Lasse Natvig; Vasileios Karakasis; Dimitris Siakavaras; Konstantinos Nikas

In this paper, we apply a method for extracting a running power estimate of applications from hardware performance counters, producing power/time curves which can be integrated over particular intervals to estimate the energy consumption of individual application stages. We use this method to instrument executions of a conjugate gradient solver, to examine the energy and performance impacts of applying the Compressed Sparse eXtended (CSX) and classic Compressed Sparse Row (CSR) matrix compression methods to sparse linear systems from different application areas. The CSX format requires a preprocessing stage which identifies and exploits a range of matrix substructures, incurring a one-time cost which can facilitate more effective sparse matrix-vector multiplication (SpMV). As this numerical kernel is the primary performance bottleneck of conjugate gradient solvers, we take the approach of isolating the energy cost of preprocessing from a short sample of application iterations, obtaining measurements which inform the choice of which compression scheme is more appropriate to the input data. We examine the impact that variable degrees of parallelism, processor clock frequency, and Hyper-Threading have on this trade-off. Our results include comparisons of empirically obtained results from all combinations of up to 8 threads on 4 hyper-threaded cores, 3 clock frequencies, and 5 sample application matrices. We assess program-hardware interactions with views to structural properties of the data and hardware architectural features, and evaluate the approach with respect to integrating the energy instrumentation with present automatic performance tuning. Results show that our method is sufficiently precise to identify non-trivial tradeoffs in the parameter space, and may become suitable for a run-time automatic tuning scheme by applying a faster preprocessing mode of CSX.
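
As a rough illustration of integrating a power/time curve over an interval, the sketch below applies the trapezoidal rule to timestamped power samples. It is not the paper's instrumentation: the sample values are made up, the function name is hypothetical, and the interval bounds are assumed to coincide with sample timestamps.

```python
def energy_joules(samples, t_start, t_end):
    """Estimate energy (J) from (time_s, power_W) samples via the trapezoidal rule.

    Assumes samples are sorted by time and that t_start and t_end fall on
    sample timestamps, so no interpolation at the interval edges is needed.
    """
    window = [(t, p) for t, p in samples if t_start <= t <= t_end]
    return sum(
        (t1 - t0) * (p0 + p1) / 2.0
        for (t0, p0), (t1, p1) in zip(window, window[1:])
    )


# Hypothetical power trace: a low-power stage followed by a high-power stage.
samples = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0), (3.0, 20.0)]
whole_run = energy_joules(samples, 0.0, 3.0)
second_stage = energy_joules(samples, 1.0, 3.0)
```

Splitting the trace at stage boundaries in this way is what allows a one-time cost, such as a preprocessing stage, to be accounted for separately from the iterations that follow it.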

international symposium on circuits and systems | 2017

An efficient and fair scheduling policy for multiprocessor platforms

Theodoros Marinakis; Alexandros-Herodotos Haritatos; Konstantinos Nikas; Georgios I. Goumas; Iraklis Anagnostopoulos

Scheduling is a decision-making process that deals with the assignment of resources to tasks over given periods, aiming to optimize one or more objectives. Responsible for efficient distribution of the CPU time among the processes, the scheduler has become an essential part of computer systems. While applications run on neighboring cores of a many-core system, they compete with each other for the shared resources (cache, memory etc.). This contention can result in great performance degradation for the applications that are concurrently executed. For this reason, treating the cores of a many-core system as isolated and independent units is a very optimistic abstraction and can cause great problems to the objectives a scheduler tries to optimize. This paper presents a scheduler that focuses on improving the system's fairness by deciding the group of applications that will be executed together based on the progress they have made. Results show that the proposed scheduler achieves on average 86% fairness improvement compared to two state-of-the-art schedulers.

parallel computing | 2012

Using state-of-the-art sparse matrix optimizations for accelerating the performance of multiphysics simulations

Vasileios Karakasis; Georgios I. Goumas; Konstantinos Nikas; Nectarios Koziris; Juha Ruokolainen; Peter Råback

Multiphysics simulations are at the core of modern Computer Aided Engineering (CAE) allowing the analysis of multiple, simultaneously acting physical phenomena. These simulations often rely on Finite Element Methods (FEM) and the solution of large linear systems which, in turn, end up in multiple calls of the costly Sparse Matrix-Vector Multiplication (SpM×V) kernel. The major, and mostly inherent, performance problem of this kernel is its very low flop:byte ratio, meaning that the algorithm must retrieve a significant amount of data from the memory hierarchy in order to perform a useful operation.
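
To make the flop:byte argument concrete, here is a minimal sketch of the kernel on the standard Compressed Sparse Row (CSR) layout, together with a back-of-the-envelope ratio estimate. The function names are hypothetical, and the estimate assumes 8-byte values, 4-byte indices, and that the input and output vectors are each moved through memory exactly once, which is an optimistic simplification.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        # Each nonzero costs one multiply and one add (2 flops), but needs
        # its value, its column index, and an indirect load from x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


def flop_byte_ratio(nnz, n_rows, n_cols, value_bytes=8, index_bytes=4):
    """Rough flop:byte estimate under the assumptions stated above."""
    flops = 2 * nnz
    bytes_moved = (
        nnz * (value_bytes + index_bytes)   # values and column indices
        + (n_rows + 1) * index_bytes        # row pointers
        + n_cols * value_bytes              # input vector x, read once
        + n_rows * value_bytes              # output vector y, written once
    )
    return flops / bytes_moved


# 3x3 example: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
ratio = flop_byte_ratio(nnz=5, n_rows=3, n_cols=3)
```

Even ignoring the vectors and row pointers, each nonzero contributes 2 flops for at least 12 bytes of matrix data under these assumptions, so the ratio cannot exceed about 0.17 flop per byte.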

international conference on parallel processing | 2009