Publication


Featured research published by Costas Kyriacou.


IEEE Transactions on Parallel and Distributed Systems | 2006

Data-Driven Multithreading Using Conventional Microprocessors

Costas Kyriacou; Paraskevas Evripidou; Pedro Trancoso

This paper describes the data-driven multithreading (DDM) model and how it may be implemented using off-the-shelf microprocessors. Data-driven multithreading is a non-blocking multithreading execution model that tolerates internode latency by scheduling threads for execution based on data availability. Scheduling based on data availability can be used to exploit cache management policies that significantly reduce cache misses. Such policies include firing a thread for execution only if its data is already placed in the cache. We call this cache management policy the CacheFlow policy. The core of the DDM implementation presented is a memory-mapped hardware module, attached directly to the processor's bus, that is responsible for thread scheduling and is known as the thread synchronization unit (TSU). The evaluation of DDM was performed by simulating the data-driven network of workstations (D2NOW), a DDM implementation built out of regular workstations augmented with the TSU. The simulation covered nine scientific applications, seven of which belong to the SPLASH-2 suite. The results show that DDM tolerates both communication and synchronization latency well. Overall, for 16- and 32-node D2NOW machines the observed speedups were 14.4 and 26.0, respectively.
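The firing rule the TSU enforces, scheduling a thread only once all of its inputs have arrived, can be sketched in a few lines. This is an illustrative Python toy, not the paper's hardware design; the names `Thread`, `TSU` and `data_arrived` are ours:

```python
class Thread:
    """A DDM thread: becomes fireable when all of its inputs have arrived."""
    def __init__(self, tid, n_inputs):
        self.tid = tid
        self.ready_count = n_inputs   # inputs still to be produced

class TSU:
    """Toy thread synchronization unit: tracks data availability."""
    def __init__(self):
        self.threads = {}
        self.ready_queue = []         # threads whose inputs are all in place

    def add(self, thread):
        self.threads[thread.tid] = thread
        if thread.ready_count == 0:
            self.ready_queue.append(thread.tid)

    def data_arrived(self, tid):
        """A producer has written one input of thread `tid`."""
        t = self.threads[tid]
        t.ready_count -= 1
        if t.ready_count == 0:        # all inputs present: schedule it
            self.ready_queue.append(tid)

tsu = TSU()
tsu.add(Thread("sum", n_inputs=2))
tsu.data_arrived("sum")
tsu.data_arrived("sum")               # second input makes "sum" fireable
print(tsu.ready_queue)                # ['sum']
```

The hardware TSU performs this bookkeeping off the critical path, which is what decouples synchronization from computation.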


European Conference on Parallel Processing | 2004

CacheFlow: A short-term optimal cache management policy for Data Driven Multithreading

Costas Kyriacou; Paraskevas Evripidou; Pedro Trancoso

With Data Driven Multithreading, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Scheduling based on data availability may be used to exploit short-term optimal cache management policies. Such policies include firing a thread for execution only if its code and data are already placed in the cache. Furthermore, blocks associated with threads scheduled for execution in the near future are not replaced until those threads start executing. We call this short-term optimal cache management policy the CacheFlow policy.
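The extra firing condition CacheFlow adds on top of data availability can be sketched as follows. This is an illustrative Python sketch under our own naming (`cacheflow_select` is a hypothetical helper; the cache is modeled as a plain set of block addresses, not the paper's hardware):

```python
def cacheflow_select(ready_threads, cache, prefetch):
    """Pick the first data-ready thread whose working set is already cached;
    issue prefetch requests for the blocks the other threads still miss."""
    chosen = None
    for tid, blocks in ready_threads:
        missing = [b for b in blocks if b not in cache]
        if not missing and chosen is None:
            chosen = tid              # fire only fully cached threads
        else:
            prefetch.extend(missing)  # warm the cache for later firings
    return chosen

cache = {"A", "B"}                    # blocks currently resident
prefetch = []                         # prefetch requests issued so far
ready = [("t1", ["A", "C"]), ("t2", ["A", "B"])]
chosen = cacheflow_select(ready, cache, prefetch)
print(chosen, prefetch)               # t2 ['C']
```

Here `t1` is data-ready but misses block `C`, so `t2` fires first while `C` is prefetched for `t1`'s later firing.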


International Journal of High Performance Systems Architecture | 2007

Chip multiprocessor based on data-driven multithreading model

Kyriakos Stavrou; Costas Kyriacou; Paraskevas Evripidou; Pedro Trancoso

Although the dataflow model of execution, with its obvious benefits, was proposed long ago, it has not yet been successfully exploited. Nevertheless, as traditional systems have recently started to reach their limits in delivering higher performance, new models of execution that use dataflow-like concepts are being studied. Among these, Data-Driven Multithreading (DDM) is a multithreading model that effectively hides communication delays and synchronisation overheads. In DDM, threads are scheduled as soon as their input data has been produced, that is, in a dataflow-like way. In addition to presenting a motivation for the dataflow model of execution, this paper presents an overview of the DDM project. In particular, it focuses on the Chip Multiprocessor (CMP) implementation of the DDM model: its hardware, run-time system and performance evaluation. DDM-CMP inherits the benefits of both the DDM model, which overcomes the memory-wall limitation, and the CMP, which offers a simpler design, a higher degree of parallelism and greater power-performance efficiency, thereby overcoming the power wall. Preliminary experimental results show a significant benefit in terms of both speedup and power consumption, making DDM-CMP an attractive architecture for future processors.


International Journal of Parallel Programming | 2006

A case for chip multiprocessors based on the data-driven multithreading model

Pedro Trancoso; Paraskevas Evripidou; Kyriakos Stavrou; Costas Kyriacou

Current high-end microprocessors achieve high performance by adding ever more features and therefore increasing complexity. This paper makes the case for a Chip Multiprocessor based on the Data-Driven Multithreading (DDM-CMP) execution model as a way to overcome the limitations of current design trends. Data-Driven Multithreading (DDM) is a multithreading model that effectively hides communication delays and synchronization overheads. DDM-CMP avoids the complexity of other designs by combining simple commodity microprocessors with a small hardware overhead for thread scheduling and an interconnection network. Preliminary experimental results show that a DDM-CMP chip with the same hardware budget as a high-end commercial microprocessor, clocked at the same frequency, achieves a speedup of up to 18.5 while consuming 78–81% of the power of the commercial chip. Overall, the estimated results for the proposed DDM-CMP architecture show a significant benefit in terms of both speedup and power consumption, making it an attractive architecture for future processors.


2011 First Workshop on Data-Flow Execution Models for Extreme Scale Computing | 2011

Combining Compile and Run-Time Dependency Resolution in Data-Driven Multithreading

Samer Arandi; George Michael; Paraskevas Evripidou; Costas Kyriacou

Threaded Data-Flow systems schedule threads based on data availability, i.e. a thread can be scheduled for execution only after all its inputs have been generated by its producer threads. This requires that all data dependencies be resolved. Two approaches are typically used to resolve the dependencies: compile-time resolution, which is efficient but cannot handle programs with run-time-determined dependencies, and run-time resolution, which can handle run-time-determined dependencies but incurs run-time overheads even when part of the dependencies could have been determined at compile time. In this work, we combine the two approaches. The compiler (or the programmer) attempts to resolve all the dependencies and encodes them into the Data-Flow dependency graph. For any unresolved dependency, it generates a helper thread that resolves the dependency at run time and updates the graph accordingly with the help of I-Structures. Thus, it gains the benefits of both compile-time and run-time dependency resolution. This can also be used to improve programmability: when the programmer would otherwise have to resolve data dependencies manually, part of the dependency resolution can be deferred to run time. In this paper we describe our approach and present its implementation and evaluation on the Data-Driven Multithreading Virtual Machine (DDM-VM). The evaluation demonstrates that the overhead of run-time dependency resolution can increase execution time for small thread granularities, but it can be mostly eliminated as the thread granularity increases.
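The combined scheme can be sketched in miniature: statically known dependencies go straight into the graph, while each unresolved one is patched in by a helper thread at run time. This is an illustrative Python sketch under our own naming; real DDM-VM helpers coordinate through I-Structures rather than a dictionary:

```python
# Static dependencies are encoded up front by the compiler/programmer.
# t3's producers are unknown statically, so its entry is left open (None).
graph = {"t1": [], "t2": ["t1"], "t3": None}

def helper_resolve_t3(runtime_index):
    """Hypothetical helper thread: determines t3's producer at run time
    and patches the dependency graph so t3 can be scheduled normally."""
    graph["t3"] = [runtime_index["who_writes_t3_input"]]

# At run time the actual producer becomes known, and the helper fills it in.
helper_resolve_t3({"who_writes_t3_input": "t2"})
print(graph["t3"])   # ['t2']
```

Only `t3` pays any run-time resolution cost; `t1` and `t2` are scheduled from the statically built graph, which is the source of the hybrid scheme's efficiency.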


International Conference on Electronics, Circuits, and Systems | 2014

Low-cost fault-tolerant routing for regular topology NoCs

Konstantinos Tatas; S. Sawa; Costas Kyriacou

This paper presents a novel low-cost routing algorithm for regular (mesh) topology networks-on-chip. While deterministic NoC routing algorithms such as XY routing are still widely used, they can fail when a link or router in the NoC fails, temporarily or permanently, because they provide no adaptivity. However, switching to a topology-agnostic routing algorithm can be very costly in terms of performance and/or area. Instead, we propose switching the dimension order from XY to YX routing, and back, as needed. Preliminary experimental results show that this approach maintains the simplicity of dimension-order routing, and therefore its hardware efficiency, while greatly improving reachability. The approach can be combined with topology-agnostic approaches to reduce packet loss while the routing algorithm is being reconfigured.
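Why flipping the dimension order restores reachability can be seen from the paths themselves: XY and YX routes between the same pair of nodes traverse different links. A small Python sketch (`dor_route` is a hypothetical helper, not the paper's hardware router):

```python
def dor_route(src, dst, order="XY"):
    """Dimension-order routing on a 2D mesh: fully resolve one dimension,
    then the other. 'XY' and 'YX' visit different intermediate routers."""
    x, y = src
    dx, dy = dst
    hops = []
    for d in ("xy" if order == "XY" else "yx"):
        if d == "x":
            while x != dx:
                x += 1 if dx > x else -1
                hops.append((x, y))
        else:
            while y != dy:
                y += 1 if dy > y else -1
                hops.append((x, y))
    return hops

xy = dor_route((0, 0), (2, 1), "XY")   # [(1, 0), (2, 0), (2, 1)]
yx = dor_route((0, 0), (2, 1), "YX")   # [(0, 1), (1, 1), (2, 1)]
# If router (1, 0) fails, the XY path is blocked but the YX path avoids it:
print((1, 0) in xy, (1, 0) in yx)      # True False
```

Both orders remain deterministic and deadlock-free on their own, which is why the switch preserves the hardware simplicity of dimension-order routing.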


Panhellenic Conference on Informatics | 2001

Communication assist for data driven multithreading

Costas Kyriacou; Paraskevas Evripidou

Latency tolerance is one of the main concerns in parallel processing. Data Driven Multithreading, a technique that uses extra hardware to schedule threads for execution based on data availability, allows for better performance through latency tolerance. With Data Driven Multithreading, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Communication and synchronization are decoupled from the computation portions of a program, i.e. they execute asynchronously; thus, no synchronization or communication latencies are experienced. The processor can, though, be idle when there are no threads ready for execution; thus, communication latencies are difficult to hide completely in applications with high communication-to-computation ratios. This paper presents three mechanisms for implementing the communication assist of a Data Driven Multithreaded architecture. The first mechanism relies only on fine-grain communication, where each packet transfers a single value. With the second mechanism, the communication assist is modified to support block data communication over the same fine-grain interconnection network as the first configuration. The third mechanism employs a broadcast network such as Ethernet to transfer blocks of data, while fine-grain communication is handled the same way as in the other two mechanisms.
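The essential difference between the fine-grain and block-transfer mechanisms is how many packets moving a block of data costs. A deliberately tiny Python sketch under our own naming, not the paper's hardware interface:

```python
def fine_grain_packets(block):
    """Mechanism 1: each packet carries a single value."""
    return [("data", v) for v in block]

def block_packets(block):
    """Mechanisms 2/3: one packet carries the whole block payload."""
    return [("block", len(block), tuple(block))]

values = [1.0, 2.0, 3.0, 4.0]
print(len(fine_grain_packets(values)))   # 4
print(len(block_packets(values)))        # 1
```

Per-packet overhead thus grows linearly with block size under the fine-grain mechanism but stays constant under block transfer, which is what motivates the second and third mechanisms.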


DFM '13: Proceedings of the 2013 Workshop on Data-Flow Execution Models for Extreme Scale Computing | 2013

Data-Flow vs Control-Flow for Extreme Level Computing

Paraskevas Evripidou; Costas Kyriacou

This paper challenges the current thinking for building High Performance Computing (HPC) systems, which is based on the sequential (von Neumann) model of computing, by proposing novel systems based on the Dynamic Data-Flow model of computation. The switch to multi-core chips has brought parallel processing into the mainstream. The computing industry and research community were forced to make this switch because they hit the power and memory walls. Will the same happen with HPC? The United States, through its DARPA agency, commissioned a study in 2007 to determine what kinds of technologies would be needed to build an Exaflop computer. The head of the study was very pessimistic about the possibility of having an Exaflop computer in the foreseeable future. We believe that many of the findings that caused this pessimistic outlook were due to the limitations of the sequential model. A paradigm shift might be needed in order to achieve affordable Exascale-class supercomputers.


Parallel Processing Letters | 2006

CacheFlow: Cache Optimizations for Data Driven Multithreading

Costas Kyriacou; Paraskevas Evripidou; Pedro Trancoso

Data-Driven Multithreading is a non-blocking multithreading model of execution that provides effective latency tolerance by allowing the computation processor to do useful work while a long-latency event is in progress. With the Data-Driven Multithreading model, a thread is scheduled for execution only if all of its inputs have been produced and placed in the processor's local memory. Data-driven sequencing leads to irregular memory access patterns that could negatively affect cache performance. Nevertheless, it enables the implementation of short-term optimal cache management policies. This paper presents the implementation of CacheFlow, an optimized cache management policy that eliminates the side effects of the loss of locality caused by data-driven sequencing and further reduces cache misses. CacheFlow employs thread-based prefetching to preload the data blocks of threads deemed executable. Simulation results for nine scientific applications on a 32-node Data-Driven Multithreaded machine show an average speedup improvement from 19.8 to 22.6. Two techniques to further improve the performance of CacheFlow, conflict avoidance and thread reordering, are proposed and tested. Simulation experiments have shown speedup improvements of 24% and 32%, respectively. The average speedup for all applications on a 32-node machine with both optimizations is 26.1.


International Journal of Parallel Programming | 2018

Data-Driven Thread Execution on Heterogeneous Processors

Samer Arandi; George Matheou; Costas Kyriacou; Paraskevas Evripidou

In this paper we report our experience in implementing and evaluating the Data-Driven Multithreading (DDM) model on a heterogeneous multi-core processor. DDM is a non-blocking multithreading model that decouples the synchronization from the computation portions of a program, allowing them to execute asynchronously in a dataflow manner. Thread dependencies are determined by the compiler/programmer while thread scheduling is done dynamically at runtime based on data availability. The target processor for this implementation is the Cell processor. We call this implementation the Data-Driven Multithreading Virtual Machine for the Cell processor (DDM-
