Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Christos Kotselidis is active.

Publication


Featured research published by Christos Kotselidis.


International Conference on Algorithms and Architectures for Parallel Processing | 2008

Lee-TM: A Non-trivial Benchmark Suite for Transactional Memory

Mohammad Ansari; Christos Kotselidis; Ian Watson; Chris C. Kirkham; Mikel Luján; Kim Jarvis

Transactional Memory (TM) is a concurrent programming paradigm that aims to make concurrent programming easier than fine-grain locking, whilst providing similar performance and scalability. Several TM systems have been made available for research purposes. However, there is a lack of a wide range of non-trivial benchmarks with which to thoroughly evaluate these TM systems. This paper introduces Lee-TM, a non-trivial and realistic TM benchmark suite based on Lee's routing algorithm. The benchmark suite provides sequential, lock-based, and transactional implementations to enable direct performance comparison. Lee's routing algorithm has several of the desirable properties of a non-trivial TM benchmark, such as large amounts of parallelism, complex contention characteristics, and a wide range of transaction durations and lengths. A sample evaluation shows unfavourable transactional performance and scalability compared to lock-based execution, in contrast to many published TM evaluations, and highlights the need for non-trivial TM benchmarks.
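
To make the benchmark structure concrete, the sketch below shows, under stated assumptions, the general shape of laying a routed path onto a shared grid, once behind a single coarse lock and once inside a hypothetical atomic block. It is not the Lee-TM source; the class, method, and atomic-block names are illustrative only.

    // Minimal sketch (not the Lee-TM sources): laying a routed path onto a shared
    // grid, once behind a single coarse lock and once inside a hypothetical
    // atomic block. Class and method names are illustrative only.
    import java.util.List;

    public class LeeSketch {
        static final int FREE = 0;

        // Coarse-grain locking: the whole grid is serialized while a path is laid.
        static boolean layRouteLocked(int[][] grid, List<int[]> path, int netId) {
            synchronized (grid) {
                for (int[] cell : path) {                    // give up if another net occupies a cell
                    if (grid[cell[0]][cell[1]] != FREE) return false;
                }
                for (int[] cell : path) {
                    grid[cell[0]][cell[1]] = netId;          // commit the whole route
                }
                return true;
            }
        }

        // Stand-in for a software TM: a real implementation would buffer reads and
        // writes, validate against concurrent transactions, and retry on conflict.
        interface Atomic { boolean run(); }
        static boolean atomic(Atomic body) { return body.run(); }

        // Transactional version: the same check-then-write sequence runs as one
        // transaction, so independent routes can be laid concurrently and only
        // genuinely conflicting ones are re-executed.
        static boolean layRouteTransactional(int[][] grid, List<int[]> path, int netId) {
            return atomic(() -> {
                for (int[] cell : path) {
                    if (grid[cell[0]][cell[1]] != FREE) return false;
                }
                for (int[] cell : path) {
                    grid[cell[0]][cell[1]] = netId;
                }
                return true;
            });
        }
    }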


International Parallel and Distributed Processing Symposium | 2010

Clustering JVMs with software transactional memory support

Christos Kotselidis; Mikel Luján; Mohammad Ansari; Konstantinos Malakasis; Behram Khan; Chris C. Kirkham; Ian Watson

Affordable transparent clustering solutions to scale non-HPC applications on commodity clusters (such as Terracotta) are emerging for Java Virtual Machines (JVMs). Working in this direction, we propose the Anaconda framework as a research platform to investigate the role Transactional Memory (TM) can play in this domain. Anaconda is a software transactional memory framework that supports clustering of multiple off-the-shelf JVMs on commodity clusters. The main focus of Anaconda is to investigate the implementation of Java synchronization primitives on clusters by relying on Transactional Memory. The traditional lock-based Java primitives are replaced by memory transactions, and the framework is responsible for ensuring transactional coherence. The contribution of this paper is to investigate which kind of TM coherence protocol can be used in this domain and to compare the Anaconda framework against the state-of-the-art Terracotta clustering technology. Furthermore, Anaconda tracks TM conflicts at object granularity and provides distributed object replication and caching mechanisms. It supports existing TM coherence protocols while adding a novel decentralized protocol. The performance evaluation compares Anaconda against three existing TM protocols; two of these are centralized, while the other is decentralized. In addition, we compare Anaconda against lock-based (coarse- and medium-grain) implementations of the benchmarks running on Terracotta. Anaconda's performance varies amongst benchmarks, outperforming existing TM protocols by 40 to 70%. Compared to Terracotta, Anaconda exhibits from a 19x speedup to a 10x slowdown depending on the benchmarks' characteristics.


Automation, Robotics and Control Systems | 2017

Boosting Java Performance Using GPGPUs

James Clarkson; Christos Kotselidis; Gavin Brown; Mikel Luján

In this paper we describe Jacc, an experimental framework which allows developers to program GPGPUs directly from Java. The goal of Jacc is to allow developers to benefit from using heterogeneous hardware whilst minimizing the amount of code refactoring required. Jacc utilizes two key abstractions: tasks, which encapsulate all the information needed to execute code on a GPGPU; and task graphs, which capture both inter-task control-flow and data dependencies. These abstractions enable the Jacc runtime system to automatically choreograph data movement and synchronization between the host and the GPGPU, eliminating the need to explicitly manage disparate memory spaces. We demonstrate the advantages of Jacc, both in terms of programmability and performance, by evaluating it against existing Java frameworks. Experimental results show an average performance speedup of 19x, using an NVIDIA Tesla K20m GPU, and a 4x decrease in code complexity when compared with writing multi-threaded Java code, across eight evaluated benchmarks.
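
As a rough illustration of the task and task-graph abstractions described above, the following toy sketch models tasks with explicit dependencies and executes them in dependency order on the host. None of these types are Jacc's real classes; a real runtime would additionally compile task bodies to GPU kernels and insert host-to-device copies along the dependency edges.

    // Illustrative only: a toy task/task-graph API in the spirit described above.
    // These are not Jacc's classes; they only show how inter-task dependencies
    // could let a runtime schedule work and data movement automatically.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class TaskGraphSketch {
        static final class Task {
            final String name;
            final Runnable body;
            final List<Task> deps = new ArrayList<>();
            Task(String name, Runnable body) { this.name = name; this.body = body; }
            Task dependsOn(Task t) { deps.add(t); return this; }
        }

        static final class TaskGraph {
            private final List<Task> tasks = new ArrayList<>();
            Task add(String name, Runnable body) {
                Task t = new Task(name, body);
                tasks.add(t);
                return t;
            }
            // A real runtime would compile task bodies to GPU kernels and insert
            // copy-in/copy-out operations along the dependency edges; here we just
            // run tasks in a dependency-respecting order on the host.
            void execute() {
                Set<Task> done = new HashSet<>();
                while (done.size() < tasks.size()) {
                    for (Task t : tasks) {
                        if (!done.contains(t) && done.containsAll(t.deps)) {
                            t.body.run();
                            done.add(t);
                        }
                    }
                }
            }
        }

        public static void main(String[] args) {
            float[] a = new float[1024], b = new float[1024];
            TaskGraph graph = new TaskGraph();
            Task init = graph.add("init", () -> Arrays.fill(a, 1.0f));
            graph.add("scale", () -> { for (int i = 0; i < a.length; i++) b[i] = 2 * a[i]; })
                 .dependsOn(init);
            graph.execute();
        }
    }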


International Conference on Parallel Architectures and Compilation Techniques | 2016

Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding

Bruno Bodin; Luigi Nardi; M. Zeeshan Zia; Harry Wagstaff; Govind Sreekar Shenoy; Murali Emani; John Mawer; Christos Kotselidis; Andy Nisbet; Mikel Luján; Björn Franke; Paul H. J. Kelly; Michael F. P. O'Boyle

System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers: we design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application could only run in real time on desktop GPUs. In this work, we examine how it can be mapped to power-constrained embedded systems. Key to our approach is the idea of incremental co-design exploration, where optimization choices that concern the domain layer are incrementally explored together with low-level compiler and architecture choices. The goal of this exploration is to reduce execution time while minimizing power and meeting our quality-of-result objective. As the design space is too large to evaluate exhaustively, we use active learning based on a random forest predictor to find good designs. We show that our approach can, for the first time, achieve dense 3D mapping and tracking in the real-time range within a 1W power budget on a popular embedded device. This is a 4.8× execution time improvement and a 2.8× power reduction compared to the state-of-the-art.
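
The active-learning loop itself is generic; the sketch below outlines it under the assumption of a surrogate cost model that screens random candidate configurations, measures the most promising one, and is retrained after each measurement. The paper uses a random forest; a trivial nearest-neighbour predictor and a synthetic cost function stand in here so the example runs end to end.

    // Schematic active-learning design space exploration: surrogate-guided
    // candidate screening, real measurement of the best candidate, retraining.
    // The Surrogate implementation and measure() are stand-ins, not the paper's code.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    public class DseSketch {
        interface Surrogate {
            void fit(List<double[]> xs, List<Double> ys);
            double predict(double[] x);
        }

        // Trivial nearest-neighbour surrogate; the paper uses a random forest.
        static class NearestNeighbour implements Surrogate {
            private List<double[]> xs = List.of();
            private List<Double> ys = List.of();
            public void fit(List<double[]> xs, List<Double> ys) { this.xs = xs; this.ys = ys; }
            public double predict(double[] x) {
                double best = Double.MAX_VALUE, y = 0;
                for (int i = 0; i < xs.size(); i++) {
                    double d = 0;
                    for (int j = 0; j < x.length; j++) d += Math.pow(x[j] - xs.get(i)[j], 2);
                    if (d < best) { best = d; y = ys.get(i); }
                }
                return y;
            }
        }

        // Placeholder for running the application with a given configuration and
        // returning a cost combining runtime, power, and quality of result.
        static double measure(double[] cfg) { return Arrays.stream(cfg).map(v -> (v - 0.3) * (v - 0.3)).sum(); }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            Surrogate model = new NearestNeighbour();
            List<double[]> seen = new ArrayList<>();
            List<Double> costs = new ArrayList<>();
            double[] best = null; double bestCost = Double.MAX_VALUE;
            for (int iter = 0; iter < 30; iter++) {
                double[] candidate = null; double predicted = Double.MAX_VALUE;
                for (int i = 0; i < 500; i++) {              // screen random candidates cheaply
                    double[] c = rnd.doubles(4).toArray();
                    double p = seen.isEmpty() ? rnd.nextDouble() : model.predict(c);
                    if (p < predicted) { predicted = p; candidate = c; }
                }
                double cost = measure(candidate);            // run the expensive experiment
                seen.add(candidate); costs.add(cost);
                model.fit(seen, costs);                      // retrain the surrogate
                if (cost < bestCost) { bestCost = cost; best = candidate; }
            }
            System.out.println("best cost: " + bestCost + " at " + Arrays.toString(best));
        }
    }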


Virtual Execution Environments | 2017

Heterogeneous Managed Runtime Systems: A Computer Vision Case Study

Christos Kotselidis; James Clarkson; Andrey Rodchenko; Andy Nisbet; John Mawer; Mikel Luján

Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and to exploit any available hardware accelerators. Current approaches to achieving real-time computer vision revolve around programming languages typically associated with High Performance Computing, along with binding extensions for OpenCL or CUDA execution. Such implementations, although high performing, lack portability across the wide range of diverse hardware resources and accelerators. In this paper, we showcase how a complex computer vision application can be implemented within a managed runtime system. We discuss the complexities of achieving high-performing and portable execution across embedded and desktop configurations. Furthermore, we demonstrate that it is possible to achieve the QoS target of over 30 frames per second (FPS) by exploiting FPGA and GPGPU acceleration transparently through the managed runtime system.


High Performance Embedded Architectures and Compilers | 2010

Improving performance by reducing aborts in hardware transactional memory

Mohammad Ansari; Behram Khan; Mikel Luján; Christos Kotselidis; Chris C. Kirkham; Ian Watson

The optimistic nature of Transactional Memory (TM) systems can lead to the concurrent execution of transactions that are later found to conflict. Conflicts degrade scalability and may lead to aborts that increase wasted work and degrade performance. A promising approach to reducing conflicts at runtime is dynamically, and transparently, reordering the execution of transactions upon discovery of conflicts. This approach has been explored in Software TMs (STMs), but not in Hardware TMs (HTMs). Furthermore, STM implementations of this approach cannot be ported to HTMs easily. This paper investigates the feasibility of such reordering in HTMs, and presents two designs that are scalable, independent of the on-chip interconnect, require only minor modifications to each core, and add no execution overhead if no conflicts occur. The evaluation takes LogTM-SE as a baseline and considers benchmarks with different levels of contention (transactional conflicts). The results show that the preferred design increases HTM performance by up to 17% when contention is low and 57% when contention is high, and never degrades performance. Finally, the designs are orthogonal to LogTM-SE; they require no modification to cache structures, and continue to support transaction virtualization, open and closed unbounded nesting, paging, thread suspension, and thread migration.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008

Experiences using adaptive concurrency in transactional memory with Lee's routing algorithm

Mohammad Ansari; Christos Kotselidis; Kim Jarvis; Mikel Luján; Chris C. Kirkham; Ian Watson

Experience in profiling Lee's routing algorithm, a new complex TM application, showed that transactional applications may exhibit dynamic exploitable parallelism, i.e., the amount of useful parallelism available at any point in time varies during the execution of the application. Obviously, executing too many transactions at times when the available parallelism is low will lead to high contention and wasted computation in aborted transactions, and vice versa. Current Transactional Memory (TM) implementations do not account for this behavior. This work employs adaptive concurrency to dynamically adjust the number of threads executing transactions concurrently. Our preliminary evaluation is performed in DSTM2 using Lee's routing algorithm, both of which were simple to modify to enable adaptive concurrency, and shows a significant reduction in resource usage and modest performance gains.
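
A minimal sketch of the general idea, assuming a semaphore-based throttle and hand-picked thresholds rather than the mechanism used in the paper: a monitor samples the commit/abort ratio over an interval and grows or shrinks the number of threads allowed to attempt transactions.

    // Schematic adaptive-concurrency controller; thresholds and the semaphore
    // throttle are illustrative choices, not the paper's implementation.
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.atomic.AtomicLong;

    public class AdaptiveConcurrency {
        private final int maxThreads;
        private final Semaphore permits;                  // threads must hold a permit to attempt a transaction
        private final AtomicLong commits = new AtomicLong();
        private final AtomicLong aborts = new AtomicLong();
        private int allowed;                              // current concurrency level

        AdaptiveConcurrency(int maxThreads) {
            this.maxThreads = maxThreads;
            this.allowed = maxThreads;
            this.permits = new Semaphore(maxThreads);
        }

        // Worker threads call these around each transaction attempt.
        void beginTransaction() throws InterruptedException { permits.acquire(); }
        void endTransaction(boolean committed) {
            (committed ? commits : aborts).incrementAndGet();
            permits.release();
        }

        // Called periodically by a monitor thread.
        synchronized void adapt() {
            long c = commits.getAndSet(0), a = aborts.getAndSet(0);
            if (c + a == 0) return;
            double commitRatio = (double) c / (c + a);
            if (commitRatio < 0.5 && allowed > 1 && permits.tryAcquire()) {
                allowed--;                                // high contention: withhold a permit
            } else if (commitRatio > 0.9 && allowed < maxThreads) {
                permits.release();                        // low contention: hand a permit back
                allowed++;
            }
        }
    }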


International Symposium on Performance Analysis of Systems and Software | 2017

MaxSim: A simulation platform for managed applications

Andrey Rodchenko; Christos Kotselidis; Andy Nisbet; Antoniu Pop; Mikel Luján

Managed applications, written in programming languages such as Java, C#, and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterization and performance analysis, of both hardware and software, as well as for research and development of novel hardware extensions. This paper introduces MaxSim, a simulation platform based on the Maxine VM, the ZSim simulator, and the McPAT modeling framework. MaxSim is able to quickly and accurately simulate managed workloads running on top of Maxine VM, and its capabilities are showcased with novel simulation techniques for: 1) low-intrusive microarchitectural profiling via pointer tagging on x86-64 platforms, 2) modeling of hardware extensions related, but not limited, to tagged pointers, and 3) modeling of complex software changes via address-space morphing. Low-intrusive microarchitectural profiling is achieved by utilizing tagged pointers to collect type- and allocation-site-related hardware events. Furthermore, MaxSim allows, through a novel technique called address-space morphing, the easy modeling of complex object layout transformations. Finally, through the co-designed capabilities of MaxSim, novel hardware extensions can be implemented and evaluated. We showcase MaxSim's capabilities by simulating the whole set of the DaCapo-9.12-bach benchmarks in less than a day while performing an up-to-date microarchitectural power and performance characterization. Furthermore, we demonstrate a hardware/software co-designed optimization that performs dynamic load elimination for array length retrieval, achieving up to a 14% reduction in L1 data cache loads and up to a 4% reduction in dynamic energy. MaxSim is available as free software at https://github.com/arodchen/MaxSim.
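
The pointer-tagging mechanism can be illustrated independently of Maxine VM and ZSim. The standalone sketch below assumes a 16-bit tag in the unused upper bits of a 64-bit address and simply aggregates event counts per tag, which is the essence of attributing hardware events to types or allocation sites; it is not MaxSim's code.

    // Illustrative mechanism only: packing a 16-bit class/allocation-site ID into
    // the upper bits of a 64-bit address and aggregating per-ID event counts.
    // MaxSim does this inside Maxine VM and ZSim; this sketch only shows the
    // bit manipulation and the per-tag histogram idea.
    import java.util.HashMap;
    import java.util.Map;

    public class TaggedPointerProfile {
        static final int TAG_SHIFT = 48;                  // x86-64 leaves the top 16 address bits unused
        static final long ADDR_MASK = (1L << TAG_SHIFT) - 1;

        static long tag(long address, int id)  { return (((long) id) << TAG_SHIFT) | (address & ADDR_MASK); }
        static int  tagOf(long taggedAddress)  { return (int) (taggedAddress >>> TAG_SHIFT); }
        static long addressOf(long taggedAddr) { return taggedAddr & ADDR_MASK; }

        // Per-tag event counters, e.g. cache misses attributed to each allocation site.
        private final Map<Integer, Long> events = new HashMap<>();

        void recordEvent(long taggedAddress) {
            events.merge(tagOf(taggedAddress), 1L, Long::sum);
        }

        public static void main(String[] args) {
            TaggedPointerProfile profile = new TaggedPointerProfile();
            long p = tag(0x7f00_1234L, 42);               // pretend object of class/site 42
            profile.recordEvent(p);
            System.out.printf("addr=%#x tag=%d events=%s%n", addressOf(p), tagOf(p), profile.events);
        }
    }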


IEEE Transactions on Computers | 2018

Type Information Elimination from Objects on Architectures with Tagged Pointers Support

Andrey Rodchenko; Christos Kotselidis; Andy Nisbet; Antoniu Pop; Mikel Luján

Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing this relation is to insert a pointer to the associated type information into every object. Such an approach, however, introduces memory and performance overheads when compared with non-object-oriented languages. Recent 64-bit computer architectures have added support for tagged pointers by ignoring a number of bits (the tag) of memory addresses during memory access operations and utilizing them for other purposes, mainly security. This paper presents the first investigation into how this hardware support can be exploited by a Java Virtual Machine to remove type information from objects. Moreover, we propose novel hardware extensions to the address generation and load-store units to achieve low-overhead type information retrieval and tagged object pointer compression-decompression. The evaluation has been conducted after integrating the Maxine VM and the ZSim microarchitectural simulator. The results, across the whole DaCapo benchmark suite, pseudo-SPECjbb2005, SLAMBench, and GraphChi-PR executed to completion, show up to 26 and 10 percent geometric mean heap space savings, up to 50 and 12 percent geometric mean dynamic DRAM energy reduction, and up to 49 and 3 percent geometric mean execution time reduction, with no significant performance regressions.
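
A minimal sketch of the underlying idea, not the paper's implementation: the class index travels in the tag bits of the object's reference and type metadata is recovered through a side table, so no per-object header word pointing at the class is required. All names and bit widths below are illustrative.

    // Illustrative only: class index encoded in the tag bits of a reference,
    // with a side table standing in for the VM's class metadata.
    import java.util.ArrayList;
    import java.util.List;

    public class TypeInTag {
        static final int TAG_SHIFT = 48;                  // upper 16 bits ignored by tagged-pointer hardware
        static final long ADDR_MASK = (1L << TAG_SHIFT) - 1;

        // Side table mapping class indices to type metadata.
        static final List<String> CLASS_TABLE = new ArrayList<>(List.of("java/lang/Object"));

        static int registerClass(String name) {
            CLASS_TABLE.add(name);
            return CLASS_TABLE.size() - 1;
        }

        // The allocator would return references with the class index pre-encoded,
        // so no per-object header word pointing at the class is needed.
        static long encode(long rawAddress, int classIndex) {
            return (((long) classIndex) << TAG_SHIFT) | (rawAddress & ADDR_MASK);
        }

        // Dynamic dispatch or instanceof checks would start from here instead of a header load.
        static String classOf(long taggedRef) {
            return CLASS_TABLE.get((int) (taggedRef >>> TAG_SHIFT));
        }

        public static void main(String[] args) {
            int idx = registerClass("example/Point");
            long ref = encode(0x7f00_0000_1000L, idx);
            System.out.println(classOf(ref));             // prints example/Point
        }
    }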


Proceedings of the 15th International Conference on Managed Languages & Runtimes | 2018

Exploiting high-performance heterogeneous hardware for Java programs using Graal

James Clarkson; Juan Fumero; Michail Papadimitriou; Foivos S. Zakkak; Maria Xekalaki; Christos Kotselidis; Mikel Luján

The proliferation of heterogeneous hardware in recent years means that every system we program is likely to include a mix of compute elements; each with different characteristics. By utilizing these available hardware resources, developers can improve the performance and energy efficiency of their applications. However, existing tools for heterogeneous programming neglect developers who do not have the time or inclination to switch programming languages or learn the intricacies of a specific piece of hardware. This paper presents a framework that enables Java applications to be deployed across a variety of heterogeneous systems while exploiting any available multi- or many-core processor. The novel aspect of our approach is that it does not require any a priori knowledge of the hardware, or for the developer to worry about managing disparate memory spaces. Java applications are transparently compiled and optimized for the hardware at run-time. We also present a performance evaluation of our just-in-time (JIT) compiler using a framework to accelerate SLAM, a complex computer vision application entirely written in Java. We show that we can accelerate SLAM up to 150x compared to the Java reference implementation, rendering 107 frames per second (FPS).
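
A hypothetical usage sketch, assuming a fluent task-schedule style API: the kernel stays ordinary Java and the runtime owns compilation and data movement. The TaskSchedule type below is a local stub, not the framework's real API, and execute() simply falls back to the host here.

    // Hypothetical usage sketch of a Graal-based heterogeneous runtime of the
    // kind described above; the TaskSchedule stub only illustrates the
    // programming model, it performs no GPU compilation or data transfer.
    import java.util.Arrays;

    public class HeterogeneousSketch {
        // Ordinary Java code; a JIT such as Graal could specialise this for a GPU or FPGA.
        static void saxpy(float alpha, float[] x, float[] y) {
            for (int i = 0; i < y.length; i++) {
                y[i] = alpha * x[i] + y[i];
            }
        }

        // Stub: in a real framework, execute() would JIT-compile the task for the
        // selected device and copy x and y in and out as needed.
        static final class TaskSchedule {
            private Runnable kernel = () -> { };
            TaskSchedule task(String name, Runnable k) { this.kernel = k; return this; }
            TaskSchedule streamOut(Object... arrays)   { return this; }
            void execute() { kernel.run(); }              // host fallback in this sketch
        }

        public static void main(String[] args) {
            float[] x = new float[1 << 20], y = new float[1 << 20];
            Arrays.fill(x, 2f);
            new TaskSchedule()
                .task("saxpy", () -> saxpy(0.5f, x, y))
                .streamOut(y)
                .execute();
            System.out.println(y[0]);                     // 1.0
        }
    }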

Collaboration


Dive into Christos Kotselidis's collaborations.

Top Co-Authors

Mikel Luján (University of Manchester)
Andy Nisbet (University of Manchester)
James Clarkson (University of Manchester)
John Mawer (University of Manchester)
Ian Watson (University of Manchester)
Juan Fumero (University of Manchester)
Nikos Foutris (University of Manchester)