Christos D. Antonopoulos

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Christos D. Antonopoulos is active.

Explore More

Publication

Featured researches published by Christos D. Antonopoulos.

signal processing systems | 2007

Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell

Alexandros Stamatakis; Filip Blagojevic; Dimitrios S. Nikolopoulos; Christos D. Antonopoulos

Phylogenetic inference is considered to be one of the grand challenges in Bioinformatics due to the immense computational requirements. RAxML is currently among the fastest and most accurate programs for phylogenetic tree inference under the Maximum Likelihood (ML) criterion. First, we introduce new tree search heuristics that accelerate RAxML by a factor of 2.43 while returning equally good trees. The performance of the new search algorithm has been assessed on 18 real-world datasets comprising 148 up to 4,843 DNA sequences. We then present the implementation, optimization, and evaluation of RAxML on the IBM Cell Broadband Engine. We address the problems and provide solutions pertaining to the optimization of floating point code, control flow, communication, and scheduling of multi-level parallelism on the Cell.

international conference on supercomputing | 2006

Online power-performance adaptation of multithreaded programs using hardware event-based prediction

Matthew Curtis-Maury; James Dzierwa; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

With high-end systems featuring multicore/multithreaded processors and high component density, power-aware high-performance multithreading libraries become a critical element of the system software stack. Online power and performance adaptation of multithreaded code from within user-level runtime libraries is a relatively new and unexplored area of research. We present a user-level library framework for nearly optimal online adaptation of multithreaded codes for low-power, high-performance execution. Our framework operates by regulating concurrency and changing the processors/threads configuration as the program executes. It is innovative in that it uses fast, runtime performance prediction derived from hardware event-driven profiling, to select thread granularities that achieve nearly optimal energy-efficiency points. The use of predictors substantially reduces the runtime cost of granularity control and program adaptation. Our framework achieves performance and ED2 (energy-delay-squared) levels which are: i) comparable to or better than those of oracle-derived offline predictors; ii) significantly better than those of online predictors using exhaustive or localized linear search. The complete prediction and adaptation framework is implemented on a real multi-SMT system with Intel Hyperthreaded processors and embeds adaptation capabilities in OpenMP programs.

field-programmable custom computing machines | 2011

Synthesis of Platform Architectures from OpenCL Programs

Muhsen Owaida; Nikolaos Bellas; Konstantis Daloukas; Christos D. Antonopoulos

The problem of automatically generating hardware modules from a high level representation of an application has been at the research forefront in the last few years. In this paper, we use OpenCL, an industry supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our architectural synthesis tool, SOpenCL (Silicon-OpenCL), adapts OpenCL into a novel hardware design flow which efficiently maps coarse and fine-grained parallelism of an application onto an FPGA reconfigurable fabric. SOpenCL is based on a source-to-source code transformation step that coarsens the OpenCL fine-grained parallelism into a series of nested loops, and on a template-based hardware generation back-end that configures the accelerator based on the functionality and the application performance and area requirements. Our experimentation with a variety of OpenCL and C kernel benchmarks reveals that area, throughput and frequency optimized hardware implementations are attainable using SOpenCL.

IEEE Transactions on Parallel and Distributed Systems | 2008

Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes

Matthew Curtis-Maury; Filip Blagojevic; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

Computing has recently reached an inflection point with the introduction of multi-core processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores, however in several domains users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system which uses live program analysis to optimize applications dynamically. We describe a dynamic, phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8% simultaneous with an improvement in performance of 17.9%, resulting in energy savings of 26.7%.

acm sigplan symposium on principles and practice of parallel programming | 2007

Dynamic multigrain parallelization on the cell broadband engine

Filip Blagojevic; Dimitrios S. Nikolopoulos; Alexandros Stamatakis; Christos D. Antonopoulos

This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present mechanisms and policies for adaptive exploitation and scheduling of layered parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exploited from the runtime system whenever task-level parallelism leaves idle cores. We present a scheduler for applications with layered parallelism on Cell and investigate its performance with RAxML, an application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic methods that selectively exploit the layers of parallelism in the system, in response to workload fluctuation. Our scheduler out performs the MPI version of RAxML, scheduled by the Linux kernel, by up to a factor of 2.6. We are able to execute RAxMLon one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5--10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5processor.

international symposium on memory management | 2006

Scalable locality-conscious multithreaded memory allocation

Scott Schneider; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

We present Streamflow, a new multithreaded memory manager designed for low overhead, high-performance memory allocation while transparently favoring locality. Streamflow enables low over-head simultaneous allocation by multiple threads and adapts to sequential allocation at speeds comparable to that of custom sequential allocators. It favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing, all using a unified design based on segregated heaps. Streamflow introduces an innovative design which uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization in the less common case of remote deallocations. Spatial locality at the cache and page level is favoredby eliminating small objects headers, reducing allocator-induced conflicts via contiguous allocation of page blocks in physical memory, reducing allocator-induced false sharing by using segregated heaps and achieving better TLB performance and fewer page faults via the use of superpages. Combining these locality optimizations with the drastic reduction of synchronization and latency overhead allows Streamflow to perform comparably with optimized sequential allocators and outperform--on a shared-memory systemwith four two-way SMT processors--four state-of-the-art multi-processor allocators by sizeable margins in our experiments. The allocation-intensive sequential and parallel benchmarks used in our experiments represent a variety of behaviors, including mostly local object allocation-deallocation patterns and producer-consumer allocation-deallocation patterns.

international parallel and distributed processing symposium | 2007

RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine

Filip Blagojevic; Alexandros Stamatakis; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

Computational phylogeny is a challenging application even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits fine- and coarse-grain parallelism at multiple levels. In this paper, we present the porting, optimization, and evaluation of RAxML on the cell broadband engine. RAxML is a provably efficient, hill climbing algorithm for computing phylogenetic trees, based on the maximum likelihood (ML) method. The cell broadband engine, a heterogeneous multi-core processor with SIMD accelerators which was initially marketed for set-top boxes, is currently being deployed on supercomputers and high-end server architectures. We present both conventional and unconventional, cell-specific optimizations for RAxMLs search algorithm on a real cell multiprocessor. While exploring these optimizations, we present solutions to problems related to floating point code execution, complex control flow, communication, scheduling, and multilevel parallelization on the cell.

international workshop on openmp | 2005

An evaluation of OpenMP on current and emerging multithreaded/multicore processors

Matthew Curtis-Maury; Xiaoning Ding; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

Multiprocessors based on simultaneous multithreaded (SMT) or multicore (CMP) processors are continuing to gain a significant share in both high-performance and mainstream computing markets. In this paper we evaluate the performance of OpenMP applications on these two parallel architectures. We use detailed hardware metrics to identify architectural bottlenecks. We find that the high level of resource sharing in SMTs results in performance complications, should more than 1 thread be assigned on a single physical processor. CMPs, on the other hand, are an attractive alternative. Our results show that the exploitation of the multiple processor cores on each chip results in significant performance benefits. We evaluate an adaptive, run-time mechanism which provides limited performance improvements on SMTs, however the inherent bottlenecks remain difficult to overcome. We conclude that out-of-the-box OpenMP code scales better on CMPs than SMTs. To maximize the efficiency of OpenMP on SMTs, new capabilities are required by the runtime environment and/or the programming interface.

international conference on parallel processing | 2003

Scheduling algorithms with bus bandwidth considerations for SMPs

Christos D. Antonopoulos; Dimitrios S. Nikolopoulos; Theodore S. Papatheodorou

The bus that connects processors to memory is known to be a major architectural bottleneck in SMPs. However, both software and scheduling policies for these systems generally focus on memory hierarchy optimizations and do not address the bus bandwidth limitations directly. We first present experimental results which indicate that bus saturation can cause an up to almost three-fold slowdown to applications. Motivated by these results, we introduce two scheduling policies that take into account the bus bandwidth consumption of applications. The necessary information is provided by performance monitoring counters which are present in all modern processors. Our algorithms organize jobs so that processes with high-bandwidth and low-bandwidth demands are co-scheduled to improve bus bandwidth utilization without saturating the bus. We found that our scheduler is effective with applications of varying bandwidth requirements, from very low to close to the limit of saturation. We also tuned our scheduler for robustness in the presence of bursts of high bus bandwidth consumption from individual jobs. The new scheduling policies improve system throughput by up to 68% (26% in average) in comparison with the standard Linux scheduler

international parallel and distributed processing symposium | 2005

Scheduling algorithms for effective thread pairing on hybrid multiprocessors

Robert L. McGregor; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

With the latest high-end computing nodes combining shared-memory multiprocessing with hardware multithreading, new scheduling policies are necessary for workloads consisting of multithreaded applications. The use of hybrid multiprocessors presents schedulers with the problem of job pairing, i.e. deciding which specific jobs can share each processor with minimum performance penalty, by running on different execution contexts. Therefore, scheduling policies are expected to decide not only which job mix will execute simultaneously across the processors, but also which jobs can be combined within each processor. This paper addresses the problem by introducing new scheduling policies that use run-time performance information to identify the best mix of threads to run across processors and within each processor. Scheduling of threads across processors is driven by the memory bandwidth utilization of the threads, whereas scheduling of threads within processors is driven by one of three metrics: bus transaction rate per thread, stall cycle rate per thread, or outermost level cache miss rate per thread. We have implemented and experimentally evaluated these policies on a real multiprocessor server with Intel Hyperthreaded processors. The policy using bus transaction rate for thread pairing achieves an average 13.4% and a maximum 28.7% performance improvement over the Linux scheduler. The policy using stall cycle rate for thread pairing achieves an average 9.5% and a maximum 18.8% performance improvement. The average and maximum performance gains of the policy using cache miss rate for thread pairing are 7.2% and 23.6% respectively.

Explore More