Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Dimitrios S. Nikolopoulos is active.

Publications


Featured research published by Dimitrios S. Nikolopoulos.


International Conference on Parallel Architectures and Compilation Techniques | 2008

Prediction models for multi-dimensional power-performance optimization on many cores

Matthew Curtis-Maury; Ankur Shah; Filip Blagojevic; Dimitrios S. Nikolopoulos; Bronis R. de Supinski; Martin Schulz

Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.
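The idea of jointly searching the DVFS and DCT knobs with a performance predictor can be illustrated with a minimal sketch. The models below are invented toy functions, not the paper's learned predictor; only the overall structure (score every frequency/concurrency pair with a predicted ED² and pick the best) reflects the approach described above.

```python
# Toy sketch of simultaneous DVFS + DCT knob selection (illustrative only;
# the paper's predictor is trained online from hardware events).

def predict_time(freq_ghz, threads, serial_frac=0.1):
    """Amdahl-style runtime estimate (seconds) for a fixed workload."""
    work = 100.0  # arbitrary work units
    return work * (serial_frac + (1 - serial_frac) / threads) / freq_ghz

def predict_power(freq_ghz, threads):
    """Toy power model: dynamic power grows with f^3 and active threads."""
    static = 20.0
    return static + 2.0 * threads * freq_ghz ** 3

def best_knobs(freqs, thread_counts):
    """Score every (frequency, concurrency) pair by predicted ED^2."""
    def ed2(cfg):
        f, t = cfg
        time = predict_time(f, t)
        energy = predict_power(f, t) * time
        return energy * time ** 2
    return min(((f, t) for f in freqs for t in thread_counts), key=ed2)

# Evaluate all 16 candidate settings and keep the ED^2-minimizing one.
best = best_knobs([1.0, 1.5, 2.0, 2.5], [1, 2, 4, 8])
```

The payoff of a predictor in the real system is that these candidate configurations are scored analytically rather than each being executed, which is what makes runtime adaptation cheap.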


Signal Processing Systems | 2007

Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell

Alexandros Stamatakis; Filip Blagojevic; Dimitrios S. Nikolopoulos; Christos D. Antonopoulos

Phylogenetic inference is considered to be one of the grand challenges in Bioinformatics due to its immense computational requirements. RAxML is currently among the fastest and most accurate programs for phylogenetic tree inference under the Maximum Likelihood (ML) criterion. First, we introduce new tree search heuristics that accelerate RAxML by a factor of 2.43 while returning equally good trees. The performance of the new search algorithm has been assessed on 18 real-world datasets comprising 148 to 4,843 DNA sequences. We then present the implementation, optimization, and evaluation of RAxML on the IBM Cell Broadband Engine. We address the problems and provide solutions pertaining to the optimization of floating point code, control flow, communication, and scheduling of multi-level parallelism on the Cell.


International Conference on Supercomputing | 2006

Online power-performance adaptation of multithreaded programs using hardware event-based prediction

Matthew Curtis-Maury; James Dzierwa; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

With high-end systems featuring multicore/multithreaded processors and high component density, power-aware high-performance multithreading libraries become a critical element of the system software stack. Online power and performance adaptation of multithreaded code from within user-level runtime libraries is a relatively new and unexplored area of research. We present a user-level library framework for nearly optimal online adaptation of multithreaded codes for low-power, high-performance execution. Our framework operates by regulating concurrency and changing the processors/threads configuration as the program executes. It is innovative in that it uses fast, runtime performance prediction derived from hardware event-driven profiling, to select thread granularities that achieve nearly optimal energy-efficiency points. The use of predictors substantially reduces the runtime cost of granularity control and program adaptation. Our framework achieves performance and ED2 (energy-delay-squared) levels which are: i) comparable to or better than those of oracle-derived offline predictors; ii) significantly better than those of online predictors using exhaustive or localized linear search. The complete prediction and adaptation framework is implemented on a real multi-SMT system with Intel Hyperthreaded processors and embeds adaptation capabilities in OpenMP programs.
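The control loop described above, where a predictor replaces search at region boundaries, can be sketched as follows. All names here (`predictor`, `read_counters`, `run_region`) are illustrative stand-ins, not the library's actual API.

```python
# Hedged sketch of prediction-driven online adaptation: between parallel
# regions, run at the thread count a predictor ranks best, rather than
# searching the configuration space exhaustively.

def adapt(regions, configs, predictor, read_counters, run_region):
    """Run each region at the configuration the predictor ranks highest."""
    history = []
    for region in regions:
        sample = read_counters()                # hardware-event profile
        best = max(configs, key=lambda c: predictor(sample, c))
        history.append((region, best))
        run_region(region, threads=best)
    return history
```

The point the abstract makes is visible in the structure: the loop costs one profile read and one model evaluation per region, instead of one timed execution per candidate configuration.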


International Parallel and Distributed Processing Symposium | 2010

Hybrid MPI/OpenMP power-aware computing

Dong Li; Bronis R. de Supinski; Martin Schulz; Kirk W. Cameron; Dimitrios S. Nikolopoulos

Power-aware execution of parallel programs is now a primary concern in large-scale HPC environments. Prior research in this area has explored models and algorithms based on dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) to achieve power-aware execution of programs written in a single programming model, typically MPI or OpenMP. However, hybrid programming models combining MPI and OpenMP are growing in popularity as emerging large-scale systems have many nodes with several processors per node and multiple cores per processor. In this paper we present and evaluate solutions for power-efficient execution of programs written in this hybrid model targeting large-scale distributed systems with multicore nodes. We use a new power-aware performance prediction model of hybrid MPI/OpenMP applications to derive a novel algorithm for power-efficient execution of realistic applications from the ASC Sequoia and NPB MZ benchmarks. Our new algorithm yields substantial energy savings (4.18% on average and up to 13.8%) with either negligible performance loss or performance gain (up to 7.2%).


IEEE Transactions on Parallel and Distributed Systems | 2008

Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes

Matthew Curtis-Maury; Filip Blagojevic; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

Computing has recently reached an inflection point with the introduction of multi-core processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores; however, in several domains users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system which uses live program analysis to optimize applications dynamically. We describe a dynamic, phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8% simultaneous with an improvement in performance of 17.9%, resulting in energy savings of 26.7%.
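The core statistical step, multivariate regression from hardware-event rates to performance at another concurrency level, can be sketched with synthetic data. Everything below is fabricated for illustration; the paper's model is trained on real counter samples collected per program phase.

```python
# Illustrative sketch of counter-driven multivariate regression: fit a
# linear model mapping event rates observed during one phase to the
# speedup expected at a different thread count. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic samples of [IPC, L2 miss rate, stall ratio], one per phase.
events = rng.uniform(size=(50, 3))
# Synthetic target: speedup at 8 threads, generated from a known relation
# plus a little noise, so we can check the fit recovers it.
speedup = 4.0 * events[:, 0] - 2.0 * events[:, 1] + 1.0 \
          + rng.normal(0, 0.01, 50)

# Ordinary least squares with an intercept column.
X = np.hstack([events, np.ones((50, 1))])
coef, *_ = np.linalg.lstsq(X, speedup, rcond=None)

def predict_speedup(ipc, l2_miss, stall):
    """Predict speedup for one new sample of event rates."""
    return float(np.dot([ipc, l2_miss, stall, 1.0], coef))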


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2007

Dynamic multigrain parallelization on the cell broadband engine

Filip Blagojevic; Dimitrios S. Nikolopoulos; Alexandros Stamatakis; Christos D. Antonopoulos

This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present mechanisms and policies for adaptive exploitation and scheduling of layered parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exploited from the runtime system whenever task-level parallelism leaves idle cores. We present a scheduler for applications with layered parallelism on Cell and investigate its performance with RAxML, an application which infers large phylogenetic trees using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic methods that selectively exploit the layers of parallelism in the system, in response to workload fluctuation. Our scheduler outperforms the MPI version of RAxML, scheduled by the Linux kernel, by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
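The policy of donating idle cores to loop-level parallelism inside running tasks can be sketched as a small placement function. This is a hedged illustration of the general idea, not the paper's actual Cell scheduler, and the function name is invented.

```python
# Sketch of malleable multigrain scheduling: each runnable task gets one
# core for task-level parallelism, and any leftover cores are handed out
# round-robin as extra loop-level workers inside the tasks.

def assign_cores(num_cores, runnable_tasks):
    """Return {task_id: cores} mixing task- and loop-level parallelism."""
    tasks = min(num_cores, runnable_tasks)
    if tasks == 0:
        return {}
    plan = {t: 1 for t in range(tasks)}   # one core per task first
    for i in range(num_cores - tasks):    # then spread the spare cores
        plan[i % tasks] += 1
    return plan

# With 8 cores and only 3 runnable tasks, task-level parallelism alone
# would idle 5 cores; here they become loop workers instead.
plan = assign_cores(8, 3)
```

The key property mirrored from the abstract is that the split between the two layers of parallelism is recomputed as the number of runnable tasks fluctuates, rather than fixed at program start.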


International Symposium on Memory Management | 2006

Scalable locality-conscious multithreaded memory allocation

Scott Schneider; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

We present Streamflow, a new multithreaded memory manager designed for low-overhead, high-performance memory allocation while transparently favoring locality. Streamflow enables low-overhead simultaneous allocation by multiple threads and adapts to sequential allocation at speeds comparable to that of custom sequential allocators. It favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing, all using a unified design based on segregated heaps. Streamflow introduces an innovative design which uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization in the less common case of remote deallocations. Spatial locality at the cache and page level is favored by eliminating small-object headers, reducing allocator-induced conflicts via contiguous allocation of page blocks in physical memory, reducing allocator-induced false sharing by using segregated heaps, and achieving better TLB performance and fewer page faults via the use of superpages. Combining these locality optimizations with the drastic reduction of synchronization and latency overhead allows Streamflow to perform comparably with optimized sequential allocators and outperform (on a shared-memory system with four two-way SMT processors) four state-of-the-art multiprocessor allocators by sizeable margins in our experiments. The allocation-intensive sequential and parallel benchmarks used in our experiments represent a variety of behaviors, including mostly local object allocation-deallocation patterns and producer-consumer allocation-deallocation patterns.
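Two of the structural ideas above, per-thread segregated free lists for synchronization-free local operations and a separate queue for deallocations arriving from other threads, can be sketched at a high level. This is emphatically not Streamflow's code (which is a C allocator managing real page blocks); it is a toy model of the design, with invented names, using a `deque` as a stand-in for the non-blocking remote-free list.

```python
# Toy model of a segregated-heap, locality-conscious allocator design:
# local malloc/free touch only thread-private free lists (no locks),
# while remote frees go through a separate queue, drained lazily.
import collections

class ThreadHeap:
    SIZE_CLASSES = (16, 32, 64, 128)

    def __init__(self):
        # One free list per size class: local operations never synchronize.
        self.free_lists = {c: [] for c in self.SIZE_CLASSES}
        # Frees from other threads land here (deque.append is atomic in
        # CPython, standing in for a non-blocking concurrent list).
        self.remote_frees = collections.deque()

    def _size_class(self, size):
        return next(c for c in self.SIZE_CLASSES if size <= c)

    def malloc(self, size):
        c = self._size_class(size)
        if not self.free_lists[c]:
            self._reclaim_remote()        # drain remote frees lazily
        if self.free_lists[c]:
            return self.free_lists[c].pop()
        return bytearray(c)               # fall back to a fresh block

    def free_local(self, block):
        self.free_lists[len(block)].append(block)

    def free_remote(self, block):         # called by other threads
        self.remote_frees.append(block)

    def _reclaim_remote(self):
        while self.remote_frees:
            self.free_local(self.remote_frees.popleft())
```

Popping the most recently freed block first also hints at why such designs favor temporal locality: a just-freed block is likely still cache-resident when it is reallocated.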


Operating Systems Review | 2009

Supporting MapReduce on large-scale asymmetric multi-core clusters

M. Mustafa Rafique; Benjamin Rose; Ali Raza Butt; Dimitrios S. Nikolopoulos

Asymmetric multi-core processors (AMPs), with general-purpose and specialized cores packaged on the same chip, are emerging as a leading paradigm for high-end computing. A large body of existing research explores the use of standalone AMPs in computationally challenging and data-intensive applications. AMPs are rapidly being deployed as high-performance accelerators on clusters. In these settings, scheduling, communication and I/O are managed by general-purpose processors (GPPs), while computation is off-loaded to AMPs. Design space exploration for the configuration and software stack of hybrid clusters of AMPs and GPPs is an open problem. In this paper, we explore this design space in an implementation of the popular MapReduce programming model. Our contributions are: an exploration of various design alternatives for hybrid asymmetric clusters of AMPs and GPPs; the adoption of a streaming approach to supporting MapReduce computations on clusters with asymmetric components; and adaptive schedulers that take into account individual component capabilities in asymmetric clusters. Throughout our design, we remove I/O bottlenecks using double-buffering and asynchronous I/O. We present an evaluation of the design choices through experiments on a real cluster with MapReduce workloads of varying degrees of computation intensity. We find that in a cluster with resource-constrained and well-provisioned AMP accelerators, a streaming approach achieves 50.5% and 73.1% better performance compared to the non-streaming approach, respectively, and scales almost linearly with increasing number of compute nodes. We also show that our dynamic scheduling mechanisms effectively adapt the parameters of the scheduling policies between applications with different computation density.
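The streaming/double-buffering idea, overlapping input staging on the GPP side with computation on the accelerator side, can be sketched with a bounded buffer pipeline. This is a generic illustration of the technique using threads as stand-ins for the GPP/AMP split, not the paper's cluster implementation; `run_streaming` and its parameters are invented names.

```python
# Sketch of streaming with double-buffering: while the "accelerator"
# consumes one input buffer, the "host" producer stages the next, so
# I/O overlaps with computation. Buffer count bounds staged memory.
import queue
import threading

def run_streaming(chunks, map_fn, num_buffers=2):
    """Pipeline map_fn over input chunks using a bounded buffer pool."""
    filled = queue.Queue(maxsize=num_buffers)
    results = []

    def producer():                  # host side: stage input buffers
        for chunk in chunks:
            filled.put(chunk)        # blocks when both buffers are full
        filled.put(None)             # end-of-stream marker

    t = threading.Thread(target=producer)
    t.start()
    while True:                      # accelerator side: consume buffers
        chunk = filled.get()
        if chunk is None:
            break
        results.append(map_fn(chunk))
    t.join()
    return results
```

The `maxsize=num_buffers` bound is what makes this double-buffering rather than unbounded prefetching: with two buffers, one is always being filled while the other is being processed, which matters on memory-constrained accelerators like the ones discussed above.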


International Parallel and Distributed Processing Symposium | 2007

RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine

Filip Blagojevic; Alexandros Stamatakis; Christos D. Antonopoulos; Dimitrios S. Nikolopoulos

Computational phylogeny is a challenging application even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits fine- and coarse-grain parallelism at multiple levels. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient, hill climbing algorithm for computing phylogenetic trees, based on the maximum likelihood (ML) method. The Cell Broadband Engine, a heterogeneous multi-core processor with SIMD accelerators which was initially marketed for set-top boxes, is currently being deployed on supercomputers and high-end server architectures. We present both conventional and unconventional, Cell-specific optimizations for RAxML's search algorithm on a real Cell multiprocessor. While exploring these optimizations, we present solutions to problems related to floating point code execution, complex control flow, communication, scheduling, and multilevel parallelization on the Cell.


Archive | 2010

Recent Advances in the Message Passing Interface

Yiannis Cotronis; Anthony Danalis; Dimitrios S. Nikolopoulos; Jack J. Dongarra

Large Scale Systems: A Scalable MPI_Comm_split Algorithm for Exascale Computing; Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems; Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues; PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems; Run-Time Analysis and Instrumentation for Communication Overlap Potential; Efficient MPI Support for Advanced Hybrid Programming Models.

Parallel Filesystems and I/O: An HDF5 MPI Virtual File Driver for Parallel In-situ Post-processing; Automated Tracing of I/O Stack; MPI Datatype Marshalling: A Case Study in Datatype Equivalence.

Collective Operations: Design of Kernel-Level Asynchronous Collective Communication; Network Offloaded Hierarchical Collectives Using ConnectX-2's CORE-Direct Capabilities; An In-Place Algorithm for Irregular All-to-All Communication with Limited Memory.

Applications: Massively Parallel Finite Element Programming; Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes; Parallel Chaining Algorithms.

MPI Internals (I): Precise Dynamic Analysis for Slack Elasticity: Adding Buffering without Adding Bugs; Implementing MPI on Windows: Comparison with Common Approaches on Unix; Compact and Efficient Implementation of the MPI Group Operations; Characteristics of the Unexpected Message Queue of MPI Applications.

Fault Tolerance: Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols; Communication Target Selection for Replicated MPI Processes; Transparent Redundant Computing with MPI; Checkpoint/Restart-Enabled Parallel Debugging.

Best Paper Awards: Load Balancing for Regular Meshes on SMPs with MPI; Adaptive MPI Multirail Tuning for Non-uniform Input/Output Access; Using Triggered Operations to Offload Collective Communication Operations.

MPI Internals (II): Second-Order Algorithmic Differentiation by Source Transformation of MPI Code; Locality and Topology Aware Intra-node Communication among Multicore CPUs; Transparent Neutral Element Elimination in MPI Reduction Operations.

Poster Abstracts: Use Case Evaluation of the Proposed MPIT Configuration and Performance Interface; Two Algorithms of Irregular Scatter/Gather Operations for Heterogeneous Platforms; Measuring Execution Times of Collective Communications in an Empirical Optimization Framework; Dynamic Verification of Hybrid Programs; Challenges and Issues of Supporting Task Parallelism in MPI.

Collaboration


An overview of Dimitrios S. Nikolopoulos's collaborations.

Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar
Top Co-Authors

Avatar

Eduard Ayguadé

Barcelona Supercomputing Center


Jesús Labarta

Barcelona Supercomputing Center
