Costas Bekas
IBM
Publications
Featured research published by Costas Bekas.
ieee international conference on high performance computing data and analytics | 2015
Johann Rudi; A. Cristiano I. Malossi; Tobin Isaac; Georg Stadler; Michael Gurnis; Peter W. J. Staar; Yves Ineichen; Costas Bekas; Alessandro Curioni; Omar Ghattas
Mantle convection is the fundamental physical process within Earth's interior responsible for the thermal and geological evolution of the planet, including plate tectonics. The mantle is modeled as a viscous, incompressible, non-Newtonian fluid. The wide range of spatial scales, extreme variability and anisotropy in material properties, and severely nonlinear rheology have made global mantle convection modeling with realistic parameters prohibitive. Here we present a new implicit solver that exhibits optimal algorithmic performance and is capable of extreme scaling for hard PDE problems, such as mantle convection. To maximize accuracy and minimize runtime, the solver incorporates a number of advances, including aggressive multi-octree adaptivity, mixed continuous-discontinuous discretization, arbitrarily high-order accuracy, hybrid spectral/geometric/algebraic multigrid, and novel Schur-complement preconditioning. These features present enormous challenges for extreme scalability. We demonstrate that, contrary to conventional wisdom, algorithmically optimal implicit solvers can be designed that scale out to 1.5 million cores for severely nonlinear, ill-conditioned, heterogeneous, and anisotropic PDEs.
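The Schur-complement preconditioning mentioned above builds on the classic block elimination of saddle-point systems. As a hedged illustration (a dense direct sketch, not the paper's matrix-free multigrid approach), solving the Stokes-like block system [A, B^T; B, 0] reduces to a solve with the Schur complement S = B A^{-1} B^T:

```python
import numpy as np

def solve_saddle_point(A, B, f, g):
    """Solve the saddle-point system
        [A  B^T] [u]   [f]
        [B   0 ] [p] = [g]
    by forming the Schur complement S = B A^{-1} B^T explicitly.
    In a real solver, S is never formed; it is applied matrix-free
    and preconditioned (e.g. by multigrid), as in the paper."""
    Ainv_BT = np.linalg.solve(A, B.T)       # A^{-1} B^T
    Ainv_f = np.linalg.solve(A, f)          # A^{-1} f
    S = B @ Ainv_BT                         # Schur complement
    p = np.linalg.solve(S, B @ Ainv_f - g)  # pressure-like unknown
    u = Ainv_f - Ainv_BT @ p                # velocity-like unknown
    return u, p
```

Eliminating u = A^{-1}(f - B^T p) from the first block row and substituting into the second yields S p = B A^{-1} f - g, which is the reduced system the sketch solves.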
arXiv: Emerging Technologies | 2018
Manuel Le Gallo; Abu Sebastian; Roland Mathis; Matteo Manica; Heiner Giefers; Tomas Tuma; Costas Bekas; Alessandro Curioni; Evangelos Eleftheriou
As complementary metal–oxide–semiconductor (CMOS) scaling reaches its technological limits, a radical departure from traditional von Neumann systems, which involve separate processing and memory units, is needed in order to extend the performance of today's computers substantially. In-memory computing is a promising approach in which nanoscale resistive memory devices, organized in a computational memory unit, are used for both processing and memory. However, to reach the numerical accuracy typically required for data analytics and scientific computing, limitations arising from device variability and non-ideal device characteristics need to be addressed. Here we introduce the concept of mixed-precision in-memory computing, which combines a von Neumann machine with a computational memory unit. In this hybrid system, the computational memory unit performs the bulk of a computational task, while the von Neumann machine implements a backward method to iteratively improve the accuracy of the solution. The system therefore benefits from both the high precision of digital computing and the energy/areal efficiency of in-memory computing. We experimentally demonstrate the efficacy of the approach by accurately solving systems of linear equations, in particular, a system of 5,000 equations using 998,752 phase-change memory devices.
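The iterative-improvement loop described above is, in spirit, mixed-precision iterative refinement. A minimal sketch, with an inexact float32 solver standing in for the analog computational-memory unit (a hypothetical substitution, not the paper's hardware):

```python
import numpy as np

def mixed_precision_solve(A, b, iters=20):
    """Iterative refinement for A x = b: a low-precision inner solver
    does the bulk of the work, while the high-precision host only
    computes residuals and accumulates corrections."""
    A_lo = A.astype(np.float32)  # stand-in for the inexact memory unit
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                                  # float64 residual
        r_lo = r.astype(np.float32)
        dx = np.linalg.solve(A_lo, r_lo)               # inexact inner solve
        x = x + dx.astype(np.float64)                  # float64 update
    return x
```

Each pass shrinks the error by a factor that depends on the inner solver's accuracy and the conditioning of A, so the loop converges to full float64 accuracy even though most arithmetic happens at low precision.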
network based information systems | 2015
Erika Ábrahám; Costas Bekas; Ivona Brandic; Samir Genaim; Einar Broch Johnsen; Ivan Kondov; Sabri Pllana; Achim Streit
While the HPC community is working towards the development of the first Exaflop computer (expected around 2020), after reaching the Petaflop milestone in 2008 still only a few HPC applications are able to fully exploit the capabilities of Petaflop systems. In this paper we argue that efforts for preparing HPC applications for Exascale should start before such systems become available. We identify challenges that need to be addressed and recommend solutions in key areas of interest, including formal modeling, static analysis and optimization, runtime analysis and optimization, and autonomic computing. Furthermore, we outline a conceptual framework for porting HPC applications to future Exascale computing systems and propose steps for its implementation.
Computer Science - Research and Development | 2013
Yves Ineichen; Andreas Adelmann; Costas Bekas; Alessandro Curioni; Peter Arbenz
Particle accelerators are invaluable tools for research in the basic and applied sciences, in fields such as materials science, chemistry, the biosciences, particle physics, nuclear physics and medicine. The design, commissioning, and operation of accelerator facilities is a non-trivial task, due to the large number of control parameters and the complex interplay of several conflicting design goals. We propose to tackle this problem by means of multi-objective optimization algorithms which also facilitate massively parallel deployment. In order to compute solutions in a meaningful time frame, which can even admit online optimization, we require a fast and scalable software framework. In this paper, we focus on the key and most heavily used component of the optimization framework, the forward solver. We demonstrate that our parallel methods achieve a strong and weak scalability improvement of at least two orders of magnitude in today's actual particle beam configurations, reducing total time to solution by a substantial factor. Our target platform is the Blue Gene/P (Blue Gene/P is a trademark of the International Business Machines Corporation in the United States, other countries, or both) supercomputer. The space-charge model used in the forward solver relies significantly on collective communication. Thus, the dedicated TREE network of the platform serves as an ideal vehicle for our purposes. We demonstrate excellent strong and weak scalability of our software, which allows us to perform thousands of forward solves in a matter of minutes, already approaching online optimization capability.
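Multi-objective optimization, as invoked above, ranks candidate designs by Pareto dominance rather than a single score. A minimal, illustrative sketch (not the paper's actual optimization framework):

```python
def dominates(q, p):
    """q dominates p if q is no worse in every objective and strictly
    better in at least one (all objectives minimized)."""
    return (all(qi <= pi for qi, pi in zip(q, p)) and
            any(qi < pi for qi, pi in zip(q, p)))

def pareto_front(points):
    """Return the non-dominated subset of a list of objective vectors,
    e.g. (beam emittance, energy spread) pairs from forward solves."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each candidate point here would correspond to one run of the forward solver, which is why the solver's speed dominates the total time to solution.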
international symposium on performance analysis of systems and software | 2016
Heiner Giefers; Peter W. J. Staar; Costas Bekas; Christoph Hagleitner
Hardware accelerators have evolved as the most prominent vehicle to meet the demanding performance and energy-efficiency constraints of modern computer systems. The prevalent type of hardware accelerators in the high-performance computing domain are PCIe-attached co-processors to which the CPU can offload compute-intensive tasks. In this paper, we analyze the performance, power, and energy-efficiency of such accelerators for sparse matrix multiplication kernels. Improving the efficiency of sparse matrix operations is of eminent importance since they work at the core of graph analytics algorithms, which are in turn key to many big data knowledge discovery workloads. Our study involves GPU, Xeon Phi, and FPGA co-processors to embrace the vast majority of hardware accelerators applied in modern HPC systems. In order to compare the devices at the same level of implementation quality we apply vendor-optimized libraries for which published results exist. From our experiments we deduce that none of the compared devices generally dominates in terms of energy-efficiency and that the optimal solution depends on the actual sparse matrix data, data transfer requirements and on the applied efficiency metric. We also show that a combined use of multiple accelerators can further improve the system's performance and efficiency by up to 11% and 18%, respectively.
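The kernel benchmarked across these devices is sparse matrix-vector multiplication over a compressed format. A minimal sketch of the standard CSR (compressed sparse row) kernel, whose irregular, data-dependent memory accesses are what make the optimal device matrix-dependent:

```python
def csr_matvec(data, indices, indptr, x):
    """y = A @ x for A stored in CSR format:
    data    - nonzero values, row by row
    indices - column index of each nonzero
    indptr  - indptr[i]:indptr[i+1] delimits row i's nonzeros
    The indirect load x[indices[k]] is the irregular access that
    dominates performance on all of the studied accelerators."""
    n = len(indptr) - 1
    y = [0.0] * n
    for row in range(n):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]
        y[row] = s
    return y
```

Vendor libraries (e.g. cuSPARSE, MKL) implement this same kernel with device-specific blocking and vectorization, which is why the study compares them rather than hand-written code.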
international parallel and distributed processing symposium | 2016
Stefan Eilemann; Fabien Delalondre; Jon Bernard; Judit Planas; Felix Schuermann; John Biddiscombe; Costas Bekas; Alessandro Curioni; Bernard Metzler; Peter Kaltstein; Peter Morjan; Joachim Fenkes; Ralph Bellofatto; Lars Schneidenbach; T. J. Christopher Ward; Blake G. Fitch
Scientific workflows are often composed of compute-intensive simulations and data-intensive analysis and visualization, both equally important for productivity. High-performance computers run the compute-intensive phases efficiently, but data-intensive processing is still getting less attention. Dense non-volatile memory integrated into supercomputers can help address this problem. In addition to density, it offers significantly finer-grained I/O than disk-based I/O systems. We present a way to exploit the fundamental capabilities of Storage-Class Memories (SCM), such as Flash, by using scalable key-value (KV) I/O methods instead of traditional file I/O calls commonly used in HPC systems. Our objective is to enable higher performance for on-line and near-line storage for analysis and visualization of very high resolution, but correspondingly transient, simulation results. In this paper, we describe 1) the adaptation of a scalable key-value store to a BlueGene/Q system with integrated Flash memory, 2) a novel key-value aggregation module which implements coalesced, function-shipped calls between the clients and the servers, and 3) the refactoring of a scientific workflow to use application-relevant keys for fine-grained data subsets. The resulting implementation is analogous to function-shipping of POSIX I/O calls but shows an order of magnitude increase in read and a 2.5x increase in write IOPS performance (11 million read IOPS, 2.5 million write IOPS from 4096 compute nodes) when compared to a classical file system on the same system. It represents an innovative approach for the integration of SCM within an HPC system at scale.
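The idea of "application-relevant keys" is that a consumer fetches exactly the data subset it names, rather than seeking through a file. A minimal in-memory sketch (hypothetical key layout; the paper's system additionally coalesces and function-ships these calls to Flash-backed servers):

```python
class SimulationStore:
    """Sketch of key-value I/O for simulation output: data subsets are
    addressed by (timestep, field, region) keys instead of byte offsets
    inside POSIX files."""

    def __init__(self):
        self._kv = {}

    def put(self, timestep, field, region, values):
        """Write one fine-grained subset under an application-level key."""
        self._kv[(timestep, field, region)] = values

    def get(self, timestep, field, region):
        """Read back exactly the named subset; no seeks, no parsing."""
        return self._kv[(timestep, field, region)]
```

Because every subset has its own key, a visualization client can pull a single region of a single field at a single timestep, which is the fine-grained access pattern that SCM serves far better than disk.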
international parallel and distributed processing symposium | 2016
Peter W. J. Staar; Panagiotis Kl. Barkoutsos; Roxana Istrate; A. Cristiano I. Malossi; Ivano Tavernelli; Nikolaj Moll; Heiner Giefers; Christoph Hagleitner; Costas Bekas; Alessandro Curioni
In this era of Big Data, large graphs appear in many scientific domains. To extract the hidden knowledge/correlations in these graphs, novel methods need to be developed to analyse these graphs quickly. In this paper, we present a unified framework of stochastic matrix-function estimators, which allows one to compute a subset of elements of the matrix f(A), where f is an arbitrary function and A is the adjacency matrix of the graph. The new framework has a computational cost proportional to the size of the subset: to obtain the diagonal of f(A) for a matrix of size N, the computational cost is proportional to N, in contrast to the traditional O(N^3) cost of diagonalization. Furthermore, we will show that the new framework allows us to write implementations of the algorithm that scale naturally with the number of compute nodes and are easily ported to accelerators, where the kernels perform very well.
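A standard way to estimate diag(f(A)) from matrix-vector products alone is a Hutchinson-style probing estimator; the sketch below illustrates the principle (it is in the same family as, but not necessarily identical to, the paper's framework):

```python
import numpy as np

def stochastic_diag_f(n, matfun_vec, n_samples=4000, seed=0):
    """Estimate diag(f(A)) using only products f(A) @ v:
    diag(f(A)) ~= (1/K) * sum_k v_k * (f(A) v_k)
    for Rademacher (+/-1) probe vectors v_k. The cost is K matrix
    function applications, independent of forming f(A) itself."""
    rng = np.random.default_rng(seed)
    acc = np.zeros(n)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=n)
        acc += v * matfun_vec(v)       # elementwise v ⊙ f(A)v
    return acc / n_samples
```

For example, with f(A) = A^2 the product f(A)v is just two sparse matvecs, A @ (A @ v), so the whole estimate needs no dense linear algebra; this is what makes the approach scale to large graphs and port cleanly to accelerators.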
knowledge discovery and data mining | 2018
Peter W. J. Staar; Michele Dolfi; Christoph Auer; Costas Bekas
Over the past few decades, the amount of scientific articles and technical literature has increased exponentially in size. Consequently, there is a great need for systems that can ingest these documents at scale and make the contained knowledge discoverable. Unfortunately, both the format of these documents (e.g. the PDF format or bitmap images) as well as the presentation of the data (e.g. complex tables) make the extraction of qualitative and quantitative data extremely challenging. In this paper, we present a modular, cloud-based platform to ingest documents at scale. This platform, called the Corpus Conversion Service (CCS), implements a pipeline which allows users to parse and annotate documents (i.e. collect ground-truth), train machine-learning classification algorithms and ultimately convert any type of PDF or bitmap documents to a structured content representation format. We will show that each of the modules is scalable due to an asynchronous microservice architecture and can therefore handle massive amounts of documents. Furthermore, we will show that our capability to gather ground-truth is accelerated by machine-learning algorithms by at least one order of magnitude. This allows us to both gather large amounts of ground-truth in very little time and obtain very good precision/recall metrics in the range of 99% with regard to content conversion to structured output. The CCS platform is currently deployed on IBM internal infrastructure and serving more than 250 active users for knowledge-engineering project engagements.
international conference on embedded computer systems architectures modeling and simulation | 2016
Giorgis Georgakoudis; Charles J. Gillan; Ahmad Hassan; Umar Ibrahim Minhas; Ivor T. A. Spence; George Tzenakis; Hans Vandierendonck; Roger F. Woods; Dimitrios S. Nikolopoulos; Murali Shyamsundar; Paul Barber; Matthew Russell; Angelos Bilas; Stelios Kaloutsakis; Heiner Giefers; Peter W. J. Staar; Costas Bekas; Neil Horlock; Richard Faloon; Colin Pattison
NanoStreams explores the design, implementation, and system software stack of micro-servers aimed at processing data in-situ and in real time. These micro-servers can serve the emerging Edge computing ecosystem, namely the provisioning of advanced computational, storage, and networking capability near data sources to achieve both low-latency event processing and high-throughput analytical processing, before considering off-loading some of this processing to high-capacity data centres. NanoStreams explores a scale-out micro-server architecture that can achieve equivalent QoS to that of conventional rack-mounted servers for high-capacity data centres, but with dramatically reduced form factors and power consumption. To this end, NanoStreams introduces novel solutions in programmable and configurable hardware accelerators, as well as the system software stack used to access, share, and program those accelerators. Our NanoStreams micro-server prototype has demonstrated 5.5x higher energy-efficiency than a standard Xeon server. Simulations of the micro-server's memory system, extended to leverage hybrid DDR/NVM main memory, indicated 5x higher energy-efficiency than a conventional DDR-based system.
ieee international conference on high performance computing, data, and analytics | 2016
Valéry Weber; A. Cristiano I. Malossi; Ivano Tavernelli; Teodoro Laino; Costas Bekas; Manish Modani; Nina Wilner; Tom Heller; Alessandro Curioni
In this article, we present the algorithmic adaptation and code re-engineering required for porting highly successful and popular planewave codes to next-generation heterogeneous OpenPOWER architectures that foster acceleration and high-bandwidth links to GPUs. Here we focus on CPMD as the most representative software for ab initio molecular dynamics simulations. We have ported the construction of the electronic density, the application of the potential to the wavefunctions and the orthogonalization procedure to the GPU. The different GPU kernels consist mainly of fast Fourier transforms (FFT) and basic linear algebra operations (BLAS). The performance of the new implementation obtained on Firestone (POWER8/Tesla) is discussed. We show that communication between the host and the GPU contributes a large fraction of the total run time. We expect a strong attenuation of this communication bottleneck once the NVLink high-speed interconnect becomes available.