
Publications


Featured research published by Elmar Peise.


SIAM Journal on Scientific Computing | 2013

High-Performance Solvers for Dense Hermitian Eigenproblems

Matthias Petschow; Elmar Peise; Paolo Bientinesi

We introduce a new collection of solvers - subsequently called EleMRRR - for large-scale dense Hermitian eigenproblems. EleMRRR solves various types of problems: generalized, standard, and tridiagonal eigenproblems. Among these, the last is of particular importance, as it is both a solver in its own right and the computational kernel for the first two; we present a fast and scalable tridiagonal solver based on the Algorithm of Multiple Relatively Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers, PMRRR is part of the freely available Elemental library and is designed to fully support both message-passing (MPI) and multithreading (SMP) parallelism. As a result, the solvers can be used equally in pure MPI or in hybrid MPI-SMP fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's solvers on two supercomputers. Such a study, performed with up to 8,192 cores, provides precise guidelines for assembling the fastest solver within the ScaLAPACK framework; it also indicates that EleMRRR outperforms even the fastest solvers built from ScaLAPACK's components.
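The pipeline sketched in the abstract reduces a generalized eigenproblem to a standard one before the tridiagonal stage. A minimal NumPy/SciPy sketch of that first reduction step is shown below; the function name is ours for illustration, and this is not EleMRRR or PMRRR code.

```python
import numpy as np
from scipy.linalg import cholesky, eigh, solve_triangular

def generalized_to_standard(A, B):
    """Reduce A x = lambda B x (A Hermitian, B Hermitian positive definite)
    to the standard problem C y = lambda y with C = L^-1 A L^-H,
    where B = L L^H is the Cholesky factorization of B."""
    L = cholesky(B, lower=True)
    Y = solve_triangular(L, A, lower=True)                    # Y = L^-1 A
    C = solve_triangular(L, Y.conj().T, lower=True).conj().T  # C = Y L^-H
    return C, L

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n)); A = (A + A.T) / 2            # symmetric A
M = rng.standard_normal((n, n)); B = M @ M.T + n * np.eye(n)  # s.p.d. B

C, L = generalized_to_standard(A, B)
w_std = eigh(C, eigvals_only=True)     # eigenvalues of the standard problem
w_gen = eigh(A, B, eigvals_only=True)  # generalized reference
```

The eigenvalues of the reduced standard problem match those of the original generalized problem; eigenvectors would be recovered by back-transforming with L^-H.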


IEEE International Conference on High Performance Computing, Data and Analytics | 2012

Performance Modeling for Dense Linear Algebra

Elmar Peise; Paolo Bientinesi

It is well known that the behavior of dense linear algebra algorithms is greatly influenced by factors such as the target architecture, the underlying libraries, and even the problem size; because of this, accurately predicting their performance is a real challenge. In this article, we are not interested in creating accurate models for a given algorithm, but in correctly ranking a set of equivalent algorithms according to their performance. Aware of the hierarchical structure of dense linear algebra routines, we approach the problem by developing a framework for the automatic generation of statistical performance models for the BLAS and LAPACK libraries. This allows us to obtain predictions by evaluating and combining such models. We demonstrate that our approach is successful in both single- and multi-core environments, not only in ranking algorithms but also in tuning their parameters.
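The core idea of ranking equivalent algorithms by combining per-kernel models can be illustrated with a deliberately simple toy (a flop-count model rather than the paper's statistical models): two mathematically equivalent parenthesizations of a matrix chain product are ranked without executing either one.

```python
# Toy model-based ranking (illustrative, not the paper's framework):
# two equivalent algorithms for M = A @ B @ C are ranked by combining
# a simple cost model for the underlying GEMM kernel.

def gemm_cost(m, k, n):
    """Model for multiplying an (m x k) by a (k x n) matrix: 2*m*k*n flops."""
    return 2 * m * k * n

def rank_chain(m, k, n, p):
    """A: m x k, B: k x n, C: n x p. Compare (A@B)@C against A@(B@C)."""
    left = gemm_cost(m, k, n) + gemm_cost(m, n, p)   # (A@B)@C
    right = gemm_cost(k, n, p) + gemm_cost(m, k, p)  # A@(B@C)
    return sorted([("(A@B)@C", left), ("A@(B@C)", right)], key=lambda t: t[1])

# With a large inner dimension, the right-to-left order is predicted
# to be two orders of magnitude cheaper.
ranking = rank_chain(m=1000, k=10, n=1000, p=10)
```

The same pattern, which predicts each kernel invocation from a model and sums the predictions, generalizes to the hierarchical BLAS/LAPACK setting the abstract describes, with measured statistical models in place of the flop counts used here.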


arXiv: Mathematical Software | 2014

On the Performance Prediction of BLAS-based Tensor Contractions

Elmar Peise; Diego Fabregat-Traver; Paolo Bientinesi

Tensor operations are surging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While for operations on one- and two-dimensional tensors there exist standardized interfaces and highly optimized libraries (BLAS), for higher-dimensional tensors neither standards nor highly tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that involve only matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest of these algorithms without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
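One of the simplest instances of the decomposition described above maps a whole contraction onto a single GEMM by flattening free indices; the sketch below is illustrative and not the paper's generator.

```python
import numpy as np

# Map a tensor contraction onto one BLAS kernel: contract
#   C[a,b,d] = sum_c A[a,b,c] * B[c,d]
# by flattening the free indices (a,b) of A into one dimension, so the
# entire contraction becomes a single matrix-matrix product (GEMM).

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5))   # indices a, b, c
B = rng.standard_normal((5, 6))      # indices c, d

C_gemm = (A.reshape(3 * 4, 5) @ B).reshape(3, 4, 6)

# Reference: the same contraction expressed directly
C_ref = np.einsum("abc,cd->abd", A, B)
```

For contractions whose contracted index is not the last one, or which leave only a single free index per operand, the decomposition instead yields sequences of smaller GEMM or GEMV calls, which is where the large family of alternative algorithms comes from.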


International Journal of High Performance Computing Applications | 2018

The ELAPS framework: Experimental Linear Algebra Performance Studies

Elmar Peise; Paolo Bientinesi

In scientific computing, optimal use of computing resources comes at the cost of extensive coding, tuning, and benchmarking. While the classic approach of “features first, performance later” is supported by a variety of tools such as Tau, Vampir, and Scalasca, the emerging performance-centric approach, in which both features and performance are primary objectives, is still lacking suitable development tools. For dense linear algebra applications, we fill this gap with the Experimental Linear Algebra Performance Studies (ELAPS) framework, a multi-platform open-source environment for easy, fast, and yet powerful performance experimentation and prototyping. In contrast to many existing tools, ELAPS targets the beginning of the development process, assisting application developers in both algorithmic and optimization decisions. With ELAPS, users construct experiments to investigate how performance and efficiency depend on factors such as caching, algorithmic parameters, problem size, and parallelism. Experiments are designed either through Python scripts or a specialized Graphical User Interface (GUI), and run on a spectrum of architectures, ranging from laptops to accelerators and clusters. The resulting reports provide various metrics and statistics that can be analyzed both numerically and visually. In this article, we introduce ELAPS and illustrate its practical value in guiding critical performance decisions already in early development stages.
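The kind of experiment ELAPS supports (timing a kernel across problem sizes and reporting statistics over repetitions) can be sketched in a few lines of plain Python; the helper names below are ours and are not the ELAPS API.

```python
import time
import statistics
import numpy as np

# Minimal performance-experiment sketch in the spirit of ELAPS:
# time one kernel across problem sizes, repeat each measurement,
# and report simple statistics per size.

def time_kernel(kernel, reps=5):
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    return {"min": min(samples), "median": statistics.median(samples)}

results = {}
for n in (64, 128, 256):
    A = np.random.default_rng(0).standard_normal((n, n))
    results[n] = time_kernel(lambda: A @ A)   # kernel under study: GEMM
```

In ELAPS itself, such experiments are assembled through the GUI or Python scripts, can sweep algorithmic parameters and thread counts as well as problem sizes, and produce reports with richer metrics (e.g., flops/s) than this sketch.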


Parallel Computing | 2015

High performance solutions for big-data GWAS

Elmar Peise; Diego Fabregat-Traver; Paolo Bientinesi

We consider mixed-models based genome-wide association studies of large scale. We address GWAS with only a single trait and with many traits. GWAS with arbitrarily large populations, and with arbitrarily many SNPs and traits, are enabled. Distributed-memory architectures, such as Cloud, clusters, and supercomputers, are used. Scalability with respect to problem size and resources is demonstrated.

In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: (1) millions of polymorphisms and thousands of phenotypes come at the cost of hundreds of gigabytes of data, which can only be kept in secondary storage; (2) the relatedness of the test population is represented by a relationship matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, by using distributed resources such as Cloud or clusters, we address both challenges: the genotype and phenotype data is streamed from secondary storage using the double-buffering technique, while the relationship matrix is kept across the main memory of a distributed-memory system. With the help of these solutions, we develop separate algorithms for studies involving only one or a multitude of traits. We show that these algorithms sustain high performance and allow the analysis of enormous datasets.
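The double-buffering technique mentioned above overlaps I/O with computation: while one chunk of data is processed, the next is already being loaded. The sketch below is our simplification using an in-memory array and a toy per-chunk computation; the real implementation streams from secondary storage on a distributed system.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def load_chunk(source, start, size):
    """Stand-in for reading a block of columns from secondary storage."""
    return source[:, start:start + size].copy()

def process_chunk(chunk, y):
    """Stand-in for the per-chunk computation (here simply X^T y)."""
    return chunk.T @ y

def stream_process(source, y, chunk_size):
    n = source.shape[1]
    out = []
    with ThreadPoolExecutor(max_workers=1) as io:
        nxt = io.submit(load_chunk, source, 0, chunk_size)
        for start in range(0, n, chunk_size):
            chunk = nxt.result()                    # wait for the prefetch
            if start + chunk_size < n:              # start loading next chunk
                nxt = io.submit(load_chunk, source,
                                start + chunk_size, chunk_size)
            out.append(process_chunk(chunk, y))     # compute overlaps the load
    return np.concatenate(out)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))   # toy "genotype" matrix
y = rng.standard_normal(100)
r = stream_process(X, y, chunk_size=4)
```

With true disk or network I/O, the background load and the foreground computation proceed concurrently, so the slower of the two, rather than their sum, dictates throughput.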


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels

Elmar Peise; Paolo Bientinesi

It is well known that caching is critical to attaining high performance: in many situations, data locality (in space and time) plays a bigger role than minimizing the number of arithmetic floating-point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.
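Whether a kernel's operands are cache-resident largely decides between its "warm" and "cold" performance, which is why sequences of kernels behave differently from kernels benchmarked in isolation. A minimal sketch of such a cache-residency check, with an illustrative placeholder cache size, is:

```python
# Decide, from operand sizes alone, whether a GEMM's working set fits in
# a given cache level (the cache size here is a hypothetical placeholder).

def working_set_bytes(m, k, n, dtype_bytes=8):
    """Bytes touched by C = A @ B with A (m x k), B (k x n), C (m x n)."""
    return (m * k + k * n + m * n) * dtype_bytes

def fits_in_cache(m, k, n, cache_bytes):
    return working_set_bytes(m, k, n) <= cache_bytes

L2 = 1024 * 1024                            # hypothetical 1 MiB L2 cache
small = fits_in_cache(128, 128, 128, L2)    # 3 * 128^2 * 8 B = 384 KiB
large = fits_in_cache(512, 512, 512, L2)    # 3 * 512^2 * 8 B = 6 MiB
```

A performance model that distinguishes these two regimes, and tracks which operands the previous kernel in a sequence left in cache, predicts sequence runtimes far more accurately than one that assumes a single per-kernel cost.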


arXiv: Computational Engineering, Finance, and Science | 2013

Algorithms for large-scale whole genome association analysis

Elmar Peise; Diego Fabregat-Traver; Yurii S. Aulchenko; Paolo Bientinesi

In order to associate complex traits with genetic polymorphisms, genome-wide association studies process huge datasets involving tens of thousands of individuals genotyped for millions of polymorphisms. When handling these datasets, which exceed the main memory of contemporary computers, one faces two distinct challenges: (1) millions of polymorphisms come at the cost of hundreds of gigabytes of genotype data, which can only be kept in secondary storage; (2) the relatedness of the test population is represented by a covariance matrix, which, for large populations, can only fit in the combined main memory of a distributed architecture. In this paper, we present solutions for both challenges: the genotype data is streamed from and to secondary storage using a double-buffering technique, while the covariance matrix is kept across the main memory of a distributed-memory system. We show that these methods sustain high performance and allow the analysis of enormous datasets.


ACM Transactions on Mathematical Software | 2017

Algorithm 979: Recursive Algorithms for Dense Linear Algebra—The ReLAPACK Collection

Elmar Peise; Paolo Bientinesi

To exploit both memory locality and the full performance potential of highly tuned kernels, dense linear algebra libraries, such as linear algebra package (LAPACK), commonly implement operations as blocked algorithms. However, to achieve near-optimal performance with such algorithms, significant tuning is required. In contrast, recursive algorithms are virtually tuning free and attain similar performance. In this article, we first analyze and compare blocked and recursive algorithms in terms of performance and then introduce recursive LAPACK (ReLAPACK), an open-source library of recursive algorithms to seamlessly replace many of LAPACK’s blocked algorithms. In most scenarios, ReLAPACK outperforms reference LAPACK and in many situations improves upon the performance of optimized libraries.
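The appeal of recursion over blocking can be seen in a compact example: a recursive Cholesky factorization splits the matrix in halves and recurses, so large matrix-matrix products do the bulk of the work without any tuned block size. This NumPy sketch is ours for illustration, not ReLAPACK's C implementation.

```python
import numpy as np

def recursive_cholesky(A):
    """Return lower-triangular L with A = L @ L.T (A symmetric pos. def.)."""
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    k = n // 2
    L11 = recursive_cholesky(A[:k, :k])
    # L21 = A21 @ inv(L11).T, obtained by solving L11 @ X = A21.T
    # (a general solve is used here for brevity; a triangular solve suffices)
    L21 = np.linalg.solve(L11, A[k:, :k].T).T
    # Recurse on the updated trailing block (the Schur complement)
    L22 = recursive_cholesky(A[k:, k:] - L21 @ L21.T)
    L = np.zeros_like(A)
    L[:k, :k] = L11
    L[k:, :k] = L21
    L[k:, k:] = L22
    return L

rng = np.random.default_rng(4)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)   # symmetric positive definite test matrix
L = recursive_cholesky(A)
```

A blocked LAPACK-style algorithm would instead march across the matrix in panels of a fixed, machine-dependent width; the recursion reaches large, efficient GEMM and TRSM operands automatically, which is the tuning-free behavior the article exploits.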


Computer Physics Communications | 2017

High-performance generation of the Hamiltonian and Overlap matrices in FLAPW methods

Edoardo Di Napoli; Elmar Peise; Markus Hrywniak; Paolo Bientinesi


arXiv: Mathematical Software | 2016

Recursive Algorithms for Dense Linear Algebra: The ReLAPACK Collection

Elmar Peise; Paolo Bientinesi

Collaboration


Dive into Elmar Peise's collaborations.

Top Co-Authors

Uwe Naumann

RWTH Aachen University


Yurii S. Aulchenko

Novosibirsk State University
