Mark Hoemmen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mark Hoemmen is active.

Explore More

Publication

Featured researches published by Mark Hoemmen.

SIAM Journal on Scientific Computing | 2012

Communication-optimal Parallel and Sequential QR and LU Factorizations

James Demmel; Laura Grigori; Mark Hoemmen; Julien Langou

We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m-by-n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m >> n. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization.

ieee international conference on high performance computing data and analytics | 2009

Minimizing communication in sparse matrix solvers

Marghoob Mohiyuddin; Mark Hoemmen; James Demmel; Katherine A. Yelick

Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparse-matrix-vector-multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory and network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and reading the matrix A from DRAM to cache just once, instead of k times on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown gets speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.

Acta Numerica | 2014

Communication lower bounds and optimal algorithms for numerical linear algebra

Grey Ballard; Erin Carson; James Demmel; Mark Hoemmen; Nicholas Knight; Oded Schwartz

The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n3)) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.

Scientific Programming | 2012

Amesos2 and Belos: Direct and iterative solvers for large sparse linear systems

Eric Bavier; Mark Hoemmen; Sivasankaran Rajamanickam; Heidi K. Thornquist

Solvers for large sparse linear systems come in two categories: direct and iterative. Amesos2, a package in the Trilinos software project, provides direct methods, and Belos, another Trilinos package, provides iterative methods. Amesos2 offers a common interface to many different sparse matrix factorization codes, and can handle any implementation of sparse matrices and vectors, via an easy-to-extend C++ traits interface. It can also factor matrices whose entries have arbitrary “Scalar” type, enabling extended-precision and mixed-precision algorithms. Belos includes many different iterative methods for solving large sparse linear systems and least-squares problems. Unlike competing iterative solver libraries, Belos completely decouples the algorithms from the implementations of the underlying linear algebra objects. This lets Belos exploit the latest hardware without changes to the code. Belos favors algorithms that solve higher-level problems, such as multiple simultaneous linear systems and sequences of related linear systems, faster than standard algorithms. The package also supports extended-precision and mixed-precision algorithms. Together, Amesos2 and Belos form a complete suite of sparse linear solvers.

international conference on parallel processing | 2011

Cooperative Application/OS DRAM fault recovery

Patrick G. Bridges; Mark Hoemmen; Kurt B. Ferreira; Michael A. Heroux; Philip Soltero; Ron Brightwell

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application / OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

international parallel and distributed processing symposium | 2014

Improving the Performance of CA-GMRES on Multicores with Multiple GPUs

Ichitaro Yamazaki; Hartwig Anzt; Stanimire Tomov; Mark Hoemmen; Jack J. Dongarra

The Generalized Minimum Residual (GMRES) method is one of the most widely-used iterative methods for solving nonsymmetric linear systems of equations. In recent years, techniques to avoid communication in GMRES have gained attention because in comparison to floating-point operations, communication is becoming increasingly expensive on modern computers. Since graphics processing units (GPUs) are now becoming crucial component in computing, we investigate the effectiveness of these techniques on multicore CPUs with multiple GPUs. While we present the detailed performance studies of a matrix powers kernel on multiple GPUs, we particularly focus on orthogonalization strategies that have a great impact on both the numerical stability and performance of GMRES, especially as the matrix becomes sparser or ill-conditioned. We present the experimental results on two eight-core Intel Sandy Bridge CPUs with three NDIVIA Fermi GPUs and demonstrate that significant speedups can be obtained by avoiding communication, either on a GPU or between the GPUs. As part of our study, we investigate several optimization techniques for the GPU kernels that can also be used in other iterative solvers besides GMRES. Hence, our studies not only emphasize the importance of avoiding communication on GPUs, but they also provide insight about the effects of these optimization techniques on the performance of the sparse solvers, and may have greater impact beyond GMRES.

international conference on conceptual structures | 2015

Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience☆

Andrew A. Chien; Pavan Balaji; P. Beckman; Nan Dun; Aiman Fang; Hajime Fujita; Kamil Iskra; Zachary A. Rubenstein; Z. Zheng; R. Schreiber; J. Hammond; J. Dinan; Ignacio Laguna; D. Richards; A. Dubey; B. van Straalen; Mark Hoemmen; Michael A. Heroux; Keita Teranishi; Andrew R. Siegel

Abstract Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVRs interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small ( 2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads 2% are achieved. We conclude that GVRs interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.

Parallel Processing Letters | 2014

TOWARDS EXTREME-SCALE SIMULATIONS FOR LOW MACH FLUIDS WITH SECOND-GENERATION TRILINOS

Paul Lin; Matthew Tyler Bettencourt; Stefan P. Domino; Travis C. Fisher; Mark Hoemmen; Jonathan Joseph Hu; Eric Todd Phipps; Andrey Prokopenko; Sivasankaran Rajamanickam; Christopher Siefert; Stephen Kennon

Trilinos is an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. While Trilinos was originally designed for scalable solutions of large problems, the fidelity needed by many simulations is significantly greater than what one could have envisioned two decades ago. When problem sizes exceed a billion elements even scalable applications and solver stacks require a complete revision. The second-generation Trilinos employs C++ templates in order to solve arbitrarily large problems. We present a case study of the integration of Trilinos with a low Mach fluids engineering application (SIERRA low Mach module/Nalu). Through the use of improved algorithms and better software engineering practices, we demonstrate good weak scaling for up to a nine billion element large eddy simulation (LES) problem on unstructured meshes with a 27 billion row matrix on 524,288 cores of an IBM Blue Gene/Q platform.

international parallel and distributed processing symposium | 2014

Towards Extreme-Scale Simulations with Next-Generation Trilinos: A Low Mach Fluid Application Case Study

Trilinos is an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. While the original version of Trilinos was designed for highly scalable solutions for large problems, the need for increasingly higher fidelity simulations has pushed the problem sizes beyond what could have been envisioned two decades ago. When problem sizes exceed a billion elements even highly scalable applications and solver stacks require a complete revision. The next-generation Trilinos employs C++ templates in order to solve arbitrarily large problems and enable extreme-scale simulations. We present a case study that involves integration of Trilinos with an engineering application (Sierra low Mach module/Nalu), involving the simulation of low Mach fluid flow for problems of size up to nine billion elements. Through the use of improved algorithms and better software engineering practices, we demonstrate good weak scaling for the matrix assembly and solve for the engineering application for up to a nine billion element fluid flow large eddy simulation (LES) problem on unstructured meshes with a 27 billion row matrix on 131,072 cores of a Cray XE6 platform.

SIAM Journal on Scientific Computing | 2009