Is this you? Create Your Porfile

Mathias Jacquelin

Lawrence Berkeley National Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mathias Jacquelin is active.

Explore More

Publication

Featured researches published by Mathias Jacquelin.

ACM Transactions on Mathematical Software | 2017

PSelInv—A Distributed Memory Parallel Algorithm for Selected Inversion: The Symmetric Case

Mathias Jacquelin; Lin Lin; Chao Yang

We describe an efficient parallel implementation of the selected inversion algorithm for distributed memory computer systems, which we call PSelInv. The PSelInv method computes selected elements of a general sparse matrix A that can be decomposed as A = LU, where L is lower triangular and U is upper triangular. The implementation described in this article focuses on the case of sparse symmetric matrices. It contains an interface that is compatible with the distributed memory parallel sparse direct factorization SuperLU_DIST. However, the underlying data structure and design of PSelInv allows it to be easily combined with other factorization routines, such as PARDISO. We discuss general parallelization strategies such as data and task distribution schemes. In particular, we describe how to exploit the concurrency exposed by the elimination tree associated with the LU factorization of A. We demonstrate the efficiency and accuracy of PSelInv by presenting several numerical experiments. In particular, we show that PSelInv can run efficiently on more than 4,000 cores for a modestly sized matrix. We also demonstrate how PSelInv can be used to accelerate large-scale electronic structure calculations.

international parallel and distributed processing symposium | 2014

Reconstructing Householder Vectors from Tall-Skinny QR

Grey Ballard; James Demmel; Laura Grigori; Mathias Jacquelin; Hong Diep Nguyen; Edgar Solomonik

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms.

Journal of Parallel and Distributed Computing | 2015

Reconstructing Householder vectors from Tall-Skinny QR

Grey Ballard; James Demmel; Laura Grigori; Mathias Jacquelin; Nicholas Knight; Hong Diep Nguyen

Computer Physics Communications | 2018

ELSI: A unified software interface for Kohn–Sham electronic structure solvers

Victor Yu; Fabiano Corsetti; Alberto García; William Huhn; Mathias Jacquelin; Weile Jia; Björn Lange; Lin Lin; Jianfeng Lu; Wenhui Mi; Ali Seifitokaldani; Alvaro Vazquez-Mayagoitia; Chao Yang; Haizhao Yang; Volker Blum

Abstract Solving the electronic structure from a generalized or standard eigenproblem is often the bottleneck in large scale calculations based on Kohn–Sham density-functional theory. This problem must be addressed by essentially all current electronic structure codes, based on similar matrix expressions, and by high-performance computation. We here present a unified software interface, ELSI, to access different strategies that address the Kohn–Sham eigenvalue problem. Currently supported algorithms include the dense generalized eigensolver library ELPA, the orbital minimization method implemented in libOMM, and the pole expansion and selected inversion (PEXSI) approach with lower computational complexity for semilocal density functionals. The ELSI interface aims to simplify the implementation and optimal use of the different strategies, by offering (a) a unified software framework designed for the electronic structure solvers in Kohn–Sham density-functional theory; (b) reasonable default parameters for a chosen solver; (c) automatic conversion between input and internal working matrix formats, and in the future (d) recommendation of the optimal solver depending on the specific problem. Comparative benchmarks are shown for system sizes up to 11,520 atoms (172,800 basis functions) on distributed memory supercomputing architectures. Program summary Program title: ELSI Interface Program Files doi: http://dx.doi.org/10.17632/y8vzhzdm62.1 Licensing provisions: BSD 3-clause Programming language: Fortran 2003, with interface to C/C++ External routines/libraries: MPI, BLAS, LAPACK, ScaLAPACK, ELPA, libOMM, PEXSI, ParMETIS, SuperLU_DIST Nature of problem: Solving the electronic structure from a generalized or standard eigenvalue problem in calculations based on Kohn–Sham density functional theory (KS-DFT). Solution method: To connect the KS-DFT codes and the KS electronic structure solvers, ELSI provides a unified software interface with reasonable default parameters, hierarchical control over the interface and the solvers, and automatic conversions between input and internal working matrix formats. Supported solvers are: ELPA (dense generalized eigensolver), libOMM (orbital minimization method), and PEXSI (pole expansion and selected inversion method). Restrictions: The ELSI interface requires complete information of the Hamiltonian matrix.

ieee international conference on high performance computing data and analytics | 2018

A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems

Mathias Jacquelin; Lin Lin; Weile Jia; Yonghua Zhao; Chao Yang

Given a sparse matrix A, the selected inversion algorithm is an efficient method for computing certain selected elements of A-1. These selected elements correspond to all or some nonzero elements of the LU factors of A. In many ways, the types of matrix updates performed in the selected inversion algorithm are similar to those performed in the LU factorization, although the sequence of operations is different. In the context of LU factorization, it is known that the left-looking and right-looking algorithms exhibit different memory access and data communication patterns, and hence different behavior on shared memory and distributed memory parallel machines. Corresponding to right-looking and left-looking LU factorization, the selected inversion algorithm can be organized as a left-looking or a right-looking algorithm. The parallel right-looking version of the algorithm has been developed in [9]. The sequence of operations performed in this version of the selected inversion algorithm is similar to those performed in a left-looking LU factorization algorithm. In this paper, we describe the left-looking variant of the selected inversion algorithm, and present an efficient implementation of the algorithm for shared memory machines using a task parallel method. We demonstrate that with the task scheduling features provided by OpenMP 4.0, the left-looking selected inversion algorithm can scale well both on the Intel Haswell multicore architecture and on the Intel Knights Landing (KNL) manycore architecture up to 16 and 64 cores, respectively. On the KNL architecture, we observe that the maximum parallel efficiency achieved by the left-looking selected inversion algorithm can be as high as 62% even when all 64 cores are used, despite the inherent asynchronous nature of the computation and communication patterns in sparse matrix operations. Compared to the right-looking selected inversion algorithm, the left-looking formulation facilitates efficient pipelining of operations along different branches of the elimination tree, and can be a promising candidate for future development of massively parallel selected inversion algorithms on heterogeneous architectures.

parallel computing | 2017

PSelInv – A distributed memory parallel algorithm for selected inversion: The non-symmetric case

Mathias Jacquelin; Lin Lin; Chao Yang

Abstract This paper generalizes the parallel selected inversion algorithm called PSelInv to sparse non-symmetric matrices. We assume a general sparse matrix A has been decomposed as P A Q = L U on a distributed memory parallel machine, where L, U are lower and upper triangular matrices, and P, Q are permutation matrices, respectively. The PSelInv method computes selected elements of A − 1 . The selection is confined by the sparsity pattern of the matrix AT. Our algorithm does not assume any symmetry properties of A, and our parallel implementation is memory efficient, in the sense that the computed elements of A − T overwrites the sparse matrix L + U in situ. PSelInv involves a large number of collective data communication activities within different processor groups of various sizes. In order to minimize idle time and improve load balancing, tree-based asynchronous communication is used to coordinate all such collective communication. Numerical results demonstrate that PSelInv can scale efficiently to 6,400 cores for a variety of matrices.

acm symposium on parallel algorithms and architectures | 2018

A 3D Parallel Algorithm for QR Decomposition

Grey Ballard; James Demmel; Laura Grigori; Mathias Jacquelin; Nicholas Knight

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.

Archive | 2018

UPC++ Specification v1.0, Draft 4

John Bachan; Scott B. Baden; Dan Bonachea; Paul Hargrove; Steven A. Hofmeyr; Khaled Z. Ibrahim; Mathias Jacquelin; Amir Kamil; B. Lelbach; B. van Straalen

Author(s): Bachan, J; Baden, Scott; Bonachea, Dan; Hargrove, Paul; Hofmeyr, S; Ibrahim, Khaled; Jacquelin, M; Kamil, Amir; Lelbach, B; van Straalen, B | Abstract: This document has been superseded by: UPC++ Specification v1.0, Draft 8 (LBNL-2001179) https://escholarship.org/uc/item/55f9x4wg UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

Archive | 2018

UPC++ Programmer’s Guide, v1.0-2017.9

John Bachan; Scott B. Baden; Dan Bonachea; Paul Hargrove; Steven A. Hofmeyr; Khaled Z. Ibrahim; Mathias Jacquelin; Amir Kamil; B. van Straalen

Author(s): Bachan, J; Baden, S; Bonachea, D; Hargrove, P; Hofmeyr, S; Ibrahim, K; Jacquelin, M; Kamil, A; van Straalen, B | Abstract: This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136) https://escholarship.org/uc/item/10g5t8jr UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

ieee international conference on high performance computing, data, and analytics | 2017

Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel ® Xeon Phi™ Processor

Eric J. Bylaska; Mathias Jacquelin; Wibe A. de Jong; Jeff R. Hammond; Michael Klemm

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.

Explore More