Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jeff R. Hammond is active.

Publication


Featured research published by Jeff R. Hammond.


ACM Transactions on Mathematical Software | 2013

Elemental: A New Framework for Distributed Memory Dense Matrix Computations

Jack Poulson; Bryan Marker; Robert A. van de Geijn; Jeff R. Hammond; Nichols A. Romero

Parallelizing dense matrix computations to distributed memory architectures is a well-studied subject and generally considered to be among the best understood domains of parallel computing. Two packages, developed in the mid 1990s, still enjoy regular use: ScaLAPACK and PLAPACK. With the advent of many-core architectures, which may very well take the shape of distributed memory architectures within a single processor, these packages must be revisited since the traditional MPI-based approaches will likely need to be extended. Thus, this is a good time to review lessons learned since the introduction of these two packages and to propose a simple yet effective alternative. Preliminary performance results show the new solution achieves competitive, if not superior, performance on large clusters.
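
For readers unfamiliar with the distinction from ScaLAPACK and PLAPACK: Elemental's hallmark is an element-wise cyclic distribution over a two-dimensional process grid rather than a block-cyclic one. The following sketch is purely illustrative (the struct and function names are invented, not Elemental's API) and shows how a global matrix entry maps to a process and a local index under such a distribution.

```cpp
#include <cstdio>

// Illustrative only: element-cyclic distribution of a matrix over a pr x pc
// process grid, in the spirit of Elemental's [MC,MR] layout. Names and
// signatures here are invented for the sketch, not Elemental's API.
struct Owner {
    int prow, pcol;   // process grid coordinates that own the entry
    int lrow, lcol;   // local indices on that process
};

Owner locate(int i, int j, int pr, int pc) {
    return Owner{ i % pr, j % pc, i / pr, j / pc };
}

int main() {
    const int pr = 2, pc = 3;          // assumed 2 x 3 process grid
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            Owner o = locate(i, j, pr, pc);
            std::printf("A(%d,%d) -> process (%d,%d), local (%d,%d)\n",
                        i, j, o.prow, o.pcol, o.lrow, o.lcol);
        }
    return 0;
}
```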


Journal of Chemical Theory and Computation | 2011

Coupled Cluster Theory on Graphics Processing Units I. The Coupled Cluster Doubles Method.

A. Eugene DePrince; Jeff R. Hammond

The coupled cluster (CC) ansatz is generally recognized as providing one of the best wave function-based descriptions of electronic correlation in small- and medium-sized molecules. The fact that the CC equations with double excitations (CCD) may be expressed as a handful of dense matrix-matrix multiplications makes it an ideal method to be ported to graphics processing units (GPUs). We present our implementation of the spin-free CCD equations in which the entire iterative procedure is evaluated on the GPU. The GPU-accelerated algorithm readily achieves a factor of 4-5 speedup relative to the multithreaded CPU algorithm on same-generation hardware. The GPU-accelerated algorithm is approximately 8-12 times faster than Molpro, 17-22 times faster than NWChem, and 21-29 times faster than GAMESS for each CC iteration. Single-precision GPU-accelerated computations are also performed, leading to an additional doubling of performance. Single-precision errors in the energy are typically on the order of 10⁻⁶ hartrees and can be improved by about an order of magnitude by performing one additional iteration in double precision.
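
To make concrete the claim that the CCD equations reduce to dense matrix-matrix multiplications, here is an illustrative sketch (array names and layout are assumptions, not the authors' code) of one such term, the particle-particle ladder contraction R(a,b,i,j) += Σ_{c,d} V(a,b,c,d) T(c,d,i,j), written so that the composite indices (ab), (cd), and (ij) act as ordinary matrix dimensions; a GPU implementation would hand this flattened operation to a GEMM kernel.

```cpp
#include <vector>

// Illustrative contraction R(ab,ij) += V(ab,cd) * T(cd,ij): once the four-index
// tensors are stored with composite row/column indices, the CCD ladder term is
// literally a dense matrix-matrix multiply (here a naive triple loop; a GPU
// implementation would call a tuned GEMM instead).
void ladder_term(const std::vector<double>& V,   // (nv*nv) x (nv*nv)
                 const std::vector<double>& T,   // (nv*nv) x (no*no)
                 std::vector<double>& R,         // (nv*nv) x (no*no)
                 int nv, int no)
{
    const int AB = nv * nv, CD = nv * nv, IJ = no * no;
    for (int ab = 0; ab < AB; ++ab)
        for (int ij = 0; ij < IJ; ++ij) {
            double sum = 0.0;
            for (int cd = 0; cd < CD; ++cd)
                sum += V[ab * CD + cd] * T[cd * IJ + ij];
            R[ab * IJ + ij] += sum;
        }
}
```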


Journal of Parallel and Distributed Computing | 2014

A massively parallel tensor contraction framework for coupled-cluster computations

Edgar Solomonik; Devin A. Matthews; Jeff R. Hammond; John F. Stanton; James Demmel

Precise calculation of molecular electronic wavefunctions by methods such as coupled-cluster requires the computation of tensor contractions, the cost of which has polynomial computational scaling with respect to the system and basis set sizes. Each contraction may be executed via matrix multiplication on a properly ordered and structured tensor. However, data transpositions are often needed to reorder the tensors for each contraction. Writing and optimizing distributed-memory kernels for each transposition and contraction is tedious since the number of contractions scales combinatorially with the number of tensor indices. We present a distributed-memory numerical library (Cyclops Tensor Framework (CTF)) that automatically manages tensor blocking and redistribution to perform any user-specified contractions. CTF serves as the distributed-memory contraction engine in Aquarius, a new program designed for high-accuracy and massively-parallel quantum chemical computations. Aquarius implements a range of coupled-cluster and related methods such as CCSD and CCSDT by writing the equations on top of a C++ templated domain-specific language. This DSL calls CTF directly to manage the data and perform the contractions. Our CCSD and CCSDT implementations achieve high parallel scalability on the BlueGene/Q and Cray XC30 supercomputer architectures, showing that accurate electronic structure calculations can be effectively carried out on top of general distributed-memory tensor primitives.

We introduce Cyclops Tensor Framework (CTF), a distributed-memory library for tensor contractions. CTF is able to perform tensor decomposition, redistribution, and contraction at runtime. CTF enables the expression of massively-parallel coupled-cluster methods via a concise tensor contraction interface. The quantum chemistry software suite Aquarius employs CTF to execute two coupled-cluster methods: CCSD and CCSDT. The Aquarius CCSD and CCSDT codes scale well on BlueGene/Q and Cray XC30, comparing favorably to NWChem.
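
The flavor of CTF's user interface is a string-indexed contraction notation; the sketch below is written from memory, and the constructor arguments, header name, and symmetry flags should be treated as assumptions rather than a faithful copy of the current API.

```cpp
// Sketch of the kind of interface CTF exposes (from memory; exact constructor
// signatures and symmetry flags may differ from the current CTF release).
#include <mpi.h>
#include <ctf.hpp>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    {
        CTF::World dw(MPI_COMM_WORLD);            // all ranks participate
        int nv = 20, no = 10;
        int lens_vvvv[4] = {nv, nv, nv, nv};
        int lens_vvoo[4] = {nv, nv, no, no};
        int ns[4] = {NS, NS, NS, NS};             // no permutational symmetry

        CTF::Tensor<> V(4, lens_vvvv, ns, dw);    // integrals  V(a,b,c,d)
        CTF::Tensor<> T(4, lens_vvoo, ns, dw);    // amplitudes T(c,d,i,j)
        CTF::Tensor<> R(4, lens_vvoo, ns, dw);    // residual   R(a,b,i,j)

        // The user writes the contraction in index notation; CTF chooses the
        // blocking, redistribution, and local matrix-multiply calls.
        R["abij"] += V["abcd"] * T["cdij"];
    }
    MPI_Finalize();
    return 0;
}
```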


Journal of Chemical Theory and Computation | 2011

Flow-Dependent Unfolding and Refolding of an RNA by Nonequilibrium Umbrella Sampling.

Alex Dickson; Mark Maienschein-Cline; Allison Tovo-Dwyer; Jeff R. Hammond; Aaron R. Dinner

Nonequilibrium experiments of single biomolecules such as force-induced unfolding reveal details about a few degrees of freedom of a complex system. Molecular dynamics simulations can provide complementary information, but exploration of the space of possible configurations is often hindered by large barriers in phase space that separate metastable regions. To solve this problem, enhanced sampling methods have been developed that divide a phase space into regions and integrate trajectory segments in each region. These methods boost the probability of passage over barriers and facilitate parallelization since integration of the trajectory segments does not require communication, aside from their initialization and termination. Here, we present a parallel version of an enhanced sampling method suitable for systems driven far from equilibrium: nonequilibrium umbrella sampling (NEUS). We apply this method to a coarse-grained model of a 262-nucleotide RNA molecule that unfolds and refolds in an explicit flow field modeled with stochastic rotation dynamics. Using NEUS, we are able to observe extremely rare unfolding events that have mean first passage times as long as 45 s (1.1 × 10¹⁵ dynamics steps). We examine the unfolding process for a range of flow rates of the medium, and we describe two competing pathways in which different intramolecular contacts are broken.
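
As a very rough cartoon of the region-based bookkeeping described above (a toy sketch only; it omits the statistical reweighting that NEUS actually performs, and all names and the placeholder dynamics are invented), one can partition a one-dimensional order parameter into regions, run short trajectory segments from stored entry points, and hand boundary-crossing configurations to the neighboring region.

```cpp
#include <cstdlib>
#include <vector>

// Toy cartoon of region-based segment sampling on an order parameter x in [0,1).
// Each region keeps a pool of entry points; segments run until they leave the
// region, and their exit configurations seed the neighboring region's pool.
struct Region {
    double lo, hi;
    std::vector<double> entry_points;   // configurations that entered this region
};

double toy_step(double x) {             // placeholder for real (driven) dynamics
    return x + 0.01 * (2.0 * std::rand() / RAND_MAX - 1.0);
}

int main() {
    const int nreg = 10;
    std::vector<Region> regions(nreg);
    for (int r = 0; r < nreg; ++r)
        regions[r] = { r / double(nreg), (r + 1) / double(nreg), { (r + 0.5) / nreg } };

    for (int sweep = 0; sweep < 1000; ++sweep)
        for (int r = 0; r < nreg; ++r) {            // segments are independent -> parallel
            if (regions[r].entry_points.empty()) continue;
            double x = regions[r].entry_points.back();
            regions[r].entry_points.pop_back();
            for (int step = 0; step < 100; ++step) {
                x = toy_step(x);
                if (x < regions[r].lo || x >= regions[r].hi) {   // crossed a boundary
                    int dst = (x < regions[r].lo) ? r - 1 : r + 1;
                    if (dst >= 0 && dst < nreg)
                        regions[dst].entry_points.push_back(x);  // seed the neighbor
                    break;
                }
            }
        }
    return 0;
}
```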


Journal of Chemical Theory and Computation | 2013

Reconsidering Dispersion Potentials: Reduced Cutoffs in Mesh-Based Ewald Solvers Can Be Faster Than Truncation.

Rolf E. Isele-Holder; Wayne Mitchell; Jeff R. Hammond; Axel Kohlmeyer; Ahmed E. Ismail

Long-range dispersion interactions have a critical influence on physical quantities in simulations of inhomogeneous systems. However, the perceived computational overhead of long-range solvers has until recently discouraged their implementation in molecular dynamics packages. Here, we demonstrate that reducing the cutoff radius for local interactions in the recently introduced particle-particle particle-mesh (PPPM) method for dispersion [Isele-Holder et al., J. Chem. Phys., 2012, 137, 174107] can actually often be faster than truncating dispersion interactions. In addition, because all long-range dispersion interactions are incorporated, physical inaccuracies that arise from truncating the potential can be avoided. Simulations using PPPM or other mesh Ewald solvers for dispersion can provide results more accurately and more efficiently than simulations that truncate dispersion interactions. The use of mesh-based approaches for dispersion is now a viable alternative for all simulations containing dispersion interactions and not merely those where inhomogeneities were motivating factors for their use. We provide a set of parameters for the dispersion PPPM method using either ik or analytic differentiation that we recommend for future use and demonstrate increased simulation efficiency by using the long-range dispersion solver in a series of performance tests on massively parallel computers.
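
For context, "mesh-based Ewald for dispersion" means splitting the r⁻⁶ interaction into a short-ranged real-space part evaluated within the cutoff r_c and a smooth remainder evaluated on the mesh. With splitting parameter β, the decomposition has roughly the form below; the exact expressions used by dispersion PPPM are given in the cited Isele-Holder et al. paper.

```latex
\frac{C_6}{r^{6}}
\;=\;
\underbrace{\frac{C_6}{r^{6}}
   \left(1 + \beta^{2}r^{2} + \tfrac{1}{2}\beta^{4}r^{4}\right)e^{-\beta^{2}r^{2}}}_{\text{real space, cut off at } r_c}
\;+\;
\underbrace{\frac{C_6}{r^{6}}
   \left[1 - \left(1 + \beta^{2}r^{2} + \tfrac{1}{2}\beta^{4}r^{4}\right)e^{-\beta^{2}r^{2}}\right]}_{\text{smooth remainder, handled on the mesh}}
```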


OpenSHMEM 2014: Proceedings of the First Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Tools (Volume 8356) | 2014

Implementing OpenSHMEM Using MPI-3 One-Sided Communication

Jeff R. Hammond; Sayan Ghosh; Barbara M. Chapman

This paper reports the design and implementation of OpenSHMEM over MPI using new one-sided communication features in MPI-3, which include not only new functions (e.g. remote atomics) but also a new memory model that is consistent with that of SHMEM. We use a new, non-collective MPI communicator creation routine to allow SHMEM collectives to use their MPI counterparts. Finally, we leverage MPI shared-memory windows within a node, which allows direct (load-store) access. Performance evaluations are conducted for shared-memory and InfiniBand conduits using microbenchmarks.
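
As an illustration of the mapping described above (a sketch in the same spirit, not the authors' implementation), a symmetric heap can be backed by an MPI-3 window with passive-target access, after which a SHMEM-style put becomes an MPI_Put followed by a flush:

```cpp
#include <mpi.h>

// Sketch: back a symmetric heap with an MPI-3 window and implement a blocking
// put on top of MPI_Put + MPI_Win_flush (passive-target RMA). Illustrative only.
static MPI_Win  sheap_win;
static void*    sheap_base;
static MPI_Aint sheap_size = 1 << 20;     // assumed 1 MiB symmetric heap

void sheap_init() {
    MPI_Win_allocate(sheap_size, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &sheap_base, &sheap_win);
    MPI_Win_lock_all(0, sheap_win);       // passive-target epoch on all ranks
}

// Analogue of shmem_putmem(dest, src, nbytes, pe): dest is a symmetric-heap
// offset valid on every rank, so it doubles as the target displacement.
void my_putmem(MPI_Aint dest_offset, const void* src, int nbytes, int pe) {
    MPI_Put(src, nbytes, MPI_BYTE, pe, dest_offset, nbytes, MPI_BYTE, sheap_win);
    MPI_Win_flush(pe, sheap_win);         // complete the operation remotely
}

void sheap_finalize() {
    MPI_Win_unlock_all(sheap_win);
    MPI_Win_free(&sheap_win);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    sheap_init();
    int me; MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (me == 0) {                        // rank 0 writes into rank 1's heap
        int value = 42;
        my_putmem(0, &value, sizeof(int), 1);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    sheap_finalize();
    MPI_Finalize();
    return 0;
}
```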


SIAM Journal on Scientific Computing | 2016

MADNESS: A Multiresolution, Adaptive Numerical Environment for Scientific Simulation

Robert J. Harrison; Gregory Beylkin; Florian A. Bischoff; Justus A. Calvin; George I. Fann; Jacob Fosso-Tande; Diego Galindo; Jeff R. Hammond; Rebecca Hartman-Baker; Judith C. Hill; Jun Jia; Jakob Siegfried Kottmann; M-J. Yvonne Ou; Junchen Pei; Laura E. Ratcliff; M. Reuter; Adam C. Richie-Halford; Nichols A. Romero; Hideo Sekino; W. A. Shelton; Bryan Sundahl; W. Scott Thornton; Edward F. Valeev; Alvaro Vazquez-Mayagoitia; Nicholas Vence; Takeshi Yanai; Yukina Yokoi

MADNESS (multiresolution adaptive numerical environment for scientific simulation) is a high-level software environment for solving integral and differential equations in many dimensions that uses adaptive and fast harmonic analysis methods with guaranteed precision that are based on multiresolution analysis and separated representations. Underpinning the numerical capabilities is a powerful petascale parallel programming environment that aims to increase both programmer productivity and code scalability. This paper describes the features and capabilities of MADNESS and briefly discusses some current applications in chemistry and several areas of physics.
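
To give a flavor of the "adaptive ... with guaranteed precision" idea, here is a generic one-dimensional cartoon (not MADNESS's multiwavelet machinery or its API): recursively bisect a box until a local error estimate for the target function meets the requested tolerance.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>
#include <utility>
#include <vector>

// Generic 1-D adaptive refinement cartoon: bisect [a,b] until a one-panel
// midpoint rule agrees with a two-panel midpoint rule to within tol, and
// record the leaf boxes. MADNESS uses multiwavelet bases in many dimensions,
// but the refine-until-the-local-error-is-small control flow is the same idea.
void refine(const std::function<double(double)>& f, double a, double b,
            double tol, std::vector<std::pair<double,double>>& leaves)
{
    const double h = b - a, mid = 0.5 * (a + b);
    const double coarse = h * f(mid);
    const double fine   = 0.5 * h * (f(0.5 * (a + mid)) + f(0.5 * (mid + b)));
    if (std::fabs(fine - coarse) < tol || h < 1e-6) {
        leaves.emplace_back(a, b);                 // accurate enough: keep this box
    } else {
        refine(f, a, mid, 0.5 * tol, leaves);      // split and share the tolerance
        refine(f, mid, b, 0.5 * tol, leaves);
    }
}

int main() {
    std::vector<std::pair<double,double>> leaves;
    refine([](double x) { return std::exp(-50.0 * x * x); }, -1.0, 1.0, 1e-6, leaves);
    std::printf("adaptive mesh has %zu boxes\n", leaves.size());
    return 0;
}
```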


EuroMPI'11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011

Noncollective communicator creation in MPI

James Dinan; Sriram Krishnamoorthy; Pavan Balaji; Jeff R. Hammond; Manojkumar Krishnan; Vinod Tipparaju; Abhinav Vishnu

MPI communicators abstract communication operations across application modules, facilitating seamless composition of different libraries. In addition, communicators provide the ability to form groups of processes and establish multiple levels of parallelism. Traditionally, communicators have been collectively created in the context of the parent communicator. The recent thrust toward systems at petascale and beyond has brought forth new application use cases, including fault tolerance and load balancing, that highlight the ability to construct an MPI communicator in the context of its new process group as a key capability. However, it has long been believed that MPI is not capable of allowing the user to form a new communicator in this way. We present a new algorithm that allows the user to create such flexible process groups using only the functionality given in the current MPI standard. We explore performance implications of this technique and demonstrate its utility for load balancing in the context of a Markov chain Monte Carlo computation. In comparison with a traditional collective approach, noncollective communicator creation enables a 30% improvement in execution time through asynchronous load balancing.
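
The capability argued for here was subsequently standardized in MPI-3 as MPI_Comm_create_group, which is collective only over the new group. A minimal usage sketch (illustrative; the paper's own algorithm builds equivalent functionality from pre-MPI-3 primitives):

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Sketch: form a communicator containing only the even-ranked processes,
// without requiring the odd ranks to participate in the creation call.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    if (rank % 2 == 0) {                              // only even ranks run this branch
        int neven = (size + 1) / 2;
        std::vector<int> evens(neven);
        for (int i = 0; i < neven; ++i) evens[i] = 2 * i;

        MPI_Group even_group;
        MPI_Group_incl(world_group, neven, evens.data(), &even_group);

        MPI_Comm even_comm;
        MPI_Comm_create_group(MPI_COMM_WORLD, even_group, /*tag=*/0, &even_comm);

        int newrank;
        MPI_Comm_rank(even_comm, &newrank);
        std::printf("world rank %d is rank %d in the even communicator\n", rank, newrank);

        MPI_Comm_free(&even_comm);
        MPI_Group_free(&even_group);
    }
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}
```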


Journal of Chemical Physics | 2015

Parallel scalability of Hartree–Fock calculations

Edmond Chow; Xing Liu; Mikhail Smelyanskiy; Jeff R. Hammond

Quantum chemistry is increasingly performed using large cluster computers consisting of multiple interconnected nodes. For a fixed molecular problem, the efficiency of a calculation usually decreases as more nodes are used, due to the cost of communication between the nodes. This paper empirically investigates the parallel scalability of Hartree-Fock calculations. The construction of the Fock matrix and the density matrix calculation are analyzed separately. For the former, we use a parallelization of Fock matrix construction based on a static partitioning of work followed by a work stealing phase. For the latter, we use density matrix purification from the linear scaling methods literature, but without using sparsity. When using large numbers of nodes for moderately sized problems, density matrix computations are network-bandwidth bound, making purification methods potentially faster than eigendecomposition methods.
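
For readers unfamiliar with density matrix purification, the classic McWeeny iteration conveys the idea (the paper draws on purification schemes from the linear-scaling literature, so the exact variant may differ): starting from a matrix whose eigenvalues lie in [0, 1], a low-order matrix polynomial is applied repeatedly until the eigenvalues are driven to 0 or 1, producing the idempotent density matrix using only matrix multiplications, which map to distributed memory far more readily than an eigendecomposition.

```latex
D_{k+1} \;=\; 3\,D_k^{2} \;-\; 2\,D_k^{3},
\qquad
D_k \;\longrightarrow\; D \ \text{ as } k \to \infty, \quad D^{2} = D .
```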


International Conference on Supercomputing | 2013

Inspector/executor load balancing algorithms for block-sparse tensor contractions

David Ozog; Sameer Shende; Allen D. Malony; Jeff R. Hammond; James Dinan; Pavan Balaji

Developing effective yet scalable load-balancing methods for irregular computations is critical to the successful application of simulations in a variety of disciplines at petascale and beyond. This paper explores a set of static and dynamic scheduling algorithms for block-sparse tensor contractions within the NWChem computational chemistry code for different degrees of sparsity (and therefore load imbalance). In this particular application, a relatively large amount of task information can be obtained at minimal cost, which enables the use of static partitioning techniques that take the entire task list as input. However, fully static partitioning is incapable of dealing with dynamic variation of task costs, such as from transient network contention or operating system noise, so we also consider hybrid schemes that utilize dynamic scheduling within subgroups. These two schemes, which have not been previously implemented in NWChem or its proxies (i.e., quantum chemistry mini-apps), are compared to the original centralized dynamic load-balancing algorithm as well as an improved centralized scheme. In all cases, we separate the scheduling of tasks from the execution of tasks into an inspector phase and an executor phase. The impact of these methods upon the application is substantial on a large InfiniBand cluster: execution time is reduced by as much as 50% at scale. The technique is applicable to any scientific application requiring load balance where performance models or estimations of kernel execution times are available.
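
As a minimal illustration of the inspector/executor split with static partitioning (a generic sketch, not the NWChem implementation; the task costs and function names are invented), the inspector gathers estimated task costs and greedily assigns each task to the least-loaded process before any task executes; the executor phase then runs each process's list independently.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Inspector phase: given per-task cost estimates (e.g., from a performance
// model of the block-sparse contraction kernels), greedily assign each task
// to the currently least-loaded process (longest-processing-time heuristic).
// Executor phase (not shown): each process runs its assigned tasks with no
// further coordination, optionally falling back to dynamic scheduling.
std::vector<int> inspector_partition(const std::vector<double>& costs, int nprocs) {
    std::vector<int> order(costs.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return costs[a] > costs[b]; });   // largest first

    using Load = std::pair<double, int>;                            // (load, process)
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
    for (int p = 0; p < nprocs; ++p) heap.push({0.0, p});

    std::vector<int> owner(costs.size());
    for (int t : order) {
        auto [load, p] = heap.top(); heap.pop();
        owner[t] = p;
        heap.push({load + costs[t], p});
    }
    return owner;
}

int main() {
    std::vector<double> costs = {9.0, 1.5, 3.0, 7.0, 2.0, 4.5};     // invented estimates
    std::vector<int> owner = inspector_partition(costs, 2);
    for (size_t t = 0; t < costs.size(); ++t)
        std::printf("task %zu (cost %.1f) -> process %d\n", t, costs[t], owner[t]);
    return 0;
}
```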

Collaboration


Dive into Jeff R. Hammond's collaborations.

Top Co-Authors

Pavan Balaji

Argonne National Laboratory

Antonio J. Peña

Barcelona Supercomputing Center
