
Publication


Featured research published by M. Graham Lopez.


High Performance Interconnects | 2015

UCX: An Open Source Framework for HPC Network APIs and Beyond

Pavel Shamis; Manjunath Gorentla Venkata; M. Graham Lopez; Matthew B. Baker; Oscar R. Hernandez; Yossi Itigin; Mike Dubman; Gilad Shainer; Richard L. Graham; Liran Liss; Yiftah Shahar; Sreeram Potluri; Davide Rossetti; Donald Becker; Duncan Poole; Christopher Lamb; Sameer Kumar; Craig B. Stunkel; George Bosilca; Aurelien Bouteiller

This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as the Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O-bound applications. To evaluate the design, we implement the APIs and protocols and measure the performance of overhead-critical network primitives fundamental to implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. To the best of our knowledge, this is the highest bandwidth and message rate publicly reported for any network stack on this hardware.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Towards achieving performance portability using directives for accelerators

M. Graham Lopez; Verónica G. Vergara Larrea; Wayne Joubert; Oscar R. Hernandez; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

In this paper, we explore the performance portability of the directives provided by OpenMP 4 and OpenACC for programming various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are at moving codes between architectures, how much tuning might be required, and what lessons we can learn from this experience. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement these kernels using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including x86_64 with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, and an x86_64 host system with Intel Xeon Phi coprocessors. Finally, we explain the factors that affected performance portability, such as how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize for and target multiple platforms.


OpenSHMEM 2015: Revised Selected Papers of the Second Workshop on OpenSHMEM and Related Technologies: Experiences, Implementations, and Technologies, Volume 9397 | 2015

Exploring OpenSHMEM Model to Program GPU-based Extreme-Scale Systems

Sreeram Potluri; Davide Rossetti; Donald Becker; Duncan Poole; Manjunath Gorentla Venkata; Oscar R. Hernandez; Pavel Shamis; M. Graham Lopez; Matthew B. Baker; Wendy Poole

Extreme-scale systems with compute accelerators such as graphics processing units (GPUs) have become popular for executing scientific applications. These systems are typically programmed using MPI and CUDA (for NVIDIA-based GPUs). However, there are many drawbacks to the MPI+CUDA approach. The orchestration required between the compute and communication phases of the application's execution, and the constraint that communication can only be initiated from serial portions on the central processing unit (CPU), lead to scaling bottlenecks. To address these drawbacks, we explore the viability of using OpenSHMEM for programming these systems. In this paper, we first make a case for supporting GPU-initiated communication and for the suitability of the OpenSHMEM programming model. Second, we present NVSHMEM, a prototype implementation of the proposed programming approach; port the Stencil and Transpose benchmarks, which are representative of many scientific applications, from the MPI+CUDA model to OpenSHMEM; and evaluate the design and implementation of NVSHMEM. Finally, we discuss the opportunities and challenges of using OpenSHMEM to program these systems, and propose extensions to OpenSHMEM to achieve the full potential of this programming approach.


Journal of Physics: Condensed Matter | 2018

QMCPACK: An open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids

Jeongnim Kim; Andrew David Baczewski; Todd D Beaudet; Anouar Benali; M. Chandler Bennett; M. Berrill; N. S. Blunt; Edgar Josué Landinez Borda; Michele Casula; David M. Ceperley; Simone Chiesa; Bryan K. Clark; Raymond Clay; Kris T. Delaney; Mark Douglas Dewing; Kenneth Esler; Hongxia Hao; Olle Heinonen; Paul R. C. Kent; Jaron T. Krogel; Ilkka Kylänpää; Ying Wai Li; M. Graham Lopez; Ye Luo; Fionn D. Malone; Richard M. Martin; Amrita Mathuriya; Jeremy McMinis; Cody Melton; Lubos Mitas

QMCPACK is an open source quantum Monte Carlo package for ab initio electronic structure calculations. It supports calculations of metallic and insulating solids, molecules, atoms, and some model Hamiltonians. Implemented real-space quantum Monte Carlo algorithms include variational, diffusion, and reptation Monte Carlo. QMCPACK uses Slater-Jastrow type trial wavefunctions in conjunction with a sophisticated optimizer capable of optimizing tens of thousands of parameters. The orbital-space auxiliary-field quantum Monte Carlo method is also implemented, enabling cross validation between different highly accurate methods. The code is specifically optimized for calculations with large numbers of electrons on the latest high performance computing architectures, including multicore central processing unit (CPU) and graphics processing unit (GPU) systems. We detail the program's capabilities, outline its structure, and give examples of its use in current research calculations. The package is available at http://qmcpack.org.


Extreme Science and Engineering Discovery Environment | 2014

Large-scale Hydrodynamic Brownian Simulations on Multicore and GPU Architectures

M. Graham Lopez; Mitchel D. Horton; Edmond Chow

We present ongoing work to produce an implementation of Brownian dynamics simulation using a matrix-free method on hardware accelerators. This work describes the GPU acceleration of a smooth particle-mesh Ewald (SPME) algorithm, which is used for the main part of the computation and was previously ported to run on the Intel Xeon Phi.


Archive | 2014

Batch Matrix Exponentiation

M. Graham Lopez; Mitchel D. Horton

Matrix–matrix multiplication can be considered a linchpin of applied numerical dense linear algebra as the performance of many common dense linear algebra packages is closely tied to the performance of matrix–matrix multiplication. Batch matrix–matrix multiplication, the matrix–matrix multiplication of a large number of relatively small matrices, is a developing area within dense linear algebra and is relevant to various application areas such as phylogenetics, finite element modeling, image processing, fluid dynamics, and hydrodynamics. Using batch matrix–matrix multiplication as the foundation, we have developed an optimized batch matrix exponentiation algorithm in CUDA that outperforms cublasXgemmBatched for small square matrices. After introducing the original motivation for our problem, matrix exponentiation from the phylogenetics domain, we discuss our algorithm in the context of both cublasXgemmBatched, and two alternative GPU methods for the numerical computation of matrix exponentiation: Lagrange interpolation, and Newton interpolation. All comparisons are done on both the Fermi and the Kepler architectures.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

OpenACC 2.5 Validation Testsuite Targeting Multiple Architectures

Kyle Friedline; Sunita Chandrasekaran; M. Graham Lopez; Oscar R. Hernandez

Heterogeneous computing has emerged as a promising fit for scientific domains such as molecular dynamics simulations, bioinformatics, and weather prediction. Such a computing paradigm includes x86 processors coupled with GPUs, FPGAs, or DSPs, or a coprocessor paradigm that takes advantage of all the cores and caches on a single die, such as the Knights Landing. OpenACC, a high-level, directive-based parallel programming model, has emerged as a programming paradigm that can tackle the heterogeneity of these architectures. Large data-driven scientific codes are increasingly using OpenACC, which makes it essential to analyze the accuracy of OpenACC compilers as they are used to port code to various types of platforms. In response, we have been creating a validation suite to validate and verify implementations of OpenACC features in conformance with the specification. The validation suite also gives compiler developers a standard to test against, and it helps both users and compiler developers clarify the OpenACC specification. This testsuite has been integrated into the harness infrastructure of the Titan and Summitdev systems at Oak Ridge National Laboratory and is being used in production.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Using C++ AMP to Accelerate HPC Applications on Multiple Platforms

M. Graham Lopez; Christopher Bergstrom; Ying Wai Li; Wael R. Elwasif; Oscar R. Hernandez

Many high-end HPC systems support accelerators in their compute nodes to target a variety of workloads, including high-performance computing simulations, big data / data analytics codes, and visualization. To program both the CPU cores and attached accelerators, users now have multiple programming models available, such as CUDA, OpenMP 4, OpenACC, and C++14, but some of these models fall short in their support for C++ on accelerators because they can have difficulty supporting advanced C++ features, e.g., templates, class members, loops with iterators, lambdas, and deep copy. Usually, they either rely on unified memory, or the programming language is not aware of accelerators (e.g., C++14). In this paper, we explore a base-language solution called C++ Accelerated Massive Parallelism (AMP), which was developed by Microsoft and implemented by the PathScale ENZO compiler to program GPUs on a variety of HPC architectures, including OpenPOWER and Intel Xeon. We report some preliminary, in-progress results using C++ AMP to accelerate a matrix multiplication and a quantum Monte Carlo application kernel, examining its expressiveness and performance using NVIDIA GPUs and the PathScale ENZO compiler. We hope that this preliminary report will provide a data point to inform the functionality needed for future C++ standards to support accelerators with discrete memory spaces.


International Parallel and Distributed Processing Symposium | 2017

Enabling One-Sided Communication Semantics on ARM

Pavel Shamis; M. Graham Lopez; Gilad Shainer

In this paper, we present our work to enable optimized one-sided communication operations on the ARMv8 architecture using a high-performance InfiniBand network interconnect, as well as an evaluation of our implementation. For this study, we started with an OpenSHMEM implementation based on Open MPI/SHMEM and combined it with the UCX framework and the XPMEM kernel extension for shared memory communication. UCX is a unified communication abstraction that provides high-performance communication services over a variety of network interconnects and shared memory technologies. The UCX, XPMEM, and OpenSHMEM components were specially ported for this work in order to enable efficient access to shared memory and RDMA network capabilities on ARM. To the best of our knowledge, this is the first investigation of one-sided communication semantics and OpenSHMEM on the ARM architecture combined with a high-performance InfiniBand network and the XPMEM shared memory transport.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2017

Evaluation of Directive-based Performance Portable Programming Models

M. Graham Lopez; Verónica G. Vergara Larrea; Wayne Joubert; Oscar R. Hernandez; Azzam Haidar; Stanimire Tomov; Jack J. Dongarra

We present an extended exploration of the performance portability of the directives provided by OpenMP 4 and OpenACC for programming various types of node architectures with attached accelerators. To do this, we use examples of algorithms with varying computational intensities for our evaluation, as both compute and data access efficiency are important considerations for overall application performance. We implement the kernels of interest using various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on various platforms including x86_64 and POWER8 with attached NVIDIA GPUs, x86_64 multicores, self-hosted Intel Xeon Phi KNL, and an x86_64 host system with Intel Xeon Phi coprocessors. Furthermore, we present in detail the factors that affected performance portability, including how to pick the right programming model, its programming style, its availability on different platforms, and how well compilers can optimize for and target multiple platforms.

Collaboration


Dive into M. Graham Lopez's collaborations.

Top Co-Authors


Pavel Shamis

Oak Ridge National Laboratory


Oscar R. Hernandez

Oak Ridge National Laboratory


Neena Imam

Oak Ridge National Laboratory


Azzam Haidar

University of Tennessee
