Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where James Dinan is active.

Publication


Featured research published by James Dinan.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-based Systems

Ashwin M. Aji; James Dinan; Darius Buntinas; Pavan Balaji; Wu-chun Feng; Keith R. Bisset; Rajeev Thakur

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. We demonstrate the extensible design of MPI-ACC by using the popular CUDA and OpenCL accelerator programming interfaces. We examine the impact of MPI-ACC on communication performance and evaluate application-level benefits on a large-scale epidemiology simulation.
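
For context, a minimal sketch of the kind of end-to-end transfer this work targets: sending directly from GPU memory through MPI instead of staging through a host buffer by hand. The sketch simply passes a CUDA device pointer to MPI_Send, as a GPU-aware MPI library would accept; it is an illustration under that assumption, not MPI-ACC's actual interface for designating accelerator buffers.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Illustrative only: assumes an MPI library that accepts device pointers
     * directly (as MPI-ACC and later GPU-aware MPI implementations do). */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t n = 1 << 20;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));   /* buffer lives in GPU memory */

        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* device pointer passed directly */
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }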


Parallel Computing | 2015

Remote Memory Access Programming in MPI-3

Torsten Hoefler; James Dinan; Rajeev Thakur; Brian W. Barrett; Pavan Balaji; William Gropp; Keith D. Underwood

The Message Passing Interface (MPI) 3.0 standard, introduced in September 2012, includes a significant update to the one-sided communication interface, also known as remote memory access (RMA). In particular, the interface has been extended to better support popular one-sided and global-address-space parallel programming models, to provide better access to hardware performance features, and to enable new data-access modes. We present the new RMA interface and specify formal axiomatic models for data consistency and access semantics. Such models can help users reason about details of the semantics that are hard to extract from the English prose in the standard. They also foster the development of tools and compilers, enabling them to automatically analyze, optimize, and debug RMA programs.
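
A minimal sketch of the MPI-3 RMA interface discussed here, using passive-target synchronization: each process exposes a one-element window and writes into its right neighbor's memory with MPI_Put. The buffer size and the flush-based completion pattern are illustrative choices, not anything prescribed by the paper.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *base;
        MPI_Win win;
        /* MPI-3: allocate window memory and create the window in one call */
        MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        *base = 0.0;                              /* initialize my window memory */
        MPI_Barrier(MPI_COMM_WORLD);              /* everyone initialized before RMA begins */

        int right = (rank + 1) % size;
        double val = (double)rank;

        MPI_Win_lock_all(0, win);                 /* passive-target epoch on all ranks */
        MPI_Put(&val, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
        MPI_Win_flush(right, win);                /* complete the put at the target */
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }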


International Parallel and Distributed Processing Symposium | 2012

PARDA: A Fast Parallel Reuse Distance Analysis Algorithm

Qingpeng Niu; James Dinan; Qingda Lu; P. Sadayappan

Reuse distance is a well-established approach to characterizing data cache locality based on the stack histogram model. This analysis has so far been restricted to offline use due to its high cost, often several orders of magnitude larger than the execution time of the analyzed code. This paper presents the first parallel algorithm to compute accurate reuse distances by analysis of memory address traces. The algorithm uses a tunable parameter that enables faster analysis when the maximum needed reuse distance is limited by a cache size upper bound. Experimental evaluation using the SPEC CPU 2006 benchmark suite shows that, using 64 processors and a cache bound of 8 MB, it is possible to perform reuse distance analysis with full accuracy within a factor of 13 to 50 times the original execution times of the benchmarks.
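
To make the quantity being parallelized concrete, the sketch below computes exact reuse distances with a deliberately naive sequential scan: for each access, the number of distinct addresses touched since the previous access to the same address. PARDA's contribution is computing this in parallel over partitioned traces; this illustration does not attempt that and is only a reference for the definition.

    #include <stdio.h>
    #include <stddef.h>

    /* Naive reference for exact reuse distance: for each access, count the
     * distinct addresses seen since that address was last touched; first
     * accesses get distance -1 ("infinite"). Real tools use trees/hashing;
     * PARDA additionally parallelizes the computation. */
    static void reuse_distances(const unsigned long *trace, size_t n, long *dist)
    {
        for (size_t i = 0; i < n; i++) {
            long d = -1;
            for (size_t j = i; j-- > 0; ) {
                if (trace[j] == trace[i]) {
                    /* count distinct addresses strictly between j and i */
                    d = 0;
                    for (size_t k = j + 1; k < i; k++) {
                        int seen = 0;
                        for (size_t m = j + 1; m < k; m++)
                            if (trace[m] == trace[k]) { seen = 1; break; }
                        if (!seen) d++;
                    }
                    break;
                }
            }
            dist[i] = d;
        }
    }

    int main(void)
    {
        unsigned long trace[] = { 1, 2, 3, 2, 1 };
        long dist[5];
        reuse_distances(trace, 5, dist);
        for (int i = 0; i < 5; i++)
            printf("access %d: reuse distance %ld\n", i, dist[i]);
        return 0;
    }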


International Parallel and Distributed Processing Symposium | 2012

Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication

James Dinan; Pavan Balaji; Jeffrey R. Hammond; Sriram Krishnamoorthy; Vinod Tipparaju

The industry-standard Message Passing Interface (MPI) provides one-sided communication functionality and is available on virtually every parallel computing system. However, it is believed that MPI's one-sided model is not rich enough to support higher-level global address space parallel programming models. We present the first successful application of MPI one-sided communication as a runtime system for a PGAS model, Global Arrays (GA). This work has an immediate impact on users of GA applications, such as NWChem, who often must wait several months to a year or more before GA becomes available on a new architecture. We explore challenges present in the application of MPI-2 to PGAS models and motivate new features in the upcoming MPI-3 standard. The performance of our system is evaluated on several popular high-performance computing architectures through communication benchmarking and application benchmarking using the NWChem computational chemistry suite.
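
A minimal sketch of the layering idea: a PGAS-style one-sided get against a block-distributed "global array" can be expressed with an MPI window plus MPI_Get. The names ga_create/ga_get are hypothetical, the real GA/ARMCI runtime also handles strided patches, locks, and accumulate, and the sketch uses the MPI-3 passive-target calls for brevity even though the paper itself targeted MPI-2 and helped motivate these additions.

    #include <mpi.h>

    /* Sketch: each rank contributes one block of a "global array" via an MPI
     * window; ga_get fetches one element from whichever rank owns it. */
    #define BLOCK 1024

    static double *local_block;
    static MPI_Win  win;

    void ga_create(MPI_Comm comm)
    {
        MPI_Win_allocate(BLOCK * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, comm, &local_block, &win);
        MPI_Win_lock_all(0, win);   /* keep a passive-target epoch open for the array's lifetime */
    }

    double ga_get(long global_index)
    {
        int owner        = (int)(global_index / BLOCK);   /* block distribution */
        MPI_Aint offset  = (MPI_Aint)(global_index % BLOCK);
        double value;
        MPI_Get(&value, 1, MPI_DOUBLE, owner, offset, 1, MPI_DOUBLE, win);
        MPI_Win_flush(owner, win);                        /* complete before using the value */
        return value;
    }

    void ga_destroy(void)
    {
        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
    }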


Proceedings of the 20th European MPI Users' Group Meeting | 2013

Enabling MPI interoperability through flexible communication endpoints

James Dinan; Pavan Balaji; David Goodell; Douglas R. Miller; Marc Snir; Rajeev Thakur

The current MPI model defines a one-to-one relationship between MPI processes and MPI ranks. This model captures many use cases effectively, such as one MPI process per core and one MPI process per node. However, it limits interoperability between MPI and other programming models that use threads within a node. In this paper, we describe an extension to MPI that introduces communication endpoints as a means to relax the one-to-one relationship between processes and ranks. Endpoints enable a greater degree of interoperability between MPI and other programming models, and we illustrate their potential for additional performance and computation management benefits through the decoupling of ranks from processes.
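
A hypothetical usage sketch of the proposed interface, following the paper's description: a process requests several endpoints in a new communicator and hands one rank to each thread. The function name and signature below follow the proposal and are not part of the MPI standard; details may differ from the paper.

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical sketch of the proposed endpoints interface; the creation
     * call below is NOT standard MPI. Each process asks for NUM_EP endpoints,
     * and each OpenMP thread then drives communication through its own rank
     * in the resulting communicator. */
    #define NUM_EP 4

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        MPI_Comm ep_comm[NUM_EP];
        /* Proposed extension (per the paper): collectively creates a
         * communicator in which this process owns NUM_EP distinct ranks. */
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, NUM_EP, MPI_INFO_NULL, ep_comm);

        #pragma omp parallel num_threads(NUM_EP)
        {
            MPI_Comm my_comm = ep_comm[omp_get_thread_num()];
            int my_rank;
            MPI_Comm_rank(my_comm, &my_rank);   /* each thread sees its own rank */
            /* ... the thread issues sends/receives on my_comm independently ... */
        }
        /* Endpoint handle cleanup follows the proposal and is omitted here. */

        MPI_Finalize();
        return 0;
    }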


EuroMPI '12: Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface | 2012

Leveraging MPI's one-sided communication interface for shared-memory programming

Torsten Hoefler; James Dinan; Darius Buntinas; Pavan Balaji; Brian W. Barrett; Ron Brightwell; William Gropp; Vivek Kale; Rajeev Thakur

Hybrid parallel programming with MPI for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the upcoming MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% in the communication component of a five-point stencil solver.
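
A minimal sketch of the interface as it ended up in MPI-3: split off a node-local communicator, allocate a shared window, and query a neighbor's segment for direct load/store access. The ring exchange and buffer size are illustrative choices.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ranks that can share memory (same node) form their own communicator */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        /* Collectively allocate one shared segment; each rank contributes a slice */
        double *my_base;
        MPI_Win win;
        MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node_comm, &my_base, &win);

        /* Locate the left neighbor's slice for direct load/store access */
        int left = (node_rank + node_size - 1) % node_size;
        MPI_Aint seg_size;
        int disp_unit;
        double *left_base;
        MPI_Win_shared_query(win, left, &seg_size, &disp_unit, &left_base);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
        my_base[0] = (double)node_rank;     /* store into my own slice */
        MPI_Win_sync(win);                  /* make the store visible in the window */
        MPI_Barrier(node_comm);             /* wait until everyone has written */
        MPI_Win_sync(win);                  /* pick up the neighbors' stores */
        double neighbor_val = left_base[0]; /* plain load from the neighbor's slice */
        MPI_Win_unlock_all(win);
        (void)neighbor_val;

        MPI_Win_free(&win);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }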


International Conference on Cluster Computing | 2012

Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments

John Jenkins; James Dinan; Pavan Balaji; Nagiza F. Samatova; Rajeev Thakur

Lack of efficient and transparent interaction with GPU data in hybrid MPI+GPU environments challenges GPU acceleration of large-scale scientific computations. A particular challenge is the transfer of noncontiguous data to and from GPU memory. MPI implementations currently do not provide an efficient means of utilizing data types for noncontiguous communication of data in GPU memory. To address this gap, we present an MPI datatype-processing system capable of efficiently processing arbitrary data types directly on the GPU. We present a means for converting conventional data type representations into a GPU-amenable format. Fine-grained, element-level parallelism is then utilized by a GPU kernel to perform in-device packing and unpacking of noncontiguous elements. We demonstrate a several-fold performance improvement for noncontiguous column vectors, 3D array slices, and 4D array subvolumes over CUDA-based alternatives. Compared with optimized, layout-specific implementations, our approach incurs low overhead, while enabling the packing of data types that do not have a direct CUDA equivalent. These improvements are demonstrated to translate to significant improvements in end-to-end, GPU-to-GPU communication time. In addition, we identify and evaluate communication patterns that may cause resource contention with packing operations, providing a baseline for adaptively selecting data-processing strategies.
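
To illustrate the element-level parallelism described above, here is a hedged CUDA sketch of in-device packing for the simplest noncontiguous case, a strided "column vector" layout. It is a hand-written kernel for one layout only; the paper's system instead derives the indexing from arbitrary MPI datatypes.

    #include <cuda_runtime.h>

    /* Illustrative pack kernel: one thread per element gathers a strided column
     * (an MPI_Type_vector-style layout) into a contiguous staging buffer that
     * can then be handed to the network. */
    __global__ void pack_column(const double *src, double *dst,
                                size_t count, size_t stride)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            dst[i] = src[i * stride];
    }

    /* Host-side launch helper: packs `count` elements spaced `stride` apart. */
    void pack_column_on_device(const double *d_src, double *d_dst,
                               size_t count, size_t stride)
    {
        const int threads = 256;
        int blocks = (int)((count + threads - 1) / threads);
        pack_column<<<blocks, threads>>>(d_src, d_dst, count, stride);
    }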


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Mitigating MPI Message Matching Misery

Mario Flajslik; James Dinan; Keith D. Underwood

To satisfy MPI ordering semantics in the presence of wildcards, current implementations store posted receive operations and unexpected messages in linked lists. As applications scale up, communication patterns that scale with the number of processes or the number of threads per process can cause those linked lists to grow and become a performance problem. We propose new structures and matching algorithms to address these performance challenges. Our scheme utilizes a hash map that is extended with message ordering annotations to significantly reduce time spent searching for matches in the posted receive and the unexpected message structures. At the same time, we maintain the required MPI ordering semantics, even in the presence of wildcards. We evaluate our approach on several benchmarks and demonstrate a significant reduction in the number of unsuccessful match attempts in the MPI message processing engine, while at the same time incurring low space and time overheads.
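
A hedged sketch of the data-structure idea, with hypothetical names rather than the paper's code: posted receives go into buckets keyed by (source, tag), wildcard receives go into a separate list, and every entry carries a monotonically increasing sequence number, so matching an incoming message means scanning at most two candidate lists and taking the oldest match, which preserves MPI's posting-order semantics.

    #include <mpi.h>
    #include <stdint.h>

    #define NBUCKETS 1024

    typedef struct recv_entry {
        int      source, tag;          /* may be MPI_ANY_SOURCE / MPI_ANY_TAG */
        uint64_t seq;                  /* posting order across all buckets */
        void    *request;              /* opaque handle to the receive request */
        struct recv_entry *next;
    } recv_entry;

    typedef struct {
        recv_entry *buckets[NBUCKETS]; /* exact (source, tag) chains */
        recv_entry *wildcards;         /* receives containing any wildcard */
        uint64_t    next_seq;
    } posted_recv_map;

    static unsigned bucket_of(int source, int tag)
    {
        return ((unsigned)source * 2654435761u ^ (unsigned)tag) % NBUCKETS;
    }

    /* Match an incoming (source, tag): the oldest eligible entry wins across
     * the exact bucket and the wildcard list; the caller unlinks it. */
    recv_entry *match(posted_recv_map *m, int source, int tag)
    {
        recv_entry *best = NULL;
        for (recv_entry *e = m->buckets[bucket_of(source, tag)]; e; e = e->next)
            if (e->source == source && e->tag == tag &&
                (!best || e->seq < best->seq))
                best = e;
        for (recv_entry *e = m->wildcards; e; e = e->next) {
            int src_ok = (e->source == MPI_ANY_SOURCE || e->source == source);
            int tag_ok = (e->tag == MPI_ANY_TAG || e->tag == tag);
            if (src_ok && tag_ok && (!best || e->seq < best->seq))
                best = e;
        }
        return best;
    }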


EuroMPI '11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011

Noncollective communicator creation in MPI

James Dinan; Sriram Krishnamoorthy; Pavan Balaji; Jeff R. Hammond; Manojkumar Krishnan; Vinod Tipparaju; Abhinav Vishnu

MPI communicators abstract communication operations across application modules, facilitating seamless composition of different libraries. In addition, communicators provide the ability to form groups of processes and establish multiple levels of parallelism. Traditionally, communicators have been created collectively in the context of the parent communicator. The recent thrust toward systems at petascale and beyond has brought forth new application use cases, including fault tolerance and load balancing, for which the ability to construct an MPI communicator in the context of its new process group is a key capability. However, it has long been believed that MPI is not capable of allowing the user to form a new communicator in this way. We present a new algorithm that allows the user to create such flexible process groups using only the functionality given in the current MPI standard. We explore performance implications of this technique and demonstrate its utility for load balancing in the context of a Markov chain Monte Carlo computation. In comparison with a traditional collective approach, noncollective communicator creation enables a 30% improvement in execution time through asynchronous load balancing.
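
For readers on MPI-3 or later: the same capability later appeared in the standard as MPI_Comm_create_group, which builds a communicator collectively only over the members of the new group; the paper constructs the equivalent from pre-MPI-3 building blocks. A minimal usage sketch of the standardized call, with an illustrative even-ranks group:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch: build a communicator over the even-ranked processes only, without
     * involving the odd ranks. MPI_Comm_create_group (MPI-3) is collective only
     * over the members of the group. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank % 2 == 0) {                      /* only the members participate */
            MPI_Group world_group, even_group;
            MPI_Comm_group(MPI_COMM_WORLD, &world_group);

            int nranks = (size + 1) / 2;
            int *ranks = malloc(nranks * sizeof(int));
            for (int i = 0; i < nranks; i++)
                ranks[i] = 2 * i;
            MPI_Group_incl(world_group, nranks, ranks, &even_group);

            MPI_Comm even_comm;
            MPI_Comm_create_group(MPI_COMM_WORLD, even_group, /*tag=*/0, &even_comm);

            /* ... use even_comm (e.g., for a rebalanced sub-computation) ... */
            MPI_Comm_free(&even_comm);
            MPI_Group_free(&even_group);
            MPI_Group_free(&world_group);
            free(ranks);
        }

        MPI_Finalize();
        return 0;
    }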


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Enabling communication concurrency through flexible MPI endpoints

James Dinan; Ryan E. Grant; Pavan Balaji; David Goodell; Douglas R. Miller; Marc Snir; Rajeev Thakur

MPI defines a one-to-one relationship between MPI processes and ranks. This model captures many use cases effectively; however, it also limits communication concurrency and interoperability between MPI and programming models that utilize threads. This paper describes the MPI endpoints extension, which relaxes the longstanding one-to-one relationship between MPI processes and ranks. Using endpoints, an MPI implementation can map separate communication contexts to threads, allowing them to drive communication independently. Endpoints also enable threads to be addressable in MPI operations, enhancing interoperability between MPI and other programming models. These characteristics are illustrated through several examples and an empirical study that contrasts current multithreaded communication performance with the need for high degrees of communication concurrency to achieve peak communication performance.
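
The empirical contrast mentioned above is with today's threaded MPI usage; the sketch below shows that baseline pattern in standard MPI (threads sharing one rank, each driving its own duplicated communicator) as a point of reference for what endpoints would add.

    #include <mpi.h>
    #include <omp.h>

    /* Baseline sketch (standard MPI, not the endpoints interface): under
     * MPI_THREAD_MULTIPLE each thread drives traffic on its own duplicated
     * communicator, so messages from different threads live in separate
     * contexts. All threads still share the process's single rank; endpoints
     * would additionally give each thread its own addressable rank. */
    #define NTHREADS 4

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        /* production code should verify provided == MPI_THREAD_MULTIPLE */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Comm per_thread[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)
            MPI_Comm_dup(MPI_COMM_WORLD, &per_thread[t]);   /* one context per thread */

        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            double buf = rank + 0.01 * t;
            /* ring exchange: each thread sends right / receives left on its own comm */
            MPI_Sendrecv_replace(&buf, 1, MPI_DOUBLE,
                                 (rank + 1) % size, t,
                                 (rank + size - 1) % size, t,
                                 per_thread[t], MPI_STATUS_IGNORE);
        }

        for (int t = 0; t < NTHREADS; t++)
            MPI_Comm_free(&per_thread[t]);
        MPI_Finalize();
        return 0;
    }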

Collaboration


Dive into James Dinan's collaboration.

Top Co-Authors

Pavan Balaji
Argonne National Laboratory

Darius Buntinas
Argonne National Laboratory

David Goodell
Argonne National Laboratory