Publication


Featured research published by Duane Merrill.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2012

Scalable GPU graph traversal

Duane Merrill; Michael Garland; Andrew S. Grimshaw

Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
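
To make the prefix-sum formulation concrete, here is a minimal CPU-side Python sketch (an illustration under simplifying assumptions, with illustrative names; not the authors' CUDA implementation) of level-synchronous BFS over a CSR graph: an exclusive prefix sum over frontier out-degrees gives every vertex a disjoint slot in a shared gather buffer, so each edge is touched once and total work stays proportional to |V| + |E|.

    from itertools import accumulate

    def bfs_prefix_sum(row_offsets, col_indices, source):
        """Level-synchronous BFS over a CSR graph; returns vertex depths."""
        n = len(row_offsets) - 1
        depth = [-1] * n
        depth[source] = 0
        frontier = [source]
        level = 0
        while frontier:
            # Out-degree of every frontier vertex.
            degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
            # Exclusive prefix sum: offsets[i] is where vertex i's neighbor
            # list begins in the shared gather buffer, so writers never collide.
            offsets = [0] + list(accumulate(degrees))
            gathered = [0] * offsets[-1]
            for i, v in enumerate(frontier):
                gathered[offsets[i]:offsets[i + 1]] = \
                    col_indices[row_offsets[v]:row_offsets[v + 1]]
            # Keep only first-time discoveries as the next frontier.
            level += 1
            frontier = []
            for u in gathered:
                if depth[u] == -1:
                    depth[u] = level
                    frontier.append(u)
        return depth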


International Conference on Parallel Architectures and Compilation Techniques | 2010

Revisiting sorting for GPGPU stream architectures

Duane Merrill; Andrew S. Grimshaw

This poster presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Compared to the state-of-the-art, our radix sorting methods exhibit speedup of at least 2x for all generations of NVIDIA GPGPUs, and up to 3.7x for current GT200-based models. Our implementations demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. These results motivate a different breed of parallel primitives for GPGPU stream architectures that can better exploit the memory and computational resources while maintaining the flexibility of a reusable component. Our sorting performance is derived from a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multi-scan).
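
As a rough illustration of how sorting can be built on prefix scans, the plain-Python sketch below (sequential, with illustrative names; not the paper's GPGPU stream code) performs least-significant-digit radix sort passes: each pass histograms the current 4-bit digit and scans the counts into stable scatter offsets.

    from itertools import accumulate

    RADIX_BITS = 4               # 4-bit digits -> 16 bins per pass
    RADIX = 1 << RADIX_BITS

    def radix_sort(keys, key_bits=32):
        """LSD radix sort of non-negative integer keys."""
        for shift in range(0, key_bits, RADIX_BITS):
            # Counting phase: histogram the digit at this place.
            counts = [0] * RADIX
            for k in keys:
                counts[(k >> shift) & (RADIX - 1)] += 1
            # Exclusive prefix scan turns counts into scatter offsets.
            offsets = [0] + list(accumulate(counts[:-1]))
            # Stable scatter into the output buffer for this digit place.
            out = [0] * len(keys)
            for k in keys:
                d = (k >> shift) & (RADIX - 1)
                out[offsets[d]] = k
                offsets[d] += 1
            keys = out
        return keys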


Parallel Computing | 2015

High-Performance and Scalable GPU Graph Traversal

Duane Merrill; Michael Garland; Andrew S. Grimshaw

Breadth-First Search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with nontrivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sum computations that achieves an asymptotically optimal O(|V| + |E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.


IEEE Computer | 2009

An Open Grid Services Architecture Primer

Andrew S. Grimshaw; Mark M. Morgan; Duane Merrill; Hiro Kishimoto; Andreas Savva; David Snelling; Chris Smith; Dave Berry

To expand the use of distributed computer infrastructures as well as facilitate grid interoperability, OGSA has developed standards and specifications that address a range of scenarios, including high-throughput computing, federated data management, and service mobility.


IEEE International Conference on High Performance Computing, Data and Analytics | 2016

Merge-based parallel sparse matrix-vector multiplication

Duane Merrill; Michael Garland

We present a strictly balanced method for the parallel computation of sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format without preprocessing, inspection, reformatting, or supplemental encoding. Regardless of nonzero structure, our equitable 2D merge-based decomposition tightly bounds the workload assigned to each processing element. Furthermore, our technique is suitable for recursively partitioning CSR datasets themselves into multi-scale, distributed, NUMA, and GPU environments that are constrained by fixed-size local memories. We evaluate our method on both CPU and GPU microarchitectures across a very large corpus of diverse sparse matrix datasets. We show that traditional CsrMV methods are inconsistent performers, often subject to order-of-magnitude performance variation across similarly-sized datasets. In comparison, our method provides predictable performance that is substantially uncorrelated to the distribution of nonzeros among rows and broadly improves upon that of current CsrMV methods.
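
The following sequential Python sketch illustrates the 2D merge-based decomposition under simplifying assumptions (merge_path_search and spmv_merge are illustrative names; this is not the paper's CPU/GPU code). The CSR row end-offsets are logically merged with the nonzero indices 0..nnz-1, and diagonal binary searches split that merge path into equal-size pieces, so each piece, whether it covers many short rows or a fragment of one long row, is assigned the same amount of work.

    def merge_path_search(diagonal, row_ends, nnz):
        """Binary search along a cross-diagonal of the logical merge grid:
        returns the (row, nonzero) coordinate where the merge path crosses."""
        lo = max(0, diagonal - nnz)
        hi = min(diagonal, len(row_ends))
        while lo < hi:
            mid = (lo + hi) // 2
            if row_ends[mid] <= diagonal - mid - 1:
                lo = mid + 1
            else:
                hi = mid
        return lo, diagonal - lo

    def spmv_merge(row_offsets, col_indices, values, x, num_pieces):
        n_rows, nnz = len(row_offsets) - 1, len(values)
        row_ends = row_offsets[1:]           # merged against nonzero indices
        items = n_rows + nnz                 # total merge items to divide up
        per_piece = -(-items // num_pieces)  # ceiling division
        y = [0.0] * n_rows
        for p in range(num_pieces):          # each piece stands in for one thread
            row, nz = merge_path_search(min(p * per_piece, items), row_ends, nnz)
            end_row, end_nz = merge_path_search(
                min((p + 1) * per_piece, items), row_ends, nnz)
            acc = 0.0
            while row < end_row:             # rows that end inside this piece
                while nz < row_ends[row]:
                    acc += values[nz] * x[col_indices[nz]]
                    nz += 1
                y[row] += acc
                acc = 0.0
                row += 1
            while nz < end_nz:               # partial row spilling into next piece
                acc += values[nz] * x[col_indices[nz]]
                nz += 1
            if row < n_rows:
                y[row] += acc                # fix-up: carry partial sum to its row
        return y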


International Parallel and Distributed Processing Symposium | 2015

Optimizing Sparse Matrix Operations on GPUs Using Merge Path

Steven Dalton; Sean Baxter; Duane Merrill; Luke N. Olson; Michael Garland

Irregular computations on large workloads are a necessity in many areas of computational science. Mapping these computations to modern parallel architectures, such as GPUs, is particularly challenging because performance often depends critically on the choice of data structure and algorithm. In this paper, we develop a parallel processing scheme, based on Merge Path partitioning, to compute segmented row-wise operations on sparse matrices that exposes parallelism at the granularity of individual nonzero entries. Our decomposition achieves competitive performance across many diverse problems while maintaining predictable behavior that depends only on the computational work, ameliorating the impact of irregularity. We evaluate the performance of three sparse kernels: SpMV, SpAdd, and SpGEMM. We show that our processing scheme for each kernel yields performance comparable to other schemes in many cases, and that our performance is highly correlated (a correlation coefficient of nearly 1) with the computational work, irrespective of the underlying structure of the matrices.


International Conference on Parallel Architectures and Compilation Techniques | 2015

Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures

Adam McLaughlin; Duane Merrill; Michael Garland; David A. Bader

Contemporary microprocessors use relaxed memory consistency models to allow for aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system's memory consistency model is an NP-complete problem. Several graph-based approximations to this problem, based on carefully constructed randomized test programs, have been proposed in the literature; however, such approaches are sequential and execute slowly on large graphs of interest. Unfortunately, the ability to execute larger tests is tremendously important, since such tests enable one to expose bugs more quickly. Successfully executing more tests per unit time is also desirable, since it allows one to check for a greater variety of errors in the memory subsystem by utilizing a more diverse set of tests. This paper improves upon existing work by introducing an algorithm that not only reduces the time complexity of the verification process, but also facilitates the development of parallel algorithms for solving these problems. We first show performance improvements from a sequential approach and gain further performance from parallel implementations in OpenMP and CUDA. For large tests of interest, our GPU implementation achieves an average application speedup of 26.36x over existing techniques in use at NVIDIA.
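
As a toy illustration of the graph-based formulation (illustrative names only; not NVIDIA's verification tooling), the sketch below encodes observed ordering constraints from a test execution as directed edges and reports a violation when the graph contains a cycle; Kahn's topological sort detects this in O(V + E) time, and it is analyses of this shape that the paper accelerates sequentially and in parallel.

    from collections import deque

    def is_consistent(num_ops, edges):
        """edges: (u, v) pairs meaning operation u must precede operation v."""
        succ = [[] for _ in range(num_ops)]
        indegree = [0] * num_ops
        for u, v in edges:
            succ[u].append(v)
            indegree[v] += 1
        # Kahn's topological sort; a cycle leaves some node unvisited.
        queue = deque(i for i in range(num_ops) if indegree[i] == 0)
        visited = 0
        while queue:
            u = queue.popleft()
            visited += 1
            for v in succ[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
        return visited == num_ops

    # Two operations each observed to precede the other: a cycle, so inconsistent.
    assert not is_consistent(2, [(0, 1), (1, 0)])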


High Performance Distributed Computing | 2006

Summary: Integration of Legacy Grid Systems with Emerging Grid Standards

Andrew S. Grimshaw; Woochul Kang; Duane Merrill; Mark M. Morgan

The Open Grid Services Architecture (OGSA) addresses the need for standardization of diverse grid services by defining a set of core capabilities and behaviors needed by loosely coupled, service-oriented grid architectures. These OGSA standards and interfaces are based on ubiquitous, platform-neutral technologies like SOAP, XML, and Web services. This paper briefly describes our OGSA proxy implementation that interconnects legacy Legion grids with grids that support the emerging OGSA specifications. Specifically, we have constructed proxy services that support OGSA-ByteIO, WS-Directory, and WS-Naming.


Parallel Processing Letters | 2011

High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing

Duane Merrill; Andrew S. Grimshaw


Archive | 2012

Parallel Scan for Stream Architectures

Andrew S. Grimshaw; Duane Merrill

Collaboration


Duane Merrill's top co-authors:

Adam McLaughlin

Georgia Institute of Technology


David A. Bader

Georgia Institute of Technology
