Publication


Featured research published by Richard W. Vuduc.


Conference on High Performance Computing (Supercomputing) | 2007

Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Samuel Williams; Leonid Oliker; Richard W. Vuduc; John Shalf; Katherine A. Yelick; James Demmel

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) - one of the most heavily used kernels in scientific computing - across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms.
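
For readers unfamiliar with the kernel discussed throughout this profile, a minimal serial CSR (compressed sparse row) SpMV in C looks as follows. This is an illustrative baseline, not code from the paper; the array names (rowptr, col, val) are conventional rather than taken from the authors' implementations.

```c
/* Minimal serial sparse matrix-vector multiply y = A*x in CSR format.
 * rowptr[i]..rowptr[i+1]-1 index the nonzeros of row i; col[k] and val[k]
 * hold the column index and value of the k-th stored nonzero. */
void csr_spmv(int m, const int *rowptr, const int *col, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```

The low arithmetic intensity of this loop (two flops per stored nonzero, with indirect access to x) is what makes SpMV memory-bound and so sensitive to the multicore memory-system tradeoffs the paper studies.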


Presented at: SciDAC 2005 Proceedings (Journal of Physics), San Francisco, CA, United States, Jun 26–30, 2005 | 2005

OSKI: A library of automatically tuned sparse matrix kernels

Richard W. Vuduc; James Demmel; Katherine A. Yelick

The Optimized Sparse Kernel Interface (OSKI) is a collection of low-level primitives that provide automatically tuned computational kernels on sparse matrices, for use by solver libraries and applications. These kernels include sparse matrix-vector multiply and sparse triangular solve, among others. The primary aim of this interface is to hide the complex decision-making process needed to tune the performance of a kernel implementation for a particular user's sparse matrix and machine, while also exposing the steps and potentially non-trivial costs of tuning at run-time. This paper provides an overview of OSKI, which is based on our research on automatically tuned sparse kernels for modern cache-based superscalar machines.
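
OSKI's tune-then-call workflow can be sketched roughly as below. The entry-point names (oski_CreateMatCSR, oski_SetHintMatMult, oski_TuneMat, oski_MatMult) are real OSKI routines, but the argument lists shown here are approximate recollections of the library's interface and should be checked against the OSKI user's guide; treat the sketch as illustrative, not authoritative.

```c
#include <oski/oski.h>

/* Sketch of OSKI's tune-then-call usage: create a tunable handle from CSR
 * arrays, declare the expected workload as a hint, let the library tune,
 * then perform y = alpha*A*x + beta*y with the tuned kernel.
 * Argument lists are approximate; consult the OSKI documentation. */
void oski_spmv_example(int m, int n, int *Aptr, int *Aind, double *Aval,
                       double *x, double *y)
{
    oski_Init();
    oski_matrix_t  A  = oski_CreateMatCSR(Aptr, Aind, Aval, m, n,
                                          SHARE_INPUTMAT, 0);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, m, STRIDE_UNIT);

    /* Hint that this SpMV will run many times, then let OSKI tune. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv, 500);
    oski_TuneMat(A);

    oski_MatMult(A, OP_NORMAL, 1.0, xv, 0.0, yv);

    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_DestroyMat(A);
    oski_Close();
}
```

The explicit hint and tuning calls are how the interface "exposes the steps and potentially non-trivial costs of tuning at run-time" while hiding the choice of data structure and kernel implementation.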


Conference on High Performance Computing (Supercomputing) | 2002

Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply

Richard W. Vuduc; James Demmel; Katherine A. Yelick; Shoaib Kamil; Rajesh Nishtala; Benjamin C. Lee

We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and a test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on a class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present a new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AᵀA by a vector).
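
Register blocking stores the matrix as small dense r×c blocks so the inner loop can keep c source-vector values and r destination values in registers. A minimal sketch for fixed 2×2 blocks (BCSR) follows; it is illustrative only and assumes, as the paper describes, that blocks are aligned to a fixed grid with explicit zeros filling partially full blocks.

```c
/* Sketch of 2x2 register-blocked SpMV (BCSR): brow_ptr indexes block rows,
 * bcol[k] is the starting column of block k, and bval stores each 2x2 block
 * contiguously in row-major order. Explicit zeros pad blocks that are not full. */
void bcsr_2x2_spmv(int mb, const int *brow_ptr, const int *bcol,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {          /* block row I covers rows 2I and 2I+1 */
        double y0 = 0.0, y1 = 0.0;          /* accumulators held in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I+1]; k++) {
            const double *b = &bval[4*k];
            double x0 = x[bcol[k]], x1 = x[bcol[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*I]     = y0;
        y[2*I + 1] = y1;
    }
}
```

The performance bounds in the paper are stated per block size (r, c): the fill-in of explicit zeros increases flops while the blocked layout reduces index storage and memory traffic, and the heuristic's job is to pick the (r, c) that balances the two.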


High Performance Computing and Communications | 2005

Fast sparse matrix-vector multiplication by exploiting variable block structure

Richard W. Vuduc; Hyun Jin Moon

We improve the performance of sparse matrix-vector multiplication (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this structure. We split the matrix, A, into a sum, A1 + A2 + ... + As, where each term is stored in a new data structure we refer to as unaligned block compressed sparse row (UBCSR) format. A classical approach that stores A in BCSR (block compressed sparse row) format can also reduce execution time, but the improvements may be limited because BCSR imposes an alignment of the matrix non-zeros that leads to extra work from filled-in zeros. Combining splitting with UBCSR reduces this extra work while retaining the generally lower memory bandwidth requirements and register-level tiling opportunities of BCSR. We show speedups can be as high as 2.1× over no blocking, and as high as 1.8× over BCSR as used in prior work on a set of application matrices. Even when performance does not improve significantly, split UBCSR usually reduces matrix storage.
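
The splitting idea can be pictured as follows. This is a simplified illustration, not the paper's actual UBCSR layout: each term of the split A = A1 + ... + As is a blocked format whose block rows carry their own starting row index, so blocks need not align to a fixed r×c grid, and the partial products are simply accumulated.

```c
/* Simplified stand-in for one UBCSR-like term of the split A = A1 + ... + As:
 * fixed 2x2 blocks, but each block row records its own starting row index
 * (row_start), so block rows need not begin at even row numbers.
 * Illustration of the idea only, not the exact format from the paper. */
typedef struct {
    int     num_block_rows;
    int    *row_start;   /* starting row of each block row             */
    int    *brow_ptr;    /* block-row pointers into bcol/bval           */
    int    *bcol;        /* starting column of each 2x2 block           */
    double *bval;        /* 2x2 blocks, row-major, 4 doubles per block  */
} ubcsr2x2_t;

/* y += A_s * x for one term; call once per term to accumulate the full product. */
void ubcsr2x2_spmv_acc(const ubcsr2x2_t *A, const double *x, double *y)
{
    for (int I = 0; I < A->num_block_rows; I++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = A->brow_ptr[I]; k < A->brow_ptr[I+1]; k++) {
            const double *b = &A->bval[4*k];
            double x0 = x[A->bcol[k]], x1 = x[A->bcol[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[A->row_start[I]]     += y0;
        y[A->row_start[I] + 1] += y1;
    }
}
```

Because each term only stores blocks that actually occur at its alignment, the split avoids most of the explicit zero fill that a single rigidly aligned BCSR matrix would require.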


Applicable Algebra in Engineering, Communication and Computing | 2007

When cache blocking of sparse matrix vector multiply works and why

Rajesh Nishtala; Richard W. Vuduc; James Demmel; Katherine A. Yelick

We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM×V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM×V faster.
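
Cache blocking in this setting partitions A into column bands so that the slice of the source vector touched by each band stays in cache across many rows. A rough sketch, storing each band as its own CSR matrix with band-local column indices (an illustrative simplification, not the paper's data structure):

```c
/* Sketch of cache-blocked SpMV: A is stored as nb column bands, each in CSR
 * with column indices local to the band, so processing one band reuses a
 * cache-resident slice of x across all rows. Illustrative only. */
typedef struct {
    int     col_offset;    /* first global column covered by this band */
    int    *rowptr, *col;  /* CSR arrays of the band, local column ids  */
    double *val;
} csr_band_t;

void cache_blocked_spmv(int m, int nb, const csr_band_t *band,
                        const double *x, double *y)
{
    for (int i = 0; i < m; i++) y[i] = 0.0;
    for (int b = 0; b < nb; b++) {
        const double *xb = x + band[b].col_offset;  /* slice of x meant to fit in cache */
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = band[b].rowptr[i]; k < band[b].rowptr[i+1]; k++)
                sum += band[b].val[k] * xb[band[b].col[k]];
            y[i] += sum;
        }
    }
}
```

The paper's models essentially ask when the extra index storage and repeated sweeps over y are outweighed by the improved reuse of x, which is why cache blocking helps only for certain matrix shapes and nonzero distributions.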


International Conference on Computational Science | 2001

Statistical Models for Automatic Performance Tuning

Richard W. Vuduc; James Demmel; Jeff A. Bilmes

Achieving peak performance from library subroutines usually requires extensive, machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate, at compile-time, by (1) generating a large number of possible implementations of a subroutine, and (2) selecting a fast implementation by an exhaustive, empirical search. This paper applies statistical techniques to exploit the large amount of performance data collected during the search. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations. We apply our methods to actual performance data collected by the PHiPAC tuning system for matrix multiply on a variety of hardware platforms.
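
The compile-time part of the approach can be sketched as a measure-and-stop loop. The stopping rule below (halt when W consecutive trials fail to improve the best observed rate by more than a small fraction eps) is a deliberately simplified stand-in for the statistical criterion developed in the paper, and benchmark_candidate is a hypothetical placeholder for the real build-and-time harness.

```c
#include <stdlib.h>

/* Placeholder for the real harness that compiles and times candidate i,
 * returning its performance (e.g., Mflop/s). Hypothetical for illustration. */
static double benchmark_candidate(int i)
{
    return (double)(i % 97);  /* dummy value so the sketch runs standalone */
}

/* Simplified early-stopping empirical search: sample candidates in random
 * order and stop once W consecutive trials fail to beat the best observed
 * performance by more than a fraction eps. A stand-in for the paper's
 * statistical stopping criterion, not the actual rule. */
int search_with_early_stop(int num_candidates, int W, double eps)
{
    int best = -1, since_improvement = 0;
    double best_perf = 0.0;
    for (int t = 0; t < num_candidates && since_improvement < W; t++) {
        int i = rand() % num_candidates;      /* random sampling order */
        double perf = benchmark_candidate(i);
        if (perf > best_perf * (1.0 + eps)) {
            best_perf = perf;
            best = i;
            since_improvement = 0;
        } else {
            since_improvement++;
        }
    }
    return best;  /* index of the best implementation found */
}
```

The run-time half of the paper then uses the collected (input size, performance) data to build decision rules that pick among the few surviving implementations when the actual problem size is known.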


Workshop on I/O in Parallel and Distributed Systems | 2006

Improving distributed memory applications testing by message perturbation

Richard W. Vuduc; Martin Schulz; Daniel J. Quinlan; Bronis R. de Supinski; Andreas Sæbjørnsen

We present initial work on perturbation techniques that cause the manifestation of timing-related bugs in distributed memory Message Passing Interface (MPI)-based applications. These techniques improve the coverage of possible message orderings in MPI applications that rely on nondeterministic point-to-point communication and work with small processor counts to alleviate the need to test at larger scales. Using carefully designed model problems, we show that these techniques aid testing for problems that are often not easily reproduced when running on small fractions of the machine. Our perturbation layer, JITTERBUG, builds on PnMPI, an extension of the MPI profiling interface that supports multiple layers of profiling libraries. We discuss how JITTERBUG complements existing MPI checking tools through the PnMPI framework. We present opportunities to build additional tools that statically analyze and directly transform the source code to support testing and debugging MPI applications at reduced scale.
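
The perturbation idea can be illustrated with the standard MPI profiling (PMPI) interface: intercept point-to-point receives and inject small random delays so that different message arrival orders are exercised across runs. This sketch is not the actual JITTERBUG layer, and the policy shown (a uniform random delay applied only to wildcard receives) is an assumption made for illustration.

```c
#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>

/* Minimal message-perturbation sketch via the MPI profiling interface:
 * intercept MPI_Recv, add a small random delay before wildcard receives so
 * that different message arrival orders manifest across test runs, then
 * forward to the real implementation through PMPI_Recv. Not the actual
 * JITTERBUG code; delay policy and bound are illustrative assumptions. */
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    if (source == MPI_ANY_SOURCE) {
        usleep(rand() % 1000);  /* up to ~1 ms; a real tool would make this configurable */
    }
    return PMPI_Recv(buf, count, datatype, source, tag, comm, status);
}
```

Linking such a wrapper as one layer in a PnMPI stack is what lets it coexist with other checking and profiling tools on the same run.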


International Conference on Engineering of Complex Computer Systems | 2007

Communicating Software Architecture using a Unified Single-View Visualization

Thomas Panas; Thomas Epperly; Daniel J. Quinlan; Andreas Sæbjørnsen; Richard W. Vuduc

Software is among the most complex human artifacts, and visualization is widely acknowledged as important to understanding software. In this paper, we consider the problem of understanding a software system's architecture through visualization. Whereas traditional visualizations use multiple stakeholder-specific views to present different kinds of task-specific information, we propose an additional visualization technique that unifies the presentation of various kinds of architecture-level information, thereby allowing a variety of stakeholders to quickly see and communicate current development, quality, and costs of a software system. For future empirical evaluation of multi-aspect, single-view architectural visualizations, we have implemented our idea in an existing visualization tool, Vizz3D. Our implementation includes techniques, such as the use of a city metaphor, that reduce visual complexity in order to support single-view visualizations of large-scale programs.


International Conference on Computational Science | 2003

Memory hierarchy optimizations and performance bounds for sparse AᵀAx

Richard W. Vuduc; Attila Gyulassy; James Demmel; Katherine A. Yelick

This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation, y = AᵀAx, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
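
The single-pass structure of the kernel follows from writing y = Aᵀ(Ax) row by row: for each sparse row a_i of A, form t = a_i·x and then accumulate y += t·a_i, so each row is read from memory only once. A minimal CSR sketch (illustrative, not the paper's tuned code, which would additionally apply register blocking):

```c
/* Compute y = A^T (A x) in one pass over a CSR matrix A (m rows, n columns):
 * each row a_i is read once, used for the dot product t = a_i . x, and then
 * reused immediately for the update y += t * a_i. */
void csr_atax(int m, int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++) y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            t += val[k] * x[col[k]];          /* t = a_i . x  */
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            y[col[k]] += t * val[k];          /* y += t * a_i */
    }
}
```

Compared with computing Ax and then Aᵀ(Ax) as two separate SpMVs, this fused form halves the traffic on A, which is where the reported speedups largely come from.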


International Workshop on Semantics, Applications, and Implementation of Program Generation | 2000

Code generators for automatic tuning of numerical kernels: Experiences with FFTW (position paper)

Richard W. Vuduc; James Demmel

Achieving peak performance in important numerical kernels such as dense matrix multiply or sparse matrix-vector multiplication usually requires extensive, machine-dependent tuning by hand. In response, a number of automatic tuning systems have been developed which typically operate by (1) generating multiple implementations of a kernel, and (2) empirically selecting an optimal implementation. One such system is FFTW (Fastest Fourier Transform in the West) for the discrete Fourier transform. In this paper, we review FFTW's inner workings with an emphasis on its code generator, and report on our empirical evaluation of the system on two different hardware and compiler platforms. We then describe a number of our own extensions to the FFTW code generator that compute efficient discrete cosine transforms and show promising speed-ups over a vendor-tuned library. We also comment on current opportunities to develop tuning systems in the spirit of FFTW for other widely-used kernels.

Collaboration


Dive into Richard W. Vuduc's collaborations.

Top Co-Authors

James Demmel | University of California
Katherine A. Yelick | Lawrence Berkeley National Laboratory
Daniel J. Quinlan | Lawrence Livermore National Laboratory
Thomas Panas | Lawrence Livermore National Laboratory
John Shalf | Lawrence Berkeley National Laboratory
Leonid Oliker | Lawrence Berkeley National Laboratory
Samuel Williams | Lawrence Berkeley National Laboratory