James Demmel
University of California, Berkeley
Publications
Featured research published by James Demmel.
Archive | 1997
L. S. Blackford; Jaeyoung Choi; Andrew J. Cleary; Eduardo F. D'Azevedo; James Demmel; Inderjit S. Dhillon; Jack J. Dongarra; Sven Hammarling; Greg Henry; Antoine Petitet; K. Stanley; David Walker; R. C. Whaley
Archive | 2000
James Demmel; Jack J. Dongarra; Axel Ruhe; Henk A. van der Vorst; Zhaojun Bai
Contents:
List of symbols and acronyms
List of iterative algorithm templates
List of direct algorithms
List of figures
List of tables
1: Introduction
2: A brief tour of Eigenproblems
3: An introduction to iterative projection methods
4: Hermitian Eigenvalue problems
5: Generalized Hermitian Eigenvalue problems
6: Singular Value Decomposition
7: Non-Hermitian Eigenvalue problems
8: Generalized Non-Hermitian Eigenvalue problems
9: Nonlinear Eigenvalue problems
10: Common issues
11: Preconditioning techniques
Appendix: Of things not treated
Bibliography
Index
Information Processing in Sensor Networks | 2007
Sukun Kim; Shamim N. Pakzad; David E. Culler; James Demmel; Gregory L. Fenves; Steve Glaser; Martin Turon
A Wireless Sensor Network (WSN) for Structural Health Monitoring (SHM) is designed, implemented, deployed, and tested on the 4,200 ft long main span and the south tower of the Golden Gate Bridge (GGB). Ambient structural vibrations are reliably measured at low cost and without interfering with the operation of the bridge. Requirements that SHM imposes on a WSN are identified, and new solutions to meet these requirements are proposed and implemented. In the GGB deployment, 64 nodes are distributed over the main span and the tower. They collect ambient vibrations synchronously at a 1 kHz rate, with less than 10 µs jitter and an accuracy of 30 µG. The sampled data is collected reliably over a 46-hop network, with a bandwidth of 441 B/s at the 46th hop. The collected data agrees with theoretical models and previous studies of the bridge. The deployment is the largest WSN for SHM.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2008
Vasily Volkov; James Demmel
We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR, and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers. We also exploit the heterogeneity of the system by computing on both the GPU and the CPU. This study includes detailed benchmarking of the GPU memory system that reveals the sizes and latencies of caches and the TLB. We also present a few algorithmic optimizations aimed at increasing parallelism and regularity in the problem, which yield slightly higher performance.
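The blocking idea can be illustrated, very loosely, on the CPU. The sketch below is a pure-Python tiled matrix multiply, not the authors' GPU kernel. The function name `gemm_tiled` and its `tile` parameter are hypothetical. Each tile of C is accumulated from matching tiles of A and B. This access pattern is what lets a real kernel keep its operands in fast memory, such as registers or on-chip shared memory on a GPU.

```python
def gemm_tiled(a, b, tile=2):
    """Return C = A B for dense matrices given as lists of rows, computed tile by tile.

    Hypothetical illustration of blocking; not the GPU kernel from the paper.
    """
    n, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must agree")
    m = len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Walk over tiles of C; for each, accumulate products of matching tiles of A and B.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Inner "micro-kernel": update one tile of C from one tile of A and one of B.
                for i in range(i0, min(i0 + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(gemm_tiled(A, B, tile=2))  # [[58.0, 64.0], [139.0, 154.0]]
```

Every tile size gives the same result here, because each entry of C still sums its products in the same order. In a tuned kernel the tile size is instead chosen to fit the memory hierarchy.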
SIAM Journal on Matrix Analysis and Applications | 1999
James Demmel; Stanley C. Eisenstat; John R. Gilbert; Xiaoye S. Li; Joseph W. H. Liu
We investigate several ways to improve the performance of sparse LU factorization with partial pivoting, as used to solve unsymmetric linear systems. We introduce the notion of unsymmetric supernodes to perform most of the numerical computation in dense matrix kernels. We introduce unsymmetric supernode-panel updates and two-dimensional data partitioning to better exploit the memory hierarchy. We use Gilbert and Peierls's depth-first search with Eisenstat and Liu's symmetric structural reductions to speed up symbolic factorization. We have developed a sparse LU code using all these ideas. We present experiments demonstrating that it is significantly faster than earlier partial pivoting codes. We also compare its performance with UMFPACK, which uses a multifrontal approach; our code is very competitive in time and storage requirements, especially for large problems.
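The depth-first search in the symbolic step rests on a graph fact. Consider solving a sparse lower-triangular system L x = b. The entries of x that can be nonzero are exactly the rows reachable from the nonzeros of b, following an edge j → i whenever L[i][j] ≠ 0. The minimal sketch below shows that idea only. It uses hypothetical helper names `reach` and `sparse_lower_solve` and stores L column by column in plain dicts. It omits supernodes, partial pivoting, and the symmetric structural reductions, and it is not code from the authors' package.

```python
def reach(cols, b_rows):
    """Rows of x that may be nonzero, returned in a valid (topological) solve order.

    cols[j] lists (i, L[i][j]) for the off-diagonal nonzeros i > j of column j;
    these are the graph edges j -> i.
    """
    visited, postorder = set(), []
    for start in b_rows:
        if start in visited:
            continue
        visited.add(start)
        # Iterative DFS: each stack entry holds a node and an iterator over its edges.
        stack = [(start, iter(cols.get(start, [])))]
        while stack:
            node, edges = stack[-1]
            for child, _ in edges:
                if child not in visited:
                    visited.add(child)
                    stack.append((child, iter(cols.get(child, []))))
                    break
            else:
                # All descendants are finished: record the node in postorder.
                stack.pop()
                postorder.append(node)
    # Reverse postorder puts every j before any i that depends on it.
    return postorder[::-1]


def sparse_lower_solve(cols, diag, b):
    """Solve L x = b, where b and the result are dicts {row: value} of nonzeros."""
    x = dict(b)
    for j in reach(cols, list(b)):
        x[j] = x.get(j, 0.0) / diag[j]
        for i, l_ij in cols.get(j, []):
            x[i] = x.get(i, 0.0) - l_ij * x[j]
    return x


if __name__ == "__main__":
    # L = [[2,0,0,0],[1,1,0,0],[0,0,1,0],[0,3,0,1]], stored by columns.
    cols = {0: [(1, 1.0)], 1: [(3, 3.0)], 2: [], 3: []}
    diag = [2.0, 1.0, 1.0, 1.0]
    print(reach(cols, [0]))                          # [0, 1, 3]  (row 2 is never touched)
    print(sparse_lower_solve(cols, diag, {0: 4.0}))  # {0: 2.0, 1: -2.0, 3: 6.0}
```

The work of the numeric loop is proportional to the nonzeros actually touched, not to the matrix dimension. This is why a reachability-based symbolic step pays off in sparse LU.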
Parallel Computing | 2009
Krste Asanovic; Rastislav Bodik; James Demmel; Tony M. Keaveny; Kurt Keutzer; John Kubiatowicz; Nelson Morgan; David A. Patterson; Koushik Sen; John Wawrzynek; David Wessel; Katherine A. Yelick
Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.
ACM Transactions on Mathematical Software | 2002
L. Susan Blackford; Antoine Petitet; Roldan Pozo; Karin A. Remington; R. Clint Whaley; James Demmel; Jack J. Dongarra; Iain S. Duff; Sven Hammarling; Greg Henry; Michael A. Heroux; Linda Kaufman; Andrew Lumsdaine
L. Susan Blackford, Myricom, Inc.
James Demmel, University of California, Berkeley
Jack Dongarra, The University of Tennessee
Iain Duff, Rutherford Appleton Laboratory and CERFACS
Sven Hammarling, Numerical Algorithms Group, Ltd.
Greg Henry, Intel Corporation
Michael Heroux, Sandia National Laboratories
Linda Kaufman, William Paterson University
Andrew Lumsdaine, Indiana University
Antoine Petitet, Sun Microsystems
Roldan Pozo, National Institute of Standards and Technology
Karin Remington, The Center for Advancement of Genomics
R. Clint Whaley, Florida State University
ACM Transactions on Mathematical Software | 2003
Xiaoye S. Li; James Demmel
We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with a focus on scalability issues, and demonstrate the software's parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns. This lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
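The following dense, pure-Python sketch makes the contrast with partial pivoting concrete. It performs elimination with a fixed pivot order: no rows are exchanged during factorization, so the positions that fill in depend only on the matrix structure and could be set up in advance. Two details are assumptions of this sketch rather than statements about SuperLU_DIST. First, pivots smaller than sqrt(eps)·‖A‖₁ are replaced by that threshold. Second, a few steps of iterative refinement are used to recover accuracy. All function names are hypothetical.

```python
import math
import sys


def lu_static(a):
    """Factor A = L U with the pivot order fixed in advance (no row interchanges).

    L (unit lower, stored below the diagonal) and U are packed into one matrix.
    ASSUMPTION for this sketch: pivots smaller than sqrt(eps) * ||A||_1 are
    replaced by that threshold (keeping their sign) instead of swapping rows.
    Returns (lu, number_of_perturbed_pivots).
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    norm1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    tiny = math.sqrt(sys.float_info.epsilon) * norm1
    perturbed = 0
    for k in range(n):
        if abs(lu[k][k]) < tiny:
            lu[k][k] = tiny if lu[k][k] >= 0.0 else -tiny
            perturbed += 1
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier l_ik
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perturbed


def lu_solve(lu, b):
    """Solve L U x = b using the packed factors from lu_static."""
    n = len(lu)
    x = list(b)
    for i in range(n):  # forward substitution with unit-diagonal L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def solve_with_refinement(a, b, steps=3):
    """Static-pivoting solve followed by a few steps of iterative refinement."""
    n = len(b)
    lu, _ = lu_static(a)
    x = lu_solve(lu, b)
    for _ in range(steps):
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve(lu, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    A = [[0.0, 1.0], [1.0, 1.0]]  # zero leading pivot
    b = [1.0, 2.0]
    print(solve_with_refinement(A, b))  # close to the exact solution [1.0, 1.0]
```

In the example, ordinary elimination without pivoting would divide by the zero leading pivot. Here that pivot is perturbed instead, and refinement drives the result toward the exact solution [1, 1].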
Presented at: SciDAC 2005 Proceedings (Journal of Physics), San Francisco, CA, United States, Jun 26 - Jun 30, 2005 | 2005
Richard W. Vuduc; James Demmel; Katherine A. Yelick
The Optimized Sparse Kernel Interface (OSKI) is a collection of low-level primitives that provide automatically tuned computational kernels on sparse matrices, for use by solver libraries and applications. These kernels include sparse matrix-vector multiply and sparse triangular solve, among others. The primary aim of this interface is to hide the complex decision-making process needed to tune the performance of a kernel implementation for a particular user's sparse matrix and machine, while also exposing the steps and potentially non-trivial costs of tuning at run-time. This paper provides an overview of OSKI, which is based on our research on automatically tuned sparse kernels for modern cache-based superscalar machines.
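The SpMV kernel and the idea of run-time tuning can be sketched together. In the sketch below, a matrix is stored in compressed sparse row (CSR) form, and two interchangeable SpMV kernels compute the same product. A hypothetical `tune` helper times each kernel on the user's matrix. It returns the fastest kernel along with how long tuning took, echoing the abstract's point that tuning has a visible run-time cost. This is not OSKI's interface (OSKI is a C library). The two kernel variants are simple stand-ins for the alternative implementations and data structures a real tuner would consider.

```python
import time


def dense_to_csr(a):
    """Convert a dense row-major matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def spmv_loop(csr, x):
    """y = A x with an explicit inner loop over each row's nonzeros."""
    indptr, indices, data = csr
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y


def spmv_zip(csr, x):
    """Same product, written with slices and a generator expression."""
    indptr, indices, data = csr
    return [
        sum((v * x[j] for v, j in zip(data[s:e], indices[s:e])), 0.0)
        for s, e in zip(indptr, indptr[1:])
    ]


def tune(csr, x, candidates, trials=5):
    """Time each candidate on this matrix; return (fastest kernel, total tuning seconds)."""
    start = time.perf_counter()
    best, best_time = None, float("inf")
    for kernel in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            kernel(csr, x)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best, best_time = kernel, elapsed
    return best, time.perf_counter() - start


if __name__ == "__main__":
    A = [[1.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 3.0, 4.0]]
    csr = dense_to_csr(A)
    x = [1.0, 1.0, 1.0]
    kernel, seconds = tune(csr, x, [spmv_loop, spmv_zip])
    print(kernel.__name__, kernel(csr, x))  # chosen name varies; product is [3.0, 0.0, 7.0]
```

Which kernel wins depends on the machine and the matrix. That dependence is the reason for tuning at run time rather than fixing one implementation in advance.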
Conference on High Performance Computing (Supercomputing) | 1990
E. Anderson; Z. Bai; Jack J. Dongarra; A. Greenbaum; A. McKenney; J. Du Croz; S. Hammarling; James Demmel; Christian H. Bischof; Danny C. Sorensen
The goal of the LAPACK project is to design and implement a portable linear algebra library for efficient use on a variety of high-performance computers. The library is based on the widely used LINPACK and EISPACK packages for solving linear equations, eigenvalue problems, and linear least-squares problems, but extends their functionality in a number of ways. The major methodology for making the algorithms run faster is to restructure them to perform block matrix operations (e.g., matrix-matrix multiplication) in their inner loops. These block operations may be optimized to exploit the memory hierarchy of a specific architecture. The LAPACK project is also working on new algorithms that yield higher relative accuracy for a variety of linear algebra problems.
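The block restructuring can be sketched with a right-looking blocked Cholesky factorization. Cholesky is chosen here because it needs no pivoting; LAPACK's LU routines do pivot. Each step does three things:

- It factors a small diagonal block.
- It solves for the panel beneath that block.
- It updates the trailing submatrix with a matrix-matrix product. Most of the arithmetic lands in this update once the block size nb is reasonably large.

This pure-Python version (hypothetical name `cholesky_blocked`) only shows the loop structure; it is not LAPACK code.

```python
import math


def cholesky_blocked(a, nb=2):
    """Return lower-triangular L with A = L L^T, using a right-looking blocked loop.

    Only the lower triangle of A is read. Raises ValueError if A is not
    symmetric positive definite.
    """
    n = len(a)
    fac = [list(map(float, row)) for row in a]  # working copy, overwritten with L
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1) Unblocked Cholesky of the diagonal block fac[k0:k1][k0:k1].
        for j in range(k0, k1):
            s = fac[j][j] - sum(fac[j][p] ** 2 for p in range(k0, j))
            if s <= 0.0:
                raise ValueError("matrix is not positive definite")
            fac[j][j] = math.sqrt(s)
            for i in range(j + 1, k1):
                fac[i][j] = (fac[i][j] - sum(fac[i][p] * fac[j][p] for p in range(k0, j))) / fac[j][j]
        # 2) Panel: triangular solve L21 = A21 * L11^{-T} for the rows below the block.
        for i in range(k1, n):
            for j in range(k0, k1):
                fac[i][j] = (fac[i][j] - sum(fac[i][p] * fac[j][p] for p in range(k0, j))) / fac[j][j]
        # 3) Trailing update A22 -= L21 * L21^T (lower triangle only): the
        #    matrix-matrix step that dominates the work for large n.
        for i in range(k1, n):
            for j in range(k1, i + 1):
                fac[i][j] -= sum(fac[i][p] * fac[j][p] for p in range(k0, k1))
    # Zero the strict upper triangle so the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            fac[i][j] = 0.0
    return fac


if __name__ == "__main__":
    A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(cholesky_blocked(A, nb=2))  # [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

With nb equal to n the loop reduces to the unblocked algorithm. Smaller blocks move more of the work into step 3, the kind of block operation the abstract says can be tuned to a specific memory hierarchy.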