Publications


Featured research published by Xiaoye S. Li.


SIAM Journal on Matrix Analysis and Applications | 1999

A Supernodal Approach to Sparse Partial Pivoting

James Demmel; Stanley C. Eisenstat; John R. Gilbert; Xiaoye S. Li; Joseph W. H. Liu

We investigate several ways to improve the performance of sparse LU factorization with partial pivoting, as used to solve unsymmetric linear systems. We introduce the notion of unsymmetric supernodes to perform most of the numerical computation in dense matrix kernels. We introduce unsymmetric supernode-panel updates and two-dimensional data partitioning to better exploit the memory hierarchy. We use Gilbert and Peierls's depth-first search with Eisenstat and Liu's symmetric structural reductions to speed up symbolic factorization. We have developed a sparse LU code using all these ideas. We present experiments demonstrating that it is significantly faster than earlier partial pivoting codes. We also compare its performance with UMFPACK, which uses a multifrontal approach; our code is very competitive in time and storage requirements, especially for large problems.
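
The performance gain comes from running most of the arithmetic inside dense blocks. As a rough illustration (not SuperLU's actual kernel; the names and column-major layout here are mine), the kind of dense LU factorization with partial pivoting applied within a supernode looks like this in C:

    #include <math.h>

    /* Dense, column-major LU with partial pivoting, factoring A in
       place (L below the diagonal with unit diagonal, U on and above).
       Returns 0 on success, k+1 if column k is exactly singular. */
    int lu_factor(int n, double *a, int *piv)
    {
        for (int k = 0; k < n; k++) {
            int p = k;                       /* partial pivoting: find */
            for (int i = k + 1; i < n; i++)  /* largest entry in col k */
                if (fabs(a[i + k*n]) > fabs(a[p + k*n])) p = i;
            piv[k] = p;
            if (a[p + k*n] == 0.0) return k + 1;
            if (p != k)                      /* swap rows k and p */
                for (int j = 0; j < n; j++) {
                    double t = a[k + j*n];
                    a[k + j*n] = a[p + j*n];
                    a[p + j*n] = t;
                }
            for (int i = k + 1; i < n; i++)  /* compute multipliers */
                a[i + k*n] /= a[k + k*n];
            for (int j = k + 1; j < n; j++)  /* rank-1 trailing update */
                for (int i = k + 1; i < n; i++)
                    a[i + j*n] -= a[i + k*n] * a[k + j*n];
        }
        return 0;
    }

In the supernodal code, the trailing update is organized into supernode-panel updates so that it executes as a dense kernel with good cache behavior.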


ACM Transactions on Mathematical Software | 2003

SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems

Xiaoye S. Li; James Demmel

We present the main algorithmic features in the software package SuperLU_DIST, a distributed-memory sparse direct solver for large sets of linear equations. We describe our parallelization strategies in detail, with a focus on scalability issues, and demonstrate the software's parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication patterns, which lets us exploit techniques used in parallel sparse Cholesky algorithms to better parallelize both LU decomposition and triangular solution on large-scale distributed machines.
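
To make the static pivoting idea concrete, here is a minimal dense sketch (my own simplification; the real solver works on sparse distributed data and first permutes and scales large entries onto the diagonal). Elimination proceeds with no row exchanges, and any pivot that is still tiny is replaced by sqrt(eps)*||A||, with the perturbation compensated afterwards by iterative refinement:

    #include <math.h>
    #include <float.h>

    /* No-exchange LU with pivot replacement, in place, column-major.
       Assumes the matrix was already permuted/scaled so that large
       entries sit on the diagonal; norm_a is a norm of A. */
    void lu_static_pivot(int n, double *a, double norm_a)
    {
        double tiny = sqrt(DBL_EPSILON) * norm_a;
        for (int k = 0; k < n; k++) {
            if (fabs(a[k + k*n]) < tiny)       /* bump a tiny pivot */
                a[k + k*n] = (a[k + k*n] < 0.0) ? -tiny : tiny;
            for (int i = k + 1; i < n; i++)
                a[i + k*n] /= a[k + k*n];
            for (int j = k + 1; j < n; j++)
                for (int i = k + 1; i < n; i++)
                    a[i + j*n] -= a[i + k*n] * a[k + j*n];
        }
    }

Because no row exchanges happen at run time, the nonzero structure of L and U and every communication pattern in the parallel code can be fixed before numerical factorization begins.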


SIAM Journal on Matrix Analysis and Applications | 1999

An Asynchronous Parallel Supernodal Algorithm for Sparse Gaussian Elimination

James Demmel; John R. Gilbert; Xiaoye S. Li

Although Gaussian elimination with partial pivoting is a robust algorithm to solve unsymmetric sparse linear systems of equations, it is difficult to implement efficiently on parallel machines because of its dynamic and somewhat unpredictable way of generating work and intermediate results at run time. In this paper, we present an efficient parallel algorithm that overcomes this difficulty. The high performance of our algorithm is achieved through (1) using a graph reduction technique and a supernode-panel computational kernel for high single-processor utilization, and (2) scheduling two types of parallel tasks for a high level of concurrency. One type of task is factoring independent panels in disjoint subtrees of the column elimination tree of A. The other is updating a panel with previously computed supernodes. A scheduler assigns tasks to free processors dynamically and facilitates a smooth transition between the two types of parallel tasks. No global synchronization is used in the algorithm. The algorithm is well suited for shared-memory multiprocessors (SMPs) with a modest number of processors. We demonstrate 4- to 7-fold speedups on a range of 8-processor SMPs, and more on larger SMPs. One realistic problem arising from a 3-D flow calculation achieves factorization rates of 1.0, 2.5, 0.8, and 0.8 gigaflops on the 12-processor Power Challenge, 8-processor Cray C90, 16-processor Cray J90, and 8-processor AlphaServer 8400, respectively.
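
The scheduling idea, free processors pulling whatever task is ready with no global barrier, can be sketched with a thread pool. This is a toy: the dependencies from the column elimination tree are omitted, whereas in the real algorithm a panel becomes ready only after the supernodes it depends on are finished.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NPANELS  8
    #define NTHREADS 4

    atomic_int next_panel = 0;   /* shared task counter, no global barrier */

    void factor_panel(int p) { printf("factoring panel %d\n", p); }

    void *worker(void *arg)
    {
        (void)arg;
        int p;
        /* Each free thread grabs the next panel; the real scheduler
           would hand out any *ready* task, panel factorization or
           panel update, from a work queue. */
        while ((p = atomic_fetch_add(&next_panel, 1)) < NPANELS)
            factor_panel(p);
        return NULL;
    }

    int main(void)
    {
        pthread_t th[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&th[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(th[i], NULL);
        return 0;
    }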


Lawrence Berkeley National Laboratory | 1997

SuperLU Users' Guide

James Demmel; John R. Gilbert; Xiaoye S. Li

This document describes a collection of three related ANSI C subroutine libraries for solving sparse linear systems of equations AX = B. Here A is a square, nonsingular, n x n sparse matrix, and X and B are dense n x nrhs matrices, where nrhs is the number of right-hand sides and solution vectors. Matrix A need not be symmetric or definite; indeed, SuperLU is particularly appropriate for matrices with very unsymmetric structure. All three libraries use variations of Gaussian elimination optimized to take advantage both of sparsity and of the computer architecture, in particular memory hierarchies (caches) and parallelism.
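
For flavor, solving a small system with the sequential library's simple driver looks roughly like the following. This sketch follows the modern SuperLU interface; call signatures have changed across releases, so treat the details as illustrative rather than definitive.

    #include <stdio.h>
    #include "slu_ddefs.h"   /* sequential SuperLU */

    /* Solve A x = b with the simple driver dgssv, where A is the 3x3
       tridiagonal matrix [4 1 0; 1 4 1; 0 1 4] in compressed-column form. */
    int main(void)
    {
        int m = 3, n = 3, nnz = 7, nrhs = 1, info, i;
        double av[] = {4,1, 1,4,1, 1,4};
        int    ri[] = {0,1, 0,1,2, 1,2}, cp[] = {0, 2, 5, 7};

        /* SuperLU keeps these arrays, so use its allocators. */
        double *a = doubleMalloc(nnz), *rhs = doubleMalloc(m);
        int *rowind = intMalloc(nnz), *colptr = intMalloc(n + 1);
        for (i = 0; i < nnz; i++) { a[i] = av[i]; rowind[i] = ri[i]; }
        for (i = 0; i <= n; i++) colptr[i] = cp[i];
        for (i = 0; i < m; i++) rhs[i] = 1.0;

        SuperMatrix A, L, U, B;
        dCreate_CompCol_Matrix(&A, m, n, nnz, a, rowind, colptr,
                               SLU_NC, SLU_D, SLU_GE);
        dCreate_Dense_Matrix(&B, m, nrhs, rhs, m, SLU_DN, SLU_D, SLU_GE);

        int *perm_c = intMalloc(n), *perm_r = intMalloc(m);
        superlu_options_t options;  set_default_options(&options);
        SuperLUStat_t stat;         StatInit(&stat);

        dgssv(&options, &A, perm_c, perm_r, &L, &U, &B, &stat, &info);

        if (info == 0) {            /* B now holds the solution X */
            double *x = (double *) ((DNformat *) B.Store)->nzval;
            for (i = 0; i < m; i++) printf("x[%d] = %g\n", i, x[i]);
        }
        StatFree(&stat);
        return info;
    }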


SIAM Journal on Matrix Analysis and Applications | 2009

Superfast Multifrontal Method for Large Structured Linear Systems of Equations

Jianlin Xia; Shivkumar Chandrasekaran; Ming Gu; Xiaoye S. Li

In this paper we develop a fast direct solver for large discretized linear systems using the supernodal multifrontal method together with low-rank approximations. For linear systems arising from certain partial differential equations such as elliptic equations, during the Gaussian elimination of the matrices with proper ordering, the fill-in has a low-rank property: all off-diagonal blocks have small numerical ranks with proper definition of off-diagonal blocks. Matrices with this low-rank property can be efficiently approximated with semiseparable structures called hierarchically semiseparable (HSS) representations. We reveal the above low-rank property by ordering the variables with nested dissection and eliminating them with the multifrontal method. All matrix operations in the multifrontal method are performed in HSS forms. We present efficient ways to organize the HSS structured operations along the elimination. Some fast HSS matrix operations using tree structures are proposed. This new structured multifrontal method has nearly linear complexity and a linear storage requirement. Thus, we call it a superfast multifrontal method. It is especially suitable for large sparse problems and also has natural adaptability to parallel computations and great potential to provide effective preconditioners. Numerical results demonstrate the efficiency.
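
A minimal sketch of what an HSS representation stores (the struct and field names are mine, though U, V, and B are the customary generator names): a binary tree whose leaves hold dense diagonal blocks and whose nodes hold low-rank generators, so the off-diagonal block coupling two sibling index sets is approximated as U1 * B12 * V2^T.

    /* Illustrative HSS tree node; sizes are per-node. */
    typedef struct HssNode {
        struct HssNode *left, *right;  /* NULL at leaves */
        int  size, rank;               /* block size, numerical rank */
        double *D;                     /* leaf only: dense diagonal block */
        double *U, *V;                 /* row/column bases, size x rank */
        double *B;                     /* coupling block to the sibling */
    } HssNode;

All multifrontal operations then touch only these small generators and leaf blocks, which is what brings the cost down to nearly linear when the off-diagonal ranks stay small.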


Numerical Linear Algebra With Applications | 2010

Fast algorithms for hierarchically semiseparable matrices

Jianlin Xia; Shivkumar Chandrasekaran; Ming Gu; Xiaoye S. Li

Semiseparable matrices and many other rank-structured matrices have been widely used in developing new fast matrix algorithms. In this paper, we generalize the hierarchically semiseparable (HSS) matrix representations and propose some fast algorithms for HSS matrices. We represent HSS matrices in terms of general binary HSS trees and use simplified postordering notation for HSS forms. Fast HSS algorithms, including new HSS structure generation and HSS form Cholesky factorization, are developed. Moreover, we provide a new linear-complexity explicit ULV factorization algorithm for symmetric positive definite HSS matrices with a low-rank property. The corresponding factors can be used to solve the HSS systems, also in linear complexity. Numerical examples demonstrate the efficiency of the algorithms. All these algorithms have nice data locality. They are useful in developing fast structured numerical methods for large discretized PDEs (such as elliptic equations), integral equations, eigenvalue problems, etc. Some applications are shown.
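
The simplified postordering amounts to a children-first traversal of the binary HSS tree, and the ULV factorization eliminates variables in exactly this order. A generic sketch, with the per-node linear algebra elided:

    #include <stdio.h>

    /* Minimal binary tree node; in an HSS code it would carry the
       generators and the reduced block passed up to the parent. */
    typedef struct Node { struct Node *left, *right; int id; } Node;

    void factor_postorder(Node *t)
    {
        if (!t) return;
        factor_postorder(t->left);
        factor_postorder(t->right);
        /* Both children are now reduced: compress this node with its
           U generator, eliminate what can be eliminated, and pass the
           small remaining block to the parent. */
        printf("eliminating node %d\n", t->id);
    }

With bounded ranks, each node costs O(1) work relative to its block size, so the whole traversal runs in linear time.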


Lawrence Berkeley National Laboratory | 2002

ARPREC: An arbitrary precision computation package

David H. Bailey; Yozo Hida; Xiaoye S. Li; Brandon Thompson

This paper describes a new software package for performing arithmetic with an arbitrarily high level of numeric precision. It is based on the earlier MPFUN package, enhanced with special IEEE floating-point numerical techniques and several new functions. The package is written in C++ for high performance and broad portability, and includes both C++ and Fortran-90 translation modules, so that conventional C++ and Fortran-90 programs can use it with only very minor changes. This paper includes a survey of some of the interesting applications of this package and its predecessors.
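
The core IEEE floating-point trick such packages build on is the error-free transformation: the exact sum of two doubles can be captured as a value plus a rounding-error term, so higher precisions can be represented as arrays of doubles. A small demonstration of the underlying technique (this is not ARPREC's actual API):

    #include <stdio.h>

    /* Knuth's two_sum: s + e equals a + b exactly, with s = fl(a + b)
       and e the rounding error.  Chains of such transformations are
       the building blocks of multi-word arithmetic. */
    static void two_sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double bb = *s - a;
        *e = (a - (*s - bb)) + (b - bb);
    }

    int main(void)
    {
        double s, e;
        two_sum(1.0, 1e-30, &s, &e);          /* 1e-30 is lost in s alone */
        printf("s = %.17g, e = %g\n", s, e);  /* e recovers 1e-30 */
        return 0;
    }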


Experimental Mathematics | 2005

A comparison of three high-precision quadrature schemes

David H. Bailey; Karthik Jeyabalan; Xiaoye S. Li

The authors have implemented three numerical quadrature schemes using the ARPREC arbitrary-precision software package. The objective here is a quadrature facility that can efficiently evaluate to very high precision a large class of integrals typical of those encountered in experimental mathematics, relying on a minimum of a priori information about the function to be integrated. Such a facility is useful, for example, to permit the experimental identification of definite integrals based on their numerical values. The performance and accuracy of the three schemes are compared using a suite of 15 integrals, ranging from continuous, well-behaved functions on finite intervals to functions with infinite derivatives and blow-up singularities at endpoints, as well as several integrals on an infinite interval. In results using 412-digit arithmetic, two of the programs achieve at least 400-digit accuracy on all problems except one highly oscillatory function on an infinite interval. Similar results were obtained using 1,012-digit arithmetic.
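
One scheme of this kind, tanh-sinh quadrature, is easy to sketch in ordinary double precision (the paper's implementations run in ARPREC at hundreds of digits; this toy version just shows the substitution x = tanh((pi/2) sinh t), which pushes the endpoints out and tames endpoint singularities):

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Trapezoidal sum in t after the tanh-sinh change of variable;
       integrates f over [-1, 1] with step h and 2*nmax+1 nodes. */
    static double tanh_sinh(double (*f)(double), double h, int nmax)
    {
        double sum = 0.0;
        for (int k = -nmax; k <= nmax; k++) {
            double t = k * h;
            double u = 0.5 * M_PI * sinh(t);
            double x = tanh(u);
            double w = 0.5 * M_PI * cosh(t) / (cosh(u) * cosh(u)); /* dx/dt */
            sum += w * f(x);
        }
        return h * sum;
    }

    static double f(double x) { return sqrt(1.0 - x * x); } /* = pi/2 */

    int main(void)
    {
        printf("%.15f vs %.15f\n", tanh_sinh(f, 1.0 / 64, 384), M_PI / 2);
        return 0;
    }

For well-behaved integrands, halving h roughly doubles the number of correct digits, which is what makes the scheme attractive at very high precision.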


ACM Transactions on Mathematical Software | 2006

Error bounds from extra-precise iterative refinement

James Demmel; Yozo Hida; William Kahan; Xiaoye S. Li; Sonil Mukherjee; E. Jason Riedy

We present the design and testing of an algorithm for iterative refinement of the solution of linear equations where the residual is computed with extra precision. This algorithm was originally proposed in 1948 and analyzed in the 1960s as a means to compute very accurate solutions to all but the most ill-conditioned linear systems. However, two obstacles have until now prevented its adoption in standard subroutine libraries like LAPACK: (1) there was no standard way to access the higher-precision arithmetic needed to compute residuals, and (2) it was unclear how to compute a reliable error bound for the computed solution. The completion of the new BLAS Technical Forum Standard has essentially removed the first obstacle. To overcome the second obstacle, we show how the application of iterative refinement can be used to compute an error bound in any norm at small cost, and use this to compute both an error bound in the usual infinity norm and a componentwise relative error bound. We report extensive test results on over 6.2 million matrices of dimensions 5, 10, 100, and 1000. As long as a normwise (componentwise) condition number computed by the algorithm is less than 1/(max{10, √n}·ϵw), the computed normwise (componentwise) error bound is at most 2·max{10, √n}·ϵw, and indeed bounds the true error. Here, n is the matrix dimension and ϵw = 2^(−24) is the working precision. Residuals were computed in double precision (53 bits of precision). In other words, the algorithm computed a tiny error bound at negligible extra cost for most linear systems. For worse-conditioned problems (which we can detect using condition estimation), we obtained small correct error bounds in over 90% of cases.
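
The refinement loop itself is short; what the paper adds is the analysis, stopping criteria, and reliable error bounds around it. A dense sketch with the residual accumulated in long double standing in for the extra precision (the paper uses the extended-precision BLAS, and a convergence test rather than the fixed step count used here):

    #include <stdlib.h>

    /* Iterative refinement for a dense n-by-n system A x = b
       (row-major A).  `solve` stands for reusing an existing
       working-precision LU factorization; it is a placeholder. */
    void refine(int n, const double *A, const double *b, double *x,
                void (*solve)(int n, double *rhs_in_sol_out), int steps)
    {
        double *dx = malloc(n * sizeof *dx);
        for (int it = 0; it < steps; it++) {
            for (int i = 0; i < n; i++) {
                long double r = b[i];            /* extra-precise residual */
                for (int j = 0; j < n; j++)
                    r -= (long double)A[i*n + j] * x[j];
                dx[i] = (double)r;
            }
            solve(n, dx);                        /* A dx = r, working precision */
            for (int i = 0; i < n; i++)
                x[i] += dx[i];                   /* correct the solution */
        }
        free(dx);
    }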


SIAM Review | 2002

Effects of Ordering Strategies and Programming Paradigms on Sparse Matrix Computations

Leonid Oliker; Xiaoye S. Li; Parry Husbands; Rupak Biswas

The conjugate gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. For systems that are ill conditioned, it is often necessary to use a preconditioning technique. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and ILU(0)-preconditioned CG (PCG) using different programming paradigms and architectures. Results show that, for this class of applications: ordering significantly improves overall performance on both distributed and distributed shared-memory systems; cache reuse may be more important than reducing communication; it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution; and a hybrid MPI + OpenMP paradigm increases programming complexity with little performance gain. A multithreaded implementation of CG on the Cray MTA does not require special ordering or partitioning to obtain high efficiency and scalability, giving it a distinct advantage for adaptive applications; however, it shows limited scalability for PCG due to a lack of thread-level parallelism.
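
The kernel under study is plain CG, whose dominant cost is the sparse matrix-vector product; ordering matters because the access pattern of that product, and hence cache reuse, is fixed by the chosen row/column ordering. A minimal unpreconditioned CG over a CSR matrix (a self-contained sketch, not the paper's code):

    #include <stdlib.h>
    #include <math.h>

    /* CG for a symmetric positive definite matrix in CSR form
       (rowptr, colind, val); solves A x = b, x initialized to 0. */
    void cg(int n, const int *rowptr, const int *colind, const double *val,
            const double *b, double *x, int maxit, double tol)
    {
        double *r = malloc(n * sizeof *r), *p = malloc(n * sizeof *p),
               *q = malloc(n * sizeof *q);
        double rho = 0.0;
        for (int i = 0; i < n; i++) {
            x[i] = 0.0; r[i] = b[i]; p[i] = r[i]; rho += r[i] * r[i];
        }
        for (int it = 0; it < maxit && sqrt(rho) > tol; it++) {
            double pq = 0.0;
            for (int i = 0; i < n; i++) {        /* q = A p (sparse matvec) */
                double s = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                    s += val[k] * p[colind[k]];
                q[i] = s; pq += p[i] * s;
            }
            double alpha = rho / pq, rho_new = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
                rho_new += r[i] * r[i];
            }
            double beta = rho_new / rho; rho = rho_new;
            for (int i = 0; i < n; i++)
                p[i] = r[i] + beta * p[i];       /* new search direction */
        }
        free(r); free(p); free(q);
    }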

Collaboration


Dive into Xiaoye S. Li's collaborations.

Top Co-Authors

James Demmel (University of California)
David H. Bailey (Lawrence Berkeley National Laboratory)
Yozo Hida (University of California)
Chao Yang (Lawrence Berkeley National Laboratory)
Ming Gu (University of California)
Weiguo Gao (Lawrence Berkeley National Laboratory)
Esmond G. Ng (Lawrence Berkeley National Laboratory)
François-Henry Rouet (Lawrence Berkeley National Laboratory)