Publication


Featured research published by Fred G. Gustavson.


Journal of Algorithms | 1980

Fast solution of Toeplitz systems of equations and computation of Padé approximants

Richard P. Brent; Fred G. Gustavson; David Y. Y. Yun

We present two new algorithms, ADT and MDT, for solving order-n Toeplitz systems of linear equations Tz = b in time O(n log² n) and space O(n). The fastest algorithms previously known, such as Trench's algorithm, require time Ω(n²) and require that all principal submatrices of T be nonsingular. Our algorithm ADT requires only that T be nonsingular. Both our algorithms for Toeplitz systems are derived from algorithms for computing entries in the Padé table for a given power series. We prove that entries in the Padé table can be computed by the extended Euclidean algorithm. We describe an algorithm EMGCD (Extended Middle Greatest Common Divisor) which is faster than the algorithm HGCD of Aho, Hopcroft, and Ullman, although both require time O(n log² n), and we generalize EMGCD to produce PRSDC (Polynomial Remainder Sequence Divide and Conquer), which produces any iterate in the PRS, not just the middle term, in time O(n log² n). Applying PRSDC to the polynomials U0(x) = x^(2n+1) and U1(x) = a0 + a1 x + … + a_{2n} x^(2n) gives algorithm AD (Anti-Diagonal), which computes any (m, p) entry along the antidiagonal m + p = 2n of the Padé table for U1 in time O(n log² n). Our other algorithm, MD (Main-Diagonal), computes any diagonal entry (n, n) in the Padé table for a normal power series, also in time O(n log² n). MD is related to Schönhage's fast continued-fraction algorithm. A Toeplitz matrix T is naturally associated with U1, and the (n, n) Padé approximation to U1 gives the first column of T⁻¹. We show how a formula due to Trench can be used to compute the solution z of Tz = b in time O(n log n) from the first row and column of T⁻¹. Thus, the Padé table algorithms AD and MD give O(n log² n) Toeplitz algorithms ADT and MDT. Trench's formula breaks down in certain degenerate cases, but in such cases a companion formula, the discrete analog of the Christoffel-Darboux formula, is valid and may be used to compute z in time O(n log² n) via the fast computation (by algorithm AD) of at most four Padé approximants. We also apply our results to obtain new complexity bounds for the solution of banded Toeplitz systems and for BCH decoding via Berlekamp's algorithm.
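As a point of reference for the problem these algorithms target, the short sketch below sets up a Toeplitz system Tz = b and solves it with SciPy's structured solver, a Levinson-type O(n²) method rather than the paper's O(n log² n) ADT/MDT algorithms, then checks the result against a dense solve. The sizes and random data are illustrative only.

    import numpy as np
    from scipy.linalg import solve_toeplitz, toeplitz

    np.random.seed(0)
    n = 6
    c = np.random.rand(n)                                  # first column of T
    r = np.concatenate(([c[0]], np.random.rand(n - 1)))    # first row of T
    b = np.random.rand(n)

    z_fast = solve_toeplitz((c, r), b)             # structured Levinson-type solve, O(n^2)
    z_dense = np.linalg.solve(toeplitz(c, r), b)   # dense reference solve, O(n^3)

    assert np.allclose(z_fast, z_dense)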


SIAM Review | 1984

Implementing linear algebra algorithms for dense matrices on a vector pipeline machine

Jack J. Dongarra; Fred G. Gustavson; Alan H. Karp

This paper examines common implementations of linear algebra algorithms such as matrix-vector multiplication, matrix-matrix multiplication, and the solution of linear equations. The different versions are compared for efficiency on a computer architecture that uses vector processing and pipelined instruction execution. By exploiting the advanced architectural features of such machines, one can usually achieve maximum performance, with substantial improvements in execution speed over conventional computers.
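The kernels the paper studies differ mainly in loop ordering, which determines whether the innermost operation is a dot product or a vector update. The sketch below (purely illustrative Python, not the paper's Fortran kernels; the function names are mine) contrasts the two forms for matrix-vector multiplication.

    import numpy as np

    def matvec_dot(A, x):
        # Row-oriented form: each y[i] is an inner (dot) product.
        m, _ = A.shape
        y = np.zeros(m)
        for i in range(m):
            y[i] = A[i, :] @ x
        return y

    def matvec_axpy(A, x):
        # Column-oriented form: y is accumulated from vector updates
        # y += x[j] * A[:, j], the kind of long-vector operation that
        # maps well onto a pipelined vector unit.
        m, n = A.shape
        y = np.zeros(m)
        for j in range(n):
            y += x[j] * A[:, j]
        return y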


IBM Journal of Research and Development | 1997

Recursion leads to automatic variable blocking for dense linear-algebra algorithms

Fred G. Gustavson

We describe some modifications of the LAPACK dense linear-algebra algorithms using recursion. Recursion leads to automatic variable blocking: LAPACK's level-2 versions transform into level-3 codes through recursion. The new recursive codes are written in FORTRAN 77, which does not support recursion as a language feature. Gaussian elimination with partial pivoting and Cholesky factorization are considered. Very clear algorithms emerge with the use of recursion. The recursive codes do exactly the same computation as the LAPACK codes, and a single recursive code replaces both the level-2 and level-3 versions of the corresponding LAPACK codes. We present an analysis of the recursive algorithms in terms of both FLOP count and storage usage. With recursion, the matrix operands are closer to square, and the total area of the submatrices used in the recursive algorithm is less than the total area used by the LAPACK level-3 right-/left-looking algorithms. We quantify this difference; we also quantify how the FLOPs are computed. We show that the algorithms exhibit high performance on RISC-type processors: except for small matrices, the recursive version outperforms the level-3 LAPACK versions of DGETRF and DPOTRF on an RS/6000 workstation, and against the level-2 versions the performance gain approaches a factor of 3. We also demonstrate that a change to the LAPACK DLASWP routine can improve the performance of both the recursive version and DGETRF by more than 15 percent.
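To make the automatic-variable-blocking idea concrete, here is a minimal recursive Cholesky sketch in Python/NumPy. It is not the paper's FORTRAN 77 code; the half-split and the base-case size nb are assumptions of the sketch, and the Schur-complement update is where the level-3 (matrix-matrix) work appears.

    import numpy as np
    from scipy.linalg import solve_triangular

    def rchol(A, nb=64):
        # Recursive blocked Cholesky: returns lower-triangular L with A = L @ L.T.
        n = A.shape[0]
        if n <= nb:
            return np.linalg.cholesky(A)               # base case: unblocked factorization
        k = n // 2
        L11 = rchol(A[:k, :k], nb)                     # factor the leading block recursively
        # A21 = L21 @ L11.T, so L21 follows from a triangular solve with L11
        L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T
        S = A[k:, k:] - L21 @ L21.T                    # Schur complement: level-3 (GEMM-like) work
        L22 = rchol(S, nb)
        L = np.zeros_like(A)
        L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22
        return L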


ACM Transactions on Mathematical Software | 1978

Two Fast Algorithms for Sparse Matrices: Multiplication and Permuted Transposition

Fred G. Gustavson

Let A and B be two sparse matrices whose orders are p by q and q by r. Their product C = AB requires N nontrivial multiplications, where 0 ≤ N ≤ pqr. The operation count of our algorithm is usually proportional to N; however, its worst case is O(p, r, N_A, N), where N_A is the number of elements in A. This algorithm can be used to assemble the sparse matrix arising from a finite element problem from the basic elements, using Σ_{g=1}^{m} [order(g)]² operations, where m is the total number of basic elements and order(g) is the order of the g-th element matrix. The concept of an unordered merge plays a key role in obtaining our fast multiplication algorithm. It forces us to accept an unordered sparse row-wise format as output for the product C. The permuted transposition algorithm computes (RA)^T in O(p, q, N_A) operations, where R is a permutation matrix. It also orders an unordered sparse row-wise representation. We can combine these algorithms to produce an O(M) algorithm to solve Ax = b, where M is the number of multiplications needed to factor A into LU.
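A small sketch of the row-wise product with an accumulator, the mechanism behind the unordered merge: each row of C is formed by scattering scaled rows of B into an accumulator, and the resulting row comes out with its column indices unordered, as in the paper. The list-of-(column, value) representation below is a stand-in for the paper's sparse row-wise format.

    def sparse_matmul(A_rows, B_rows):
        # Row-wise sparse product C = A * B, with each matrix given as a list of
        # rows, each row a list of (column, value) pairs.
        C_rows = []
        for a_row in A_rows:
            acc = {}                                   # accumulator for one row of C
            for j, a in a_row:                         # for each nonzero A[i, j] ...
                for k, b in B_rows[j]:                 # ... scatter a * B[j, :] into the accumulator
                    acc[k] = acc.get(k, 0.0) + a * b
            C_rows.append(list(acc.items()))           # column order within the row is arbitrary
        return C_rows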


ACM Transactions on Mathematical Software | 2001

FLAME: Formal Linear Algebra Methods Environment

John A. Gunnels; Fred G. Gustavson; Greg Henry; Robert A. van de Geijn

Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the production of an abundance of anfractuous code that is difficult to maintain and almost impossible to upgrade. Having struggled with these issues for more than a decade, we have concluded that a solution is to apply a technique from theoretical computer science, formal derivation, to the development of high-performance linear algebra libraries. We believe the resulting approach yields aesthetically pleasing, coherent code that greatly facilitates intelligent modularity and high performance while enhancing confidence in its correctness. Since the technique is language-independent, it lends itself equally well to a wide spectrum of programming languages (and paradigms), ranging from C and Fortran to C++ and Java. In this paper, we illustrate our observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures. This environment demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world. We present performance experiments on the Intel Pentium III processor that demonstrate that high performance can be attained by coding at a high level of abstraction.


SIAM Review | 2004

Recursive blocked algorithms and hybrid data structures for dense matrix library software

Erik Elmroth; Fred G. Gustavson; Isak Jonsson; Bo Kågström

Matrix computations are both fundamental and ubiquitous in computational science and its vast application areas. Along with the development of more advanced computer systems with complex memory hierarchies, there is a continuing demand for new algorithms and library software that efficiently utilize and adapt to new architectural features. This article reviews and details some of the recent advances made by applying the paradigm of recursion to dense matrix computations on today's memory-tiered computer systems. Recursion allows for efficient utilization of a memory hierarchy and generalizes existing fixed blocking by introducing automatic variable blocking that has the potential of matching every level of a deep memory hierarchy. Novel recursive blocked algorithms offer new ways to compute factorizations such as Cholesky and QR and to solve matrix equations. In fact, the whole gamut of existing dense linear algebra factorizations is beginning to be reexamined in view of the recursive paradigm. The use of recursion has led to new hybrid data structures and optimized superscalar kernels. The results we survey include new algorithms and library software implementations for level-3 kernels, matrix factorizations, and the solution of general systems of linear equations and several common matrix equations. The software implementations we survey are robust and show impressive performance on today's high-performance computing systems.
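One of the hybrid data structures in this line of work is square-block storage, in which each block is stored contiguously so that level-3 kernels operate on cache-resident tiles. The sketch below is a toy repacking routine under that general idea; the four-dimensional array layout and the block size nb are choices of the sketch, not the article's library format.

    import numpy as np

    def to_square_blocks(A, nb):
        # Repack a dense matrix into contiguous nb-by-nb square blocks,
        # zero-padding the edge blocks. blocks[i, j] is then a contiguous tile
        # that a level-3 kernel can process while it stays cache resident.
        m, n = A.shape
        bm, bn = -(-m // nb), -(-n // nb)              # number of block rows / columns
        blocks = np.zeros((bm, bn, nb, nb), dtype=A.dtype)
        for i in range(bm):
            for j in range(bn):
                tile = A[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]
                blocks[i, j, :tile.shape[0], :tile.shape[1]] = tile
        return blocks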


Journal of the ACM | 1970

Symbolic Generation of an Optimal Crout Algorithm for Sparse Systems of Linear Equations

Fred G. Gustavson; Werner Liniger; Ralph A. Willoughby

An efficient implementation of the Crout elimination method for solving large sparse systems of linear algebraic equations of arbitrary structure is described. A computer program, GNSO, generates by symbolic processing another program, SOLVE, which represents the optimal reduced Crout algorithm in the sense that only nonzero elements are stored and operated on. The method is particularly powerful when a system of fixed sparsity structure must be solved repeatedly with different numerical values. In practical examples, the execution of SOLVE was observed to be typically N times as fast as that of the full Crout algorithm, where N is the order of the system.
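The generate-once, solve-many idea can be illustrated with a two-pass toy: a symbolic pass that records, for a fixed sparsity pattern, exactly which nonzero updates an LU-style elimination will perform (including fill-in), and a numeric pass that replays those operations on concrete values. This is a Doolittle-style sketch without pivoting, not the paper's Crout variant, and GNSO actually emits a standalone program rather than an operation list.

    def symbolic_pass(pattern, n):
        # Given the nonzero pattern of an n-by-n matrix (a set of (i, j) pairs,
        # diagonal assumed nonzero), record the elimination's nonzero operations.
        # Fill-in entries are added to the pattern as they are discovered.
        pat = set(pattern)
        ops = []
        for k in range(n):
            rows = [i for i in range(k + 1, n) if (i, k) in pat]
            cols = [j for j in range(k + 1, n) if (k, j) in pat]
            for i in rows:
                ops.append(("div", i, k))           # A[i][k] /= A[k][k]
                for j in cols:
                    ops.append(("upd", i, j, k))    # A[i][j] -= A[i][k] * A[k][j]
                    pat.add((i, j))                 # record fill-in
        return ops

    def numeric_pass(A, ops):
        # Replay the recorded operations on a concrete matrix (list of lists);
        # on return A holds the LU factors, with unit-lower L below the diagonal.
        for op in ops:
            if op[0] == "div":
                _, i, k = op
                A[i][k] /= A[k][k]
            else:
                _, i, j, k = op
                A[i][j] -= A[i][k] * A[k][j]
        return A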


IBM Journal of Research and Development | 2000

Applying recursion to serial and parallel QR factorization leads to better performance

Erik Elmroth; Fred G. Gustavson

We present new recursive serial and parallel algorithms for QR factorization of an m by n matrix that improve performance. The recursion leads to automatic variable blocking, and it also replaces a level-2 part of a standard block algorithm with level-3 operations. However, there are significant additional costs for creating and performing the updates, which prohibit efficient use of the recursion for large n. We present a quantitative analysis of these extra costs. This analysis leads us to introduce a hybrid recursive algorithm that outperforms the LAPACK algorithm DGEQRF by about 20% for large square matrices and by up to almost a factor of 3 for tall thin matrices. Uniprocessor performance results are presented for two IBM RS/6000 SP nodes: a 120-MHz IBM POWER2 node and one processor of a four-way 332-MHz IBM PowerPC 604e SMP node. The hybrid recursive algorithm reaches more than 90% of the theoretical peak performance of the POWER2 node. Compared to standard block algorithms, the recursive approach also shows a significant advantage in the automatic tuning obtained from its automatic variable blocking. A successful parallel implementation, based on dynamic load balancing, on a four-way 332-MHz IBM PPC604e SMP node is presented; for two, three, and four processors it shows speedups of up to 1.97, 2.99, and 3.97.
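Below is a minimal recursive QR sketch in Python/NumPy showing the split/update/recurse structure: factor the left half of the columns, apply the transformation to the right half, then recurse on the trailing part. For brevity it forms Q explicitly, which is exactly the cost the paper avoids by keeping Q in compact factored form and by switching to a hybrid blocking; the base-case width nb is an assumption of the sketch.

    import numpy as np

    def rqr(A, nb=32):
        # Recursive QR for an m-by-n matrix with m >= n; returns explicit
        # Q (m-by-m) and R (m-by-n).
        m, n = A.shape
        if n <= nb:
            return np.linalg.qr(A, mode="complete")    # base case: unblocked factorization
        n1 = n // 2
        Q1, R1 = rqr(A[:, :n1], nb)                    # factor the left block of columns
        B = Q1.T @ A[:, n1:]                           # apply the transformation to the right block
        Q2, R2 = rqr(B[n1:, :], nb)                    # recurse on the trailing part
        Q = Q1.copy()
        Q[:, n1:] = Q1[:, n1:] @ Q2                    # Q = Q1 * diag(I, Q2)
        R = np.zeros((m, n))
        R[:n1, :n1] = R1[:n1, :n1]
        R[:n1, n1:] = B[:n1, :]
        R[n1:, n1:] = R2
        return Q, R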


Workshop on I/O in Parallel and Distributed Systems | 1996

The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations

Sivan Toledo; Fred G. Gustavson

SOLAR is a portable high-performance library for out-of-core dense matrix computations. It combines portability with high performance by using existing high-performance in-core subroutine libraries and an optimized matrix input-output library. SOLAR works on parallel computers, workstations, and personal computers. It supports in-core computations on both shared-memory and distributed-memory machines, and its matrix input-output library supports both conventional I/O interfaces and parallel I/O interfaces. This paper discusses the overall design of SOLAR, its interfaces, and the design of several important subroutines. Experimental results show that SOLAR can factor on a single workstation an out-of-core positive-definite symmetric matrix at a rate exceeding 215 Mflops, and an out-of-core general matrix at a rate exceeding 195 Mflops. Less than 16% of the running time is spent on I/O in these computations. These results indicate that SOLAR's portability does not compromise its performance. We expect that the combination of portability, modularity, and the use of a high-level I/O interface will make the library an important platform for research on out-of-core algorithms and on parallel I/O.
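As a generic illustration of the out-of-core style (not SOLAR's API, which this page does not reproduce), the sketch below computes C = A·B tile by tile with all three matrices kept on disk and only a few nb-by-nb tiles in memory at a time. The file paths, raw-binary layout, and tile size nb are all assumptions of the sketch.

    import numpy as np

    def out_of_core_matmul(a_path, b_path, c_path, n, nb=1024):
        # Tile-by-tile C = A * B for n-by-n float64 matrices stored as raw
        # binary files; roughly three nb-by-nb tiles are in core at a time.
        A = np.memmap(a_path, dtype=np.float64, mode="r", shape=(n, n))
        B = np.memmap(b_path, dtype=np.float64, mode="r", shape=(n, n))
        C = np.memmap(c_path, dtype=np.float64, mode="w+", shape=(n, n))
        for i in range(0, n, nb):
            for j in range(0, n, nb):
                acc = np.zeros((min(nb, n - i), min(nb, n - j)))
                for k in range(0, n, nb):
                    acc += A[i:i + nb, k:k + nb] @ B[k:k + nb, j:j + nb]   # in-core tile product
                C[i:i + nb, j:j + nb] = acc
        C.flush()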


Archive | 1972

Some Basic Techniques for Solving Sparse Systems of Linear Equations

Fred G. Gustavson

This paper is mainly tutorial in character, yet it also serves as a survey of some of the sparse matrix methods used for solving Ax = b, and in some sense the results are new.

Collaboration


Top co-authors of Fred G. Gustavson and their affiliations.

Jerzy Waśniewski (Technical University of Denmark)

Bjarne Stig Andersen (Danish Meteorological Institute)