Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where R.A. van de Geijn is active.

Publication


Featured research published by R.A. van de Geijn.


IEEE International Conference on High Performance Computing Data and Analytics | 1994

Interprocessor collective communication library (InterCom)

Mike Barnett; Lance Shuler; R.A. van de Geijn; Satya Gupta; David G. Payne; Jerrell Watts

We outline a unified approach for building a library of collective communication operations that performs well on a cross-section of problems encountered in real applications. The target architecture is a two-dimensional mesh with wormhole routing, but the techniques also apply to higher dimensional meshes and hypercubes. We stress a general approach, addressing the need for implementations that perform well for various vector sizes and grid dimensions, including non-power-of-two grids. This requires the development of general techniques for building hybrid algorithms. Finally, the approach also supports collective communication within a group of nodes, which is required by many scalable algorithms. Results from the Intel Paragon system are included.
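
One long-vector technique of the kind the library builds on is the scatter/allgather broadcast: the root first scatters the vector so that every node owns one piece, and an allgather then completes the broadcast. A minimal MPI sketch, assuming the vector length divides evenly among the nodes (the function name is ours, not InterCom's):

    #include <mpi.h>
    #include <stdlib.h>

    /* Broadcast buf (count doubles, valid on root) to all ranks by
     * scatter + allgather; a production version would pad or use
     * MPI_Scatterv/MPI_Allgatherv when count % p != 0. */
    void long_vector_bcast(double *buf, int count, int root, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        int chunk = count / p;               /* assumed to divide evenly */
        double *piece = malloc(chunk * sizeof(double));

        /* Root scatters the vector so each rank owns one chunk... */
        MPI_Scatter(buf, chunk, MPI_DOUBLE, piece, chunk, MPI_DOUBLE,
                    root, comm);
        /* ...then all ranks gather every chunk, completing the broadcast. */
        MPI_Allgather(piece, chunk, MPI_DOUBLE, buf, chunk, MPI_DOUBLE,
                      comm);

        free(piece);
    }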


International Parallel Processing Symposium | 1993

Global combine on mesh architectures with wormhole routing

Mike Barnett; R. Littlefield; David G. Payne; R.A. van de Geijn

Several algorithms are discussed for implementing global combine (summation) on distributed memory computers using a two-dimensional mesh interconnect with wormhole routing. These include algorithms that are asymptotically optimal for short vectors (O(log(p)) for p processing nodes) and for long vectors (O(n) for n data elements per node), as well as hybrid algorithms that are superior for intermediate n. Performance models are developed that include the effects of link conflicts and other characteristics of the underlying communication system. The models are validated using experimental data from the Intel Touchstone DELTA computer. Each of the combine algorithms is shown to be superior under some circumstances.
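
For short vectors, the O(log(p)) behavior comes from recursive doubling: at each step, every node exchanges its partial sums with a partner at doubling distance and accumulates. A minimal MPI sketch, assuming p is a power of two; the paper's hybrid algorithms handle general p and switch to O(n) schemes once n grows:

    #include <mpi.h>
    #include <stdlib.h>

    /* Recursive-doubling global sum: log2(p) exchanges of the full
     * n-element vector, so it wins when n is small. */
    void recursive_doubling_sum(double *x, int n, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);

        double *tmp = malloc(n * sizeof(double));
        for (int dist = 1; dist < p; dist <<= 1) {
            int partner = rank ^ dist;        /* hypercube-style pairing */
            MPI_Sendrecv(x,   n, MPI_DOUBLE, partner, 0,
                         tmp, n, MPI_DOUBLE, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            for (int i = 0; i < n; i++)
                x[i] += tmp[i];               /* fold in partner's partials */
        }
        free(tmp);
    }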


Conference on High Performance Computing (Supercomputing) | 1993

Distributed memory matrix-vector multiplication and conjugate gradient algorithms

J.G. Lewis; R.A. van de Geijn

The critical bottlenecks in implementing the conjugate gradient algorithm on distributed memory computers are the communication requirements of the sparse matrix-vector multiply and of the vector recurrences. We describe the data distribution and communication patterns of five general implementations; their realizations demonstrate that the cost of communication can be overcome to a much greater extent than is often assumed. The results also apply to more general settings for matrix-vector products, both sparse and dense.
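
To make the communication pattern concrete, consider the simplest such distribution: a block-row layout in which each node owns a strip of rows of A and the matching entries of x, so a single allgather supplies the full input vector before a purely local multiply. A minimal MPI sketch with hypothetical names, written dense where the paper's systems are sparse:

    #include <mpi.h>

    /* y = A*x with block-row distribution: each rank owns n_loc rows of
     * A (row-major, full width n) plus n_loc entries of x and y.
     * x_full is caller-provided scratch of length n = p * n_loc. */
    void dist_matvec(const double *A_loc, const double *x_loc,
                     double *y_loc, double *x_full,
                     int n_loc, int n, MPI_Comm comm)
    {
        /* The one communication step CG must amortize against flops. */
        MPI_Allgather(x_loc, n_loc, MPI_DOUBLE,
                      x_full, n_loc, MPI_DOUBLE, comm);

        for (int i = 0; i < n_loc; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A_loc[i * n + j] * x_full[j];
            y_loc[i] = sum;
        }
    }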


Computer Methods in Applied Mechanics and Engineering | 1997

A parallel multifrontal algorithm and its implementation

P. Geng; J.T. Oden; R.A. van de Geijn

In this paper, we describe a multifrontal method for solving sparse systems of linear equations arising in finite element and finite difference methods. The method proposed in this study combines nested dissection ordering with the frontal method. It can significantly reduce the storage and computational time required by conventional direct methods and is also a natural parallel algorithm. In addition, the method inherits major advantages of the frontal method, including a simple interface with finite element codes and an effective data structure, so that the entire computation is performed element by element on a series of small linear systems with dense stiffness matrices. The numerical implementation targets both distributed-memory machines and conventional sequential machines. Its performance is tested through a series of examples.
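
The ordering half of the method fits in a few lines: nested dissection orders each subdomain recursively and its separator last, which both limits fill and exposes the independent subtrees that make the algorithm naturally parallel. A minimal sketch on a 1-D chain of unknowns, with the frontal elimination itself omitted:

    /* Fill order[] with an elimination sequence for unknowns lo..hi of a
     * 1-D chain: each half of the domain is ordered before the single
     * separator vertex that splits it.  Call with *next == 0. */
    void nested_dissection(int lo, int hi, int *order, int *next)
    {
        if (lo > hi) return;
        int sep = lo + (hi - lo) / 2;                /* separator vertex */
        nested_dissection(lo, sep - 1, order, next); /* left subdomain   */
        nested_dissection(sep + 1, hi, order, next); /* right subdomain  */
        order[(*next)++] = sep;                      /* separator last   */
    }

For seven unknowns this yields the order 0, 2, 1, 4, 6, 5, 3: leaves first, the root separator last.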


International Conference on Cluster Computing | 2004

On optimizing collective communication

Ernie Chan; Marcel Heimlich; Avi Purkayastha; R.A. van de Geijn

We discuss issues related to the high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that improve performance in most situations relative to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Pentium 4 processor cluster are included.
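
Pipelining is one of the older techniques in that toolbox: a long-vector broadcast can stream fixed-size segments down a logical chain so that the transfer of one segment overlaps the forwarding of the previous one. A minimal MPI sketch, assuming rank 0 is the root and the segment size divides the vector length; the paper's implementations combine such schedules with MPI's send modes and other refinements:

    #include <mpi.h>

    /* Chain-pipelined broadcast of count doubles from rank 0, sent in
     * segments of seg elements (seg assumed to divide count). */
    void chain_pipelined_bcast(double *buf, int count, int seg,
                               MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        if (p == 1) return;

        for (int off = 0; off < count; off += seg) {
            if (rank > 0)          /* receive segment from predecessor  */
                MPI_Recv(buf + off, seg, MPI_DOUBLE, rank - 1, 0,
                         comm, MPI_STATUS_IGNORE);
            if (rank < p - 1)      /* forward it while the next arrives */
                MPI_Send(buf + off, seg, MPI_DOUBLE, rank + 1, 0, comm);
        }
    }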


Parallel, Distributed and Network-Based Processing | 2008

Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures

Gregorio Quintana-Ortí; Enrique S. Quintana-Ortí; Ernie Chan; R.A. van de Geijn; F.G. Van Zee

This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data and, likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK, but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system uses FLASH to assemble and represent matrices and also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors and an SMP architecture with 16 cores.
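
The loop nest below sketches what an algorithm-by-blocks looks like as a task generator: every kernel invocation touches a few tiles, and the order in which tiles are produced and consumed defines the dependency graph that a runtime such as SuperMatrix can execute out of order. The kernel names are hypothetical stand-ins that simply print the task sequence:

    #include <stdio.h>

    /* Stand-in kernel: in a real system this call would be enqueued as
     * a task and run once the tiles it reads have been produced. */
    static void task(const char *kernel, int i, int j)
    { printf("%s on tile (%d,%d)\n", kernel, i, j); }

    /* Task generation for a tiled QR factorization of a b x b tile grid. */
    void tiled_qr_tasks(int b)
    {
        for (int k = 0; k < b; k++) {
            task("FACTOR_DIAG", k, k);         /* QR of diagonal tile      */
            for (int j = k + 1; j < b; j++)
                task("APPLY_QT", k, j);        /* update row-k tiles       */
            for (int i = k + 1; i < b; i++) {
                task("ANNIHILATE", i, k);      /* zero tile (i,k)          */
                for (int j = k + 1; j < b; j++)
                    task("PAIR_UPDATE", i, j); /* update tiles (k,j),(i,j) */
            }
        }
    }

    int main(void) { tiled_qr_tasks(3); return 0; }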


Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing | 1998

A flexible class of parallel matrix multiplication algorithms

John A. Gunnels; Calvin Lin; Greg Morrow; R.A. van de Geijn

This paper explains why the parallel implementation of matrix multiplication, a seemingly simple algorithm that can be expressed as one statement and three nested loops, is complex. Practical algorithms that use matrix multiplication tend to operate on matrices of disparate shapes, and the shape of the matrices can significantly impact the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. While the paper resolves a number of issues, it concludes with a discussion of directions yet to be pursued.
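
A representative member of such a class is a SUMMA-style algorithm on a square process grid: at step k, the k-th grid column broadcasts its block of A along process rows, the k-th grid row broadcasts its block of B along process columns, and every node accumulates a local product. A minimal MPI sketch for square m x m local blocks; the paper's point is precisely that one fixed shape like this does not fit all cases:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* C_loc += A*B on a grid x grid process mesh, one block per step. */
    void summa_square(const double *A_loc, const double *B_loc,
                      double *C_loc, int m, int grid, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int my_row = rank / grid, my_col = rank % grid;

        MPI_Comm row_comm, col_comm;    /* communicators along mesh axes */
        MPI_Comm_split(comm, my_row, my_col, &row_comm);
        MPI_Comm_split(comm, my_col, my_row, &col_comm);

        double *Ak = malloc(m * m * sizeof(double));
        double *Bk = malloc(m * m * sizeof(double));

        for (int k = 0; k < grid; k++) {
            /* grid column k broadcasts its A block along process rows  */
            if (my_col == k) memcpy(Ak, A_loc, m * m * sizeof(double));
            MPI_Bcast(Ak, m * m, MPI_DOUBLE, k, row_comm);
            /* grid row k broadcasts its B block along process columns  */
            if (my_row == k) memcpy(Bk, B_loc, m * m * sizeof(double));
            MPI_Bcast(Bk, m * m, MPI_DOUBLE, k, col_comm);
            /* local multiply-accumulate                                */
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++)
                    for (int t = 0; t < m; t++)
                        C_loc[i * m + j] += Ak[i * m + t] * Bk[t * m + j];
        }
        free(Ak); free(Bk);
        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }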


International Conference on Cluster Computing | 2007

Satisfying your dependencies with SuperMatrix

Ernie Chan; F.G. Van Zee; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí; R.A. van de Geijn

SuperMatrix out-of-order scheduling leverages high-level abstractions and straightforward data dependency analysis to provide a general-purpose mechanism for obtaining parallelism from a wide range of linear algebra operations. Viewing submatrices as the fundamental unit of data allows us to decompose operations into component tasks that operate upon these submatrices. Data dependencies between tasks are determined by observing the submatrix blocks read from and written to by each task. We employ the same dynamic out-of-order execution techniques traditionally exploited by modern superscalar micro-architectures to execute tasks in parallel according to data dependencies within linear algebra operations. This paper provides a general explanation of the SuperMatrix implementation followed by empirical evidence of its broad applicability through performance results of several standard linear algebra operations on a wide range of computer architectures.
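
The dependence test underneath that analysis is compact: two tasks must be ordered whenever they touch a common block and at least one of them writes it; otherwise they may run in parallel. A minimal self-contained sketch, with a task encoding that is ours rather than SuperMatrix's:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        const char *name;
        int  blocks[4];   /* ids of blocks touched; -1 ends the list */
        bool writes[4];   /* does the task write blocks[i]?          */
    } Task;

    /* Tasks conflict iff they share a block and at least one writes it
     * (flow, anti, or output dependence). */
    static bool depends(const Task *a, const Task *b)
    {
        for (int i = 0; i < 4 && a->blocks[i] >= 0; i++)
            for (int j = 0; j < 4 && b->blocks[j] >= 0; j++)
                if (a->blocks[i] == b->blocks[j] &&
                    (a->writes[i] || b->writes[j]))
                    return true;
        return false;   /* disjoint or read-only overlap: run in parallel */
    }

    int main(void)
    {
        /* Cholesky fragment: CHOL writes block 0; TRSM reads 0, writes 1;
         * SYRK reads 1, writes 2. */
        Task chol = { "CHOL", {0, -1, -1, -1}, {true} };
        Task trsm = { "TRSM", {0, 1, -1, -1}, {false, true} };
        Task syrk = { "SYRK", {1, 2, -1, -1}, {false, true} };

        printf("TRSM must follow CHOL: %d\n", depends(&chol, &trsm));
        printf("SYRK independent of CHOL: %d\n", !depends(&chol, &syrk));
        return 0;
    }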


International Parallel and Distributed Processing Symposium | 2001

Parallel out-of-core Cholesky and QR factorizations with POOCLAPACK

Brian Christopher Gunter; Wesley Reiley; R.A. van de Geijn

In this paper, the parallel implementation of out-of-core Cholesky factorization is used to introduce the Parallel Out-of-Core Linear Algebra Package (POOCLAPACK), a flexible infrastructure for parallel implementation of out-of-core linear algebra operations. POOCLAPACK builds on the Parallel Linear Algebra Package (PLAPACK) for in-core parallel dense linear algebra computation. Despite the extreme simplicity of POOCLAPACK, the out-of-core Cholesky factorization implementation is shown to achieve up to 80% of peak performance on a 64-node configuration of the Cray T3E-600. The insights gained from examining the Cholesky factorization are also applied to the much more difficult and important QR factorization operation. Preliminary results for a parallel implementation of the resulting OOC QR factorization algorithm are included.
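
The essential out-of-core pattern is a left-looking tile algorithm that keeps only a handful of tiles in memory at a time: fetch a tile, update it against previously factored tiles, factor or solve, and write it back. A minimal sketch of such a Cholesky factorization with the "disk" simulated by an in-memory array; load_tile/store_tile are hypothetical stand-ins for out-of-core I/O, and the caller is assumed to have filled disk with a symmetric positive definite matrix:

    #include <math.h>
    #include <string.h>

    #define NT 3   /* tiles per side */
    #define B  2   /* tile dimension */

    static double disk[NT][NT][B][B];   /* simulated out-of-core storage */

    static void load_tile(double t[B][B], int i, int j)
    { memcpy(t, disk[i][j], sizeof(double) * B * B); }
    static void store_tile(const double t[B][B], int i, int j)
    { memcpy(disk[i][j], t, sizeof(double) * B * B); }

    /* Left-looking tile Cholesky: at most three tiles in memory at once. */
    void ooc_cholesky(void)
    {
        double T[B][B], L1[B][B], L2[B][B];
        for (int j = 0; j < NT; j++)
            for (int i = j; i < NT; i++) {
                load_tile(T, i, j);
                for (int k = 0; k < j; k++) {   /* T -= L[i][k]*L[j][k]^T */
                    load_tile(L1, i, k);
                    load_tile(L2, j, k);
                    for (int r = 0; r < B; r++)
                        for (int c = 0; c < B; c++)
                            for (int t = 0; t < B; t++)
                                T[r][c] -= L1[r][t] * L2[c][t];
                }
                if (i == j) {                   /* factor diagonal tile   */
                    for (int c = 0; c < B; c++) {
                        T[c][c] = sqrt(T[c][c]);
                        for (int r = c + 1; r < B; r++)
                            T[r][c] /= T[c][c];
                        for (int cc = c + 1; cc < B; cc++)
                            for (int r = cc; r < B; r++)
                                T[r][cc] -= T[r][c] * T[cc][c];
                    }
                } else {                        /* T <- T * L[j][j]^{-T}  */
                    load_tile(L1, j, j);
                    for (int r = 0; r < B; r++)
                        for (int c = 0; c < B; c++) {
                            for (int t = 0; t < c; t++)
                                T[r][c] -= T[r][t] * L1[c][t];
                            T[r][c] /= L1[c][c];
                        }
                }
                store_tile(T, i, j);            /* write tile back out    */
            }
    }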


IEEE International Conference on High Performance Computing Data and Analytics | 1994

Matrix-vector multiplication and conjugate gradient algorithms on distributed memory computers

J.G. Lewis; David G. Payne; R.A. van de Geijn

The critical bottlenecks in the implementation of the conjugate gradient algorithm on distributed memory computers are the communication requirements of the sparse matrix-vector multiply and of the vector recurrences. In a previous paper (G. Lewis et al., 1993), we described the data distribution and communication patterns of several implementations of parallel matrix-vector multiplication, demonstrating that on hypercubes, the cost of communication can be overcome to a much larger extent than is often assumed. In this paper, we generalize the best of those implementations to mesh architectures. We make no assumptions about the mesh being square or power-of-two. We also comment on the implications of our results for structured problems and on the scalability of our approach. Results are presented for the implementation of these algorithms on the Intel Touchstone Delta and Paragon mesh multicomputers.
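
On a mesh, the product y = Ax decomposes into three phases: broadcast the pieces of x down each mesh column, multiply locally with the owned block of A, and sum the partial results across each mesh row. A minimal MPI sketch with hypothetical names, addressing the mesh through split communicators; as in the paper, nothing here requires the mesh to be square or power-of-two:

    #include <mpi.h>
    #include <stdlib.h>

    /* y = A*x on a mesh with mesh_cols columns.  Each rank owns an
     * m_loc x n_loc block of A; the top row of the mesh holds x in
     * n_loc-sized pieces, and the leftmost column receives y. */
    void mesh_matvec(const double *A_loc, double *x_loc, double *y_loc,
                     int m_loc, int n_loc, int mesh_cols, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int my_row = rank / mesh_cols, my_col = rank % mesh_cols;

        MPI_Comm row_comm, col_comm;    /* communicators along mesh axes */
        MPI_Comm_split(comm, my_row, my_col, &row_comm);
        MPI_Comm_split(comm, my_col, my_row, &col_comm);

        /* 1. broadcast each column's piece of x down that mesh column */
        MPI_Bcast(x_loc, n_loc, MPI_DOUBLE, 0, col_comm);

        /* 2. local partial product with the owned block of A */
        double *part = malloc(m_loc * sizeof(double));
        for (int i = 0; i < m_loc; i++) {
            part[i] = 0.0;
            for (int j = 0; j < n_loc; j++)
                part[i] += A_loc[i * n_loc + j] * x_loc[j];
        }

        /* 3. sum partials across each mesh row onto its first rank */
        MPI_Reduce(part, y_loc, m_loc, MPI_DOUBLE, MPI_SUM, 0, row_comm);

        free(part);
        MPI_Comm_free(&row_comm);
        MPI_Comm_free(&col_comm);
    }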

Collaboration


Dive into R.A. van de Geijn's collaborations.

Top Co-Authors

Ernie Chan
University of Texas at Austin

F.G. Van Zee
University of Texas at Austin

Avi Purkayastha
University of Texas at Austin

E. Barragy
University of Texas at Austin

J.T. Oden
University of Texas at Austin

Marcel Heimlich
University of Texas at Austin

P. Geng
University of Texas at Austin