Ernie Chan
University of Texas at Austin
Publication
Featured research published by Ernie Chan.
ACM Transactions on Mathematical Software | 2009
Gregorio Quintana-Ortí; Enrique S. Quintana-Ortí; Robert A. van de Geijn; Field G. Van Zee; Ernie Chan
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity while experimental results suggest that high performance is abundantly achievable.
ACM Symposium on Parallel Algorithms and Architectures | 2007
Ernie Chan; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí; Robert A. van de Geijn
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16 CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
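The idea of treating blocks as the units of data and block operations as the units of computation can be illustrated with a small sketch. This is not code from the paper or from any FLAME library. The function names (`to_blocks`, `from_blocks`, `gemm_block`, `gemm_by_blocks`) are hypothetical, and matrix multiplication is chosen here only because it is the simplest operation to decompose. The sketch assumes square matrices whose size is a multiple of the block size.

```python
def to_blocks(matrix, b):
    """Split a square matrix (list of rows) into a grid of b-by-b blocks."""
    n = len(matrix)
    assert n % b == 0, "sketch assumes the block size divides n"
    nb = n // b
    return [[[row[j * b:(j + 1) * b] for row in matrix[i * b:(i + 1) * b]]
             for j in range(nb)] for i in range(nb)]


def from_blocks(blocks):
    """Reassemble a grid of blocks into a flat list of rows."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for block in block_row for x in block[r]])
    return rows


def gemm_block(c, a, b_blk):
    """Unit of computation: c += a * b_blk on single blocks, in place."""
    for i in range(len(a)):
        for j in range(len(b_blk[0])):
            c[i][j] += sum(a[i][k] * b_blk[k][j] for k in range(len(b_blk)))


def gemm_by_blocks(A, B, C):
    """Compute C += A * B where A, B, C are grids of blocks.

    Each (i, j, p) triple is one task that touches exactly three blocks.
    The tasks are listed first and executed afterwards, which is the point
    where a runtime system could reorder or parallelize them.
    """
    nb = len(A)
    tasks = [(i, j, p) for i in range(nb) for j in range(nb) for p in range(nb)]
    for i, j, p in tasks:
        gemm_block(C[i][j], A[i][p], B[p][j])
    return len(tasks)
```

For a 4-by-4 matrix with 2-by-2 blocks, `gemm_by_blocks` enumerates eight block tasks. Tasks that update different blocks of `C` are independent of one another, which is the kind of parallelism a scheduler can exploit.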
Concurrency and Computation: Practice and Experience | 2007
Ernie Chan; Marcel Heimlich; Avi Purkayastha; Robert A. van de Geijn
We discuss the design and high-performance implementation of collective communications operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2008
Ernie Chan; Field G. Van Zee; Paolo Bientinesi; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí; Robert A. van de Geijn
This paper describes SuperMatrix, a runtime system that parallelizes matrix operations for SMP and/or multi-core architectures. We use this system to demonstrate how code described at a high level of abstraction can achieve high performance on such architectures while completely hiding the parallelism from the library programmer. The key insight entails viewing matrices hierarchically, consisting of blocks that serve as units of data where operations over those blocks are treated as units of computation. The implementation transparently enqueues the required operations, internally tracking dependencies, and then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchitectures. This separation of concerns allows library developers to implement algorithms without concerning themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can be implemented in the runtime system independent of the code that enqueues the operations. Results gathered on a 16 CPU ccNUMA Itanium2 server demonstrate excellent performance.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2006
Ernie Chan; Robert A. van de Geijn; William Gropp; Rajeev Thakur
Traditional collective communication algorithms are designed with the assumption that a node can communicate with only one other node at a time. On new parallel architectures such as the IBM Blue Gene/L, a node can communicate with multiple nodes simultaneously. We have redesigned and reimplemented many of the MPI collective communication algorithms to take advantage of this ability to send simultaneously, including broadcast, reduce(-to-one), scatter, gather, allgather, reduce-scatter, and allreduce. We show that these new algorithms have lower expected costs than the previously known lower bounds based on old models of parallel computation. Results are included comparing their performance to the default implementations in IBM's MPI.
Computing in Science and Engineering | 2009
Field G. Van Zee; Ernie Chan; Robert A. van de Geijn; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí
Researchers from the Formal Linear Algebra Method Environment (Flame) project have developed new methodologies for analyzing, designing, and implementing linear algebra libraries. These solutions, which have culminated in the libflame library, seem to solve many of the programmability problems that have arisen with the advent of multicore and many-core architectures.
International Conference on Cluster Computing | 2004
Ernie Chan; Marcel Heimlich; Avi Purkayastha; R.A. van de Geijn
We discuss issues related to the high-performance implementation of collective communications operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Pentium 4 (R) processor cluster are included.
Parallel, Distributed and Network-Based Processing | 2008
Gregorio Quintana-Ortí; Enrique S. Quintana-Ortí; Ernie Chan; R.A. van de Geijn; F.G. Van Zee
This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors and an SMP architecture with 16 cores.
International Conference on Cluster Computing | 2007
Ernie Chan; F.G. Van Zee; Enrique S. Quintana-Ortí; Gregorio Quintana-Ortí; R.A. van de Geijn
SuperMatrix out-of-order scheduling leverages high-level abstractions and straightforward data dependency analysis to provide a general-purpose mechanism for obtaining parallelism from a wide range of linear algebra operations. Viewing submatrices as the fundamental unit of data allows us to decompose operations into component tasks that operate upon these submatrices. Data dependencies between tasks are determined by observing the submatrix blocks read from and written to by each task. We employ the same dynamic out-of-order execution techniques traditionally exploited by modern superscalar micro-architectures to execute tasks in parallel according to data dependencies within linear algebra operations. This paper provides a general explanation of the SuperMatrix implementation followed by empirical evidence of its broad applicability through performance results of several standard linear algebra operations on a wide range of computer architectures.
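The dependency analysis described in this abstract, where dependencies follow from the blocks each task reads and writes, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the SuperMatrix implementation: the `Task` type, the function names, and the grouping of tasks into "waves" are all hypothetical, and task names are assumed to be unique. The example task list models the first iteration of a blocked Cholesky factorization on a 3-by-3 grid of blocks, chosen here only as a familiar case.

```python
from collections import namedtuple

# A task names the blocks it reads and the blocks it writes (hypothetical type).
Task = namedtuple("Task", ["name", "reads", "writes"])


def find_dependencies(tasks):
    """Map each task name to the earlier tasks it must wait for.

    A later task depends on an earlier one if they conflict on a block:
    read-after-write, write-after-read, or write-after-write.
    """
    deps = {t.name: set() for t in tasks}
    for idx, later in enumerate(tasks):
        for earlier in tasks[:idx]:
            raw = earlier.writes & later.reads
            war = earlier.reads & later.writes
            waw = earlier.writes & later.writes
            if raw or war or waw:
                deps[later.name].add(earlier.name)
    return deps


def schedule_waves(tasks, deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    level = {}
    for t in tasks:  # program order: every dependency already has a level
        level[t.name] = 1 + max((level[d] for d in deps[t.name]), default=-1)
    waves = [[] for _ in range(max(level.values()) + 1)]
    for t in tasks:
        waves[level[t.name]].append(t.name)
    return waves


def cholesky_first_step_tasks():
    """Tasks of one iteration of blocked Cholesky on a 3x3 grid of blocks."""
    return [
        Task("CHOL(A00)", {"A00"}, {"A00"}),
        Task("TRSM(A10)", {"A00", "A10"}, {"A10"}),
        Task("TRSM(A20)", {"A00", "A20"}, {"A20"}),
        Task("SYRK(A11)", {"A10", "A11"}, {"A11"}),
        Task("GEMM(A21)", {"A10", "A20", "A21"}, {"A21"}),
        Task("SYRK(A22)", {"A20", "A22"}, {"A22"}),
    ]
```

Calling `schedule_waves(tasks, find_dependencies(tasks))` on the example list yields three waves: the factorization of `A00`, then the two triangular solves, then the three trailing updates. A dynamic runtime would instead dispatch each task as soon as its dependencies complete. The wave grouping is a simplification that only makes the available parallelism visible.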
ACM Symposium on Parallel Algorithms and Architectures | 2010
Ernie Chan; Robert A. van de Geijn; Andrew Chapman
We describe parallel implementations of LU factorization with pivoting for multicore architectures. Implementations that differ along two dimensions are discussed: (1) using classical partial pivoting versus recently proposed incremental pivoting and (2) extracting parallelism only within the Basic Linear Algebra Subprograms versus building and scheduling a directed acyclic graph of tasks. Performance comparisons are given on two different systems.