Tze Meng Low
Carnegie Mellon University
Publication
Featured research published by Tze Meng Low.
ACM Transactions on Mathematical Software | 2016
Field G. Van Zee; Tyler M. Smith; Bryan Marker; Tze Meng Low; Robert A. van de Geijn; Francisco D. Igual; Mikhail Smelyanskiy; Xianyi Zhang; Michael Kistler; Vernon Austel; John A. Gunnels; Lee Killough
BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.
ACM Transactions on Mathematical Software | 2006
Thierry Joffrain; Tze Meng Low; Enrique S. Quintana-Ortí; Robert A. van de Geijn; Field G. Van Zee
A theorem related to the accumulation of Householder transformations into a single orthogonal transformation known as the compact WY transform is presented. It provides a simple characterization of the computation of this transformation and suggests an alternative algorithm for computing it. It also suggests an alternative transformation, the UT transform, which has the same utility as the compact WY transform but requires less computation and has similar stability properties. That alternative transformation was first published over a decade ago but has gone unnoticed by the community.
ACM Transactions on Mathematical Software | 2016
Tze Meng Low; Francisco D. Igual; Tyler M. Smith; Enrique S. Quintana-Ortí
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).
ACM Transactions on Mathematical Software | 2008
Field G. Van Zee; Paolo Bientinesi; Tze Meng Low; Robert A. van de Geijn
We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
SIAM Journal on Scientific Computing | 2014
Martin D. Schatz; Tze Meng Low; Robert A. van de Geijn; Tamara G. Kolda
Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation are in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (blocked compact symmetric storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm by blocks, already shown to be of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of O(m!) and computational requirements by a factor of O((m+1)!/2^m), where m
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005
Tze Meng Low; Robert A. van de Geijn; Field G. Van Zee
IEEE Control Systems Magazine | 2017
Franz Franchetti; Tze Meng Low; Stefan Mitsch; Juan Pablo Mendoza; Liangyan Gui; Amarin Phaosawasdi; David A. Padua; Soummya Kar; José M. F. Moura; Michael Franusich; Jeremy R. Johnson; André Platzer; Manuela M. Veloso
International Parallel and Distributed Processing Symposium | 2016
Nikolaos Alachiotis; Thom Popovici; Tze Meng Low
IEEE High Performance Extreme Computing Conference | 2016
Richard Veras; Tze Meng Low; Franz Franchetti
International Symposium on Microarchitecture | 2015
Qi Guo; Tze Meng Low; Nikolaos Alachiotis; Berkin Akin; Lawrence T. Pileggi; James C. Hoe; Franz Franchetti