Tze Meng Low | Researchain

Publication

Featured research published by Tze Meng Low.

ACM Transactions on Mathematical Software | 2016

The BLIS Framework: Experiments in Portability

Field G. Van Zee; Tyler M. Smith; Bryan Marker; Tze Meng Low; Robert A. van de Geijn; Francisco D. Igual; Mikhail Smelyanskiy; Xianyi Zhang; Michael Kistler; Vernon Austel; John A. Gunnels; Lee Killough

BLIS is a new software framework for instantiating high-performance BLAS-like dense linear algebra libraries. We demonstrate how BLIS acts as a productivity multiplier by using it to implement the level-3 BLAS on a variety of current architectures. The systems for which we demonstrate the framework include state-of-the-art general-purpose, low-power, and many-core architectures. We show, with very little effort, how the BLIS framework yields sequential and parallel implementations that are competitive with the performance of ATLAS, OpenBLAS (an effort to maintain and extend the GotoBLAS), and commercial vendor implementations such as AMD’s ACML, IBM’s ESSL, and Intel’s MKL libraries. Although most of this article focuses on single-core implementation, we also provide compelling results that suggest the framework’s leverage extends to the multithreaded domain.

ACM Transactions on Mathematical Software | 2006

Accumulating Householder transformations, revisited

Thierry Joffrain; Tze Meng Low; Enrique S. Quintana-Ortí; Robert A. van de Geijn; Field G. Van Zee

A theorem related to the accumulation of Householder transformations into a single orthogonal transformation, known as the compact WY transform, is presented. It provides a simple characterization of the computation of this transformation and suggests an alternative algorithm for computing it. It also suggests an alternative transformation, the UT transform, which has the same utility as the compact WY transform but requires less computation and has similar stability properties. That alternative transformation was first published over a decade ago but has gone unnoticed by the community.
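The abstract does not spell out the UT transform, so the following is only a minimal sketch under stated assumptions: real-valued vectors, reflectors written as H_j = I - u_j u_j^T / tau_j with tau_j = u_j^T u_j / 2, and an upper-triangular factor S whose strictly upper part equals that of U^T U and whose diagonal holds the tau_j, so that H_0 H_1 ... H_{k-1} = I - U S^{-1} U^T. The function names `ut_factor` and `apply_accumulated` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def ut_factor(U):
    """Build the upper-triangular S assumed in this sketch.

    The strictly upper part is copied from U^T U; the diagonal is
    tau_j = u_j^T u_j / 2.
    """
    G = U.T @ U
    return np.triu(G, 1) + np.diag(np.diag(G) / 2.0)

def apply_accumulated(U, S, X):
    """Compute (I - U S^{-1} U^T) X without forming S^{-1} explicitly.

    A general solver is used for brevity; a triangular solver would
    exploit the structure of S.
    """
    return X - U @ np.linalg.solve(S, U.T @ X)

rng = np.random.default_rng(0)
m, k = 6, 3
# Lower-trapezoidal vectors, as a QR factorization would produce (illustrative data).
U = np.tril(rng.standard_normal((m, k)))
S = ut_factor(U)

# Reference: multiply the k reflectors together one at a time.
explicit = np.eye(m)
for j in range(k):
    u = U[:, [j]]
    tau = (u.T @ u).item() / 2.0
    explicit = explicit @ (np.eye(m) - (u @ u.T) / tau)

accumulated = apply_accumulated(U, S, np.eye(m))
max_error = float(np.max(np.abs(explicit - accumulated)))
orth_error = float(np.max(np.abs(accumulated.T @ accumulated - np.eye(m))))
```

In this formulation S is assembled from U^T U and then used in a solve, whereas the compact WY form stores an explicitly computed triangular factor T that is multiplied.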

ACM Transactions on Mathematical Software | 2016

Analytical Modeling Is Enough for High-Performance BLIS

Tze Meng Low; Francisco D. Igual; Tyler M. Smith; Enrique S. Quintana-Ortí

We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight).

ACM Transactions on Mathematical Software | 2008

Scalable parallelization of FLAME code via the workqueuing model

Field G. Van Zee; Paolo Bientinesi; Tze Meng Low; Robert A. van de Geijn

We discuss the OpenMP parallelization of linear algebra algorithms that are coded using the Formal Linear Algebra Methods Environment (FLAME) API. This API expresses algorithms at a higher level of abstraction, avoids the use of loop and array indices, and represents these algorithms as they are formally derived and presented. We report on two implementations of the workqueuing model, neither of which requires the use of explicit indices to specify parallelism. The first implementation uses the experimental taskq pragma, which may influence the adoption of a similar construct into OpenMP 3.0. The second workqueuing implementation is domain-specific to FLAME but allows us to illustrate the benefits of sorting tasks according to their computational cost prior to parallel execution. In addition, we discuss how scalable parallelization of dense linear algebra algorithms via OpenMP will require a two-dimensional partitioning of operands, much like a 2D data distribution is needed on distributed memory architectures. We illustrate the issues and solutions by discussing the parallelization of the symmetric rank-k update and report impressive performance on an SGI system with 14 Itanium2 processors.
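The benefit of sorting tasks by cost before execution can be illustrated with a small simulation. This is a sketch, not the FLAME or OpenMP implementation: the function `simulate_workqueue`, the task costs, and the worker count are all hypothetical, and the model simply assumes that whichever worker becomes idle first takes the next task from the queue.

```python
import heapq

def simulate_workqueue(costs, num_workers):
    """Return the finish time of the last worker (the makespan).

    Tasks are dequeued in the given order; each one goes to the
    worker that becomes idle earliest.
    """
    free_at = [0] * num_workers  # min-heap of times at which workers become idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Hypothetical task costs in arbitrary time units; the largest task is enqueued last.
costs = [1, 1, 1, 1, 2, 2, 4, 8]

unsorted_makespan = simulate_workqueue(costs, num_workers=4)
sorted_makespan = simulate_workqueue(sorted(costs, reverse=True), num_workers=4)

print(unsorted_makespan, sorted_makespan)  # prints "9 8"
```

With the expensive task scheduled first, the cheaper tasks fill in around it and the last worker finishes sooner. In the unsorted order the most expensive task starts late and determines the finish time on its own.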

SIAM Journal on Scientific Computing | 2014

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

Martin D. Schatz; Tze Meng Low; Robert A. van de Geijn; Tamara G. Kolda

Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation are in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (blocked compact symmetric storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm by blocks, already shown to be of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of O(m!) and computational requirements by a factor of ...

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Extracting SMP parallelism for dense linear algebra algorithms from high-level specifications

Tze Meng Low; Robert A. van de Geijn; Field G. Van Zee

IEEE Control Systems Magazine | 2017

High-Assurance SPIRAL: End-to-End Guarantees for Robot and Car Control

Franz Franchetti; Tze Meng Low; Stefan Mitsch; Juan Pablo Mendoza; Liangyan Gui; Amarin Phaosawasdi; David A. Padua; Soummya Kar; José M. F. Moura; Michael Franusich; Jeremy R. Johnson; André Platzer; Manuela M. Veloso

International Parallel and Distributed Processing Symposium | 2016