
Publication


Featured research published by Chua-Huang Huang.


Journal of Parallel and Distributed Computing | 1996

Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines

Sandeep K. S. Gupta; S. D. Kaushik; Chua-Huang Huang; P. Sadayappan

Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overheads. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that using the virtual processor approach, efficient code can be generated for execution of array statements involving block-cyclically distributed arrays.
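The index arithmetic behind such closed forms can be illustrated with a small sketch (hypothetical helper names, not the paper's code): under a block-cyclic distribution with block size b on P processors, blocks of b consecutive elements are dealt out cyclically, so ownership and the local access sequence follow directly from integer division by b.

```python
def block_cyclic_owner(i, b, P):
    # Global index i falls in block i // b; blocks are dealt out
    # cyclically, so the owning processor is (i // b) mod P.
    return (i // b) % P

def local_indices(p, n, b, P):
    # Enumerate the global indices owned by processor p, in the order
    # they appear in p's local memory (block by block).  Processor p
    # owns blocks p, p + P, p + 2P, ... -- the "virtual-block" view of
    # the distribution makes this stride explicit.
    idx = []
    block = p
    while block * b < n:
        start = block * b
        idx.extend(range(start, min(start + b, n)))
        block += P
    return idx
```

For example, with n = 10 elements, block size 2, and 2 processors, processor 0 owns global indices 0, 1, 4, 5, 8, 9 in local-memory order.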


International Conference on Parallel Processing | 1993

On Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines

Sandeep K. S. Gupta; S. D. Kaushik; S. Mufti; Sanjay Sharma; Chua-Huang Huang; P. Sadayappan

Efficient generation of communication sets and local index sets is important for the evaluation of array expressions in scientific languages such as Fortran 90 and High Performance Fortran implemented on distributed-memory machines. We show that for arrays affinely aligned with templates that are distributed on multiple processors with a block-cyclic distribution, the local memory access sequence and communication sets can be efficiently enumerated using closed forms. First, closed form solutions are presented for arrays aligned with the identity template and distributed using block or cyclic distributions.


Journal of Parallel and Distributed Computing | 1993

Communication-free hyperplane partitioning of nested loops

Chua-Huang Huang; P. Sadayappan

This paper addresses the problem of partitioning the iterations of nested loops, and data arrays accessed by the loops. Hyperplane partitions of disjoint subsets of data arrays and loop iterations that result in the elimination of communication are sought. A characterization of necessary and sufficient conditions for communication-free hyperplane partitioning is provided.
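A toy instance of the idea (not the paper's formalism): in the loop nest below, every iteration (i, j) touches only row i of A and element B[i], so the hyperplane family i = const partitions both iterations and data into disjoint sets that can execute with no communication.

```python
def loop_nest(A, B):
    # Each iteration (i, j) reads A[i][j-1] and B[i] and writes
    # A[i][j]: no reference ever crosses rows.  Assigning disjoint
    # sets of rows (hyperplanes i = const) of both the iteration
    # space and the arrays to different processors therefore
    # eliminates all communication.
    n, m = len(A), len(A[0])
    for i in range(n):
        for j in range(1, m):
            A[i][j] = A[i][j - 1] + B[i]
    return A
```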


International Conference on Supercomputing | 1994

An approach to communication-efficient data redistribution

S. D. Kaushik; Chua-Huang Huang; Rodney W. Johnson; P. Sadayappan

We address the development of efficient methods for performing data redistribution of arrays on distributed-memory machines. Data redistribution is important for the distributed-memory implementation of data parallel languages such as High Performance Fortran. An algebraic representation of regular data distributions is used to develop an analytical model for evaluating the communication cost of data redistribution. Using this algebraic representation and the analytical model, an approach to communication-efficient data redistribution is developed. Implementation results on the Intel iPSC/860 are reported.
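The algebraic representation itself is beyond a short sketch, but the communication sets it describes can be enumerated directly for one concrete case (hypothetical helper names; a block-to-cyclic redistribution, not the paper's general method):

```python
def owner_block(i, n, P):
    # Owner under a block distribution: ceil(n / P) consecutive
    # elements per processor.
    b = -(-n // P)
    return i // b

def owner_cyclic(i, P):
    # Owner under a cyclic distribution: elements dealt out one at
    # a time, round-robin.
    return i % P

def redistribution_sends(n, P):
    # For redistributing an n-element block-distributed array over P
    # processors into a cyclic distribution, list which elements each
    # source processor must send to each destination.  The sizes of
    # these sets are what a communication-cost model would count.
    sends = {}
    for i in range(n):
        src, dst = owner_block(i, n, P), owner_cyclic(i, P)
        if src != dst:
            sends.setdefault((src, dst), []).append(i)
    return sends
```

With n = 8 and P = 2, processor 0 sends its odd-indexed elements 1 and 3 to processor 1, and processor 1 sends its even-indexed elements 4 and 6 back.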


International Parallel Processing Symposium | 1995

Multi-phase array redistribution: modeling and evaluation

S. D. Kaushik; Chua-Huang Huang; J. Ramanujam; P. Sadayappan

Array redistribution is used in languages such as High Performance Fortran to allow programmers to dynamically change the distribution of arrays across processors. Distributed-memory implementations of several scientific applications require array redistribution. In this paper, efficient methods for performing array redistribution are presented. Precise closed forms for determining the processors involved in the communication and the data elements to be communicated are developed for two special cases of array redistribution involving block-cyclically distributed arrays. The general array redistribution problem involving block-cyclically distributed arrays can be expressed in terms of these special cases. Using the closed forms, a cost model for estimating the communication overhead for array redistribution is developed. A multi-phase approach for reducing the communication cost of array redistribution is presented. Experimental results on the Cray T3D to evaluate the multi-phase approach are provided.


Conference on High Performance Computing (Supercomputing) | 1993

Efficient transposition algorithms for large matrices

S. D. Kaushik; Chua-Huang Huang; John R. Johnson; Rodney W. Johnson; P. Sadayappan

The authors present transposition algorithms for matrices that do not fit in main memory. Transposition is interpreted as a permutation of the vector obtained by mapping a matrix to linear memory. Algorithms are derived from factorizations of this permutation, using a class of permutations related to the tensor product. Using this formulation of transposition, the authors first obtain several known algorithms and then they derive a new algorithm which reduces the number of disk accesses required. The new algorithm was compared to existing algorithms using an implementation on the Intel iPSC/860. This comparison shows the benefits of the new algorithm.
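A minimal in-memory sketch of the tile-by-tile access pattern such out-of-core algorithms rely on (the tiles stand in for disk blocks; this is not the paper's tensor-product derivation):

```python
def blocked_transpose(A, tile):
    # Transpose A tile by tile, so each pass touches one contiguous
    # tile of the source and one of the destination.  Out of core,
    # each tile would correspond to a disk block, and grouping the
    # work this way is what reduces the number of disk accesses.
    n, m = len(A), len(A[0])
    T = [[0] * n for _ in range(m)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    T[j][i] = A[i][j]
    return T
```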


Languages and Compilers for Parallel Computing | 1991

Communication-Free Hyperplane Partitioning of Nested Loops

Chua-Huang Huang; P. Sadayappan

This paper addresses the problem of partitioning the iterations of nested loops, and data arrays accessed by the loops. Hyperplane partitions of disjoint subsets of data arrays and loop iterations that result in the elimination of communication are sought. A characterization of necessary and sufficient conditions for communication-free hyperplane partitioning is provided.


Applied Mathematics Letters | 1990

A tensor product formulation of Strassen's matrix multiplication algorithm

Chua-Huang Huang; Jeremy R. Johnson; Rodney W. Johnson

Tensor product notation is used to derive an iterative version of Strassen's matrix multiplication algorithm.
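For reference, the algorithm in its standard recursive form (the paper derives an iterative version via tensor products; this sketch is the textbook recursion, assuming a power-of-two matrix order):

```python
def add(X, Y): return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]
def sub(X, Y): return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def split(M):
    # Split a square matrix into four half-size quadrants.
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B):
    # Strassen multiplication of square matrices whose order is a
    # power of two: 7 half-size products instead of 8.
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)] +
            [r1 + r2 for r1, r2 in zip(C21, C22)])
```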


International Journal of Parallel Programming | 1998

Reuse-Driven Tiling for Improving Data Locality

Jingling Xue; Chua-Huang Huang

This paper applies unimodular transformations and tiling to improve data locality of a loop nest. Due to data dependences and reuse information, not all dimensions of the iteration space will and can be tiled. By using cones to represent data dependences and vector spaces to quantify data reuse in the program, a reuse-driven transformational approach is presented, which aims at maximizing the amount of data reuse carried in the tiled dimensions of the iteration space while keeping the number of tiled dimensions to a minimum (to reduce loop control overhead). In the special case of one single fully permutable loop nest, an algorithm is presented that tiles the program optimally so that all data reuse is carried in the tiled dimensions. In the general case of multiple fully permutable loop nests, data dependences can prevent all data reuse to be carried in the tiled dimensions. An algorithm is presented that aims at localizing data reuse in the tiled dimensions so that the reuse space localized has the largest dimensionality possible.
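Tiling itself can be illustrated with the textbook blocked matrix multiply below (a generic example of the transformation, not the paper's reuse-driven algorithm; T is an assumed tile size):

```python
def matmul_tiled(A, B, T):
    # All three loops of C = A * B carry reuse, so all three are
    # tiled: within one (ii, jj, kk) tile, a T-by-T working set of
    # each array is revisited while it is still in cache.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The transformation changes only the iteration order, not the result, so any tile size yields the same product.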


Languages and Compilers for Parallel Computing | 1997

Reuse-Driven Tiling for Data Locality

Jingling Xue; Chua-Huang Huang

This paper applies unimodular transformations and tiling to improve the data locality of a loop nest. Due to data dependences and reuse information, not all loops will and can be tiled. Therefore, the approach proposed in this paper attempts to capture as much data reuse in the cache as possible while tiling as few loops as possible. By using cones to represent the data dependences and vector spaces to represent the reuse information in the program, a reuse-driven approach is presented to improve the data locality of the program. In the special case of a single fully permutable loop nest, the data locality problem is formulated as an optimisation problem and solved optimally. In the general case, an algorithm is presented that attempts to construct the tiled loop nest in such a way that as much reuse as possible is carried in the innermost tiled loops.

Collaboration


Dive into Chua-Huang Huang's collaboration.

Top Co-Authors

Rodney W. Johnson

St. Cloud State University

Jang-Ping Sheu

National Tsing Hua University