Haigeng Wang
University of California, Irvine
Publication
Featured research published by Haigeng Wang.
acm sigplan symposium on principles and practice of parallel programming | 1991
Alexandru Nicolau; Haigeng Wang
Given $x_1, \ldots, x_N$, parallel prefix computes $x_1 \circ x_2 \circ \cdots \circ x_k$, for $1 \le k \le N$, with associative operation $\circ$. We show optimal schedules for parallel prefix computation with a fixed number of resources $p \ge 2$ for a prefix of size $N \ge p(p+1)/2$. The time of the optimal schedules with $p$ resources is $\lceil 2N/(p+1) \rceil$ for $N \ge p(p+1)/2$, which we prove to be the strict lower bound (i.e., which is what can be achieved maximally). We then present a pipelined form of optimal schedules with $\lceil 2N/(p+1) \rceil + \lceil (p-1)/2 \rceil - 1$ time, which takes a constant overhead of $\lceil (p-1)/2 \rceil$ time more than the optimal schedules. Parallel prefix is an important common operation in many algorithms including the evaluation of polynomials, general Horner expressions, carry look-ahead circuits and ranking and packing problems. A most important application of parallel prefix is loop parallelizing transformation.
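As a minimal sketch of the quantities in this abstract (not the authors' schedule), the Python below gives a sequential reference for the prefix computation and evaluates the two schedule lengths. It assumes the symbols lost in extraction are `>=`, so the bounds apply for p >= 2 and N >= p(p+1)/2. The function names `prefix`, `optimal_time` and `pipelined_time` are hypothetical.

```python
from operator import add


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def prefix(xs, op=add):
    """Sequential reference: [x1, x1 op x2, ..., x1 op ... op xN]."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out


def optimal_time(n, p):
    """Schedule length quoted in the abstract: ceil(2N / (p + 1)).

    Assumed validity range (reconstructed): p >= 2 and N >= p(p + 1)/2.
    """
    if p < 2 or n < p * (p + 1) // 2:
        raise ValueError("bound is stated only for p >= 2 and N >= p(p+1)/2")
    return ceil_div(2 * n, p + 1)


def pipelined_time(n, p):
    """Pipelined length: ceil(2N / (p + 1)) + ceil((p - 1) / 2) - 1."""
    return optimal_time(n, p) + ceil_div(p - 1, 2) - 1


if __name__ == "__main__":
    print(prefix([1, 2, 3, 4]))
    print(optimal_time(10, 4), pipelined_time(10, 4))
```

The sequential reference takes N - 1 applications of the operation; the formulas only report how long a schedule with p resources would take, they do not construct one.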
IEEE Transactions on Parallel and Distributed Systems | 1996
Haigeng Wang; Alexandru Nicolau; Stephen Keung; Kai-Yeung Siu
Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems, spend a major portion of execution time in their core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and there is a limited amount of parallelism in a BLR with respect to LCDs. For many applications, using library routines to replace the core BLR requires the separation of BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallelizing programs containing BLR and other kinds of code. Significant improvements in CPU performance for a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. Our approach can be used both at the user level in parallel programming code containing BLRs, and in compiler parallelization of such programs combined with recurrence recognition techniques for vector supercomputers.
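To illustrate why a linear recurrence has parallelism beyond its loop-carried dependence, here is a minimal sketch for the simplest case, a first-order recurrence x_i = a_i * x_(i-1) + b_i. It uses the standard trick of composing affine maps with a doubling scan; this is not the Regular Schedule from the paper, and all names are hypothetical.

```python
def sequential(a, b, x0):
    """Evaluate x_i = a_i * x_(i-1) + b_i one step at a time."""
    xs, x = [], x0
    for ai, bi in zip(a, b):
        x = ai * x + bi
        xs.append(x)
    return xs


def compose(f, g):
    """Return the map 'apply f, then g'; (a, b) stands for x -> a*x + b."""
    return (g[0] * f[0], g[0] * f[1] + g[1])


def via_prefix(a, b, x0):
    """Same result through a prefix over map composition (associative).

    Each round of the while loop touches every element independently,
    so a round could run in parallel; there are about log2(N) rounds.
    """
    maps = list(zip(a, b))
    n, step = len(maps), 1
    while step < n:
        maps = [
            maps[i] if i < step else compose(maps[i - step], maps[i])
            for i in range(n)
        ]
        step *= 2
    return [m[0] * x0 + m[1] for m in maps]


if __name__ == "__main__":
    a, b = [2, 3, 1, 5], [1, 0, 4, 2]
    print(sequential(a, b, 1))
    print(via_prefix(a, b, 1))
```

The doubling scan does more total work than the sequential loop, which is the kind of resource versus time tradeoff that bounded-resource schedules are designed to manage.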
european design automation conference | 1992
Haigeng Wang; Nikil D. Dutt; Alexandru Nicolau
Linear difference equations involving recurrences are fundamental equations that describe many important signal processing applications. For many high sample rate digital filter applications, it is necessary to effectively parallelize the linear difference equations used to describe digital filters. This is difficult because of the recurrences inherent in the data dependences. The authors present a novel approach, harmonic scheduling, that exploits parallelism in these recurrences beyond loop-carried dependencies, and which generates optimal schedules for parallel evaluation of linear difference equations with resource constraints. This approach also enables the derivation of a parallel schedule with minimum control overhead, given an execution time with resource constraints. A harmonic scheduling algorithm is presented to generate optimal schedules for digital filters described by second-order difference equations with resource constraints.
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing | 1999
Haigeng Wang; Nikil D. Dutt; Alexandru Nicolau
Linear difference equations involving recurrences are fundamental equations that describe many important signal processes; in particular, infinite-duration impulse response (IIR) filters. Applying conventional dependence-preserving parallelization techniques such as software pipelining can only extract limited parallelism due to loop-carried dependences in the linear recurrences, and thus, cannot achieve scalable speedup given more resources. Furthermore, the previously published scheduling techniques did not address the tradeoffs between resource constraints and the processing speed of the resulting schedules, and thus, do not have the capability of exploring the design space of parallel schedules implementing IIR filters. In this paper, we present a novel approach, based on harmonic scheduling, that addresses the tradeoffs between resource constraints and the processing speed of the resulting schedules, which can be used to explore the design space of scalable parallel schedules implementing IIR filters with resource constraints. The salient features of our approach include a mathematical formulation of the relationship between the schedules, resource constraints and target performance, and capabilities for exploring design space in terms of those parameters. In particular, our approach can be used to successively approximate time-optimal schedules implementing IIR filters for a given target architecture. We illustrate our approach by giving an algorithm for deriving scalable schedules for IIR filters with a fixed number of identical multifunctional processors. As a further illustration, we derive rate-optimal schedules for IIR filters under more realistic constraints: using a fixed number of adders and multipliers and assuming that multiplication and addition take dissimilar execution times.
international conference on vlsi design | 1993
Haigeng Wang; Nikil D. Dutt; Alexandru Nicolau
Loop-Carried Dependencies (LCDs) are ubiquitous in behaviors that describe recurrences. Effective parallelization and scheduling of behaviors involving recurrences is a difficult task but is crucial for several real-time high-performance applications such as DSP. We present a novel approach, Harmonic Scheduling, that exploits the parallelism in these recurrences beyond LCDs and which generates optimal schedules for parallel evaluation of linear difference equations with resource constraints. Based on this approach, we then formulate the problem of and give an algorithm for finding optimal schedules for linear difference equations under more realistic constraints: using a fixed number of adders and multipliers that have different execution times.
european design automation conference | 1993
Haigeng Wang; Nikil D. Dutt; Alexandru Nicolau
The authors present regular schedules, a class of parallel schedules for computing mth-order infinite-impulse response (IIR) filters. These schedules permit the implementation of IIR filters on a family of scalable parallel architectures with varying price/performance characteristics, enabling designers to effectively explore the design space of parallel IIR filter implementations. The technique is illustrated on a target architecture comprising application-specific instruction processors (ASIPs) clustered on multichip modules (MCMs), with the MCMs connected through a scalable interconnection network. The simplicity of the regular schedules facilitates characterization of their interprocessor communications, which makes it possible to generate instruction-level behavior of the design that can be easily mapped onto ASIP architectures. Preliminary results of design space exploration for the fifth-order elliptic wave filter benchmark on the interconnected ASIP architectures are presented.
international symposium on microarchitecture | 1991
Haigeng Wang; Alexandru Nicolau; Roni Potasman
Removing redundant loop induction variables (IVs) in a sequential program can improve code performance by making effective use of registers and reducing the dynamic instruction count in the loop. At the microcode level and in high-performance, fine-grain parallel architectures, it is even more important that a parallelizing compiler be able to remove redundant IVs generated as a by-product of parallelizing transformations. Conventional IV detection algorithms fail to find an IV family with no basic IV. Copy propagation in general cannot transform an IV family with no basic IV into a family with a basic IV. As a result, conventional IV removal methods do not work for more general types of IV families, which often result from loop parallelizing transformations and also exist in sequential programs. Furthermore, IV removal by copy propagation with loop unrolling cannot preserve the semantics of the original code, in addition to being space-inefficient. We present in this paper a new technique for redundant IV removal. It can remove redundant IVs from more general types of IV families without the code size increase that is inevitably incurred by other methods such as loop unwinding and copy propagation with node splitting. It can also be used to determine whether redundant IVs should be removed (i.e., whether removal benefits the overall performance). We then demonstrate the effectiveness of this technique using some benchmarks.
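A hypothetical illustration of the situation described above, written in Python for readability: two variables advance in lockstep, yet neither is updated by a statement of the form `i = i + c`, so there is no basic IV to anchor the family. The example and the function names are assumptions for illustration and are not taken from the paper; the rewrite shown is done by hand, not by the paper's technique.

```python
def with_redundant_ivs(n):
    """Loop in which i and j are defined only in terms of each other."""
    i, j = 0, 1
    out = []
    for _ in range(n):
        out.append((i, j))
        i = j + 2  # i is derived from j ...
        j = i + 1  # ... and j from the new i; both grow by 3 per iteration
    return out


def with_single_iv(n):
    """Equivalent loop keeping one variable; j is recomputed as i + 1."""
    i = 0
    out = []
    for _ in range(n):
        out.append((i, i + 1))
        i += 3  # the only remaining induction variable
    return out


if __name__ == "__main__":
    print(with_redundant_ivs(4))
    print(with_single_iv(4))
```

Both loops produce the same sequence of values, while the second keeps only one variable live across iterations and has the same loop body size.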
languages and compilers for parallel computing | 1991
Alexandru Nicolau; Roni Potasman; Haigeng Wang
This work is supported in part by NSF grant CCR8704367 and ONR grant N0001486K0215.
international conference on supercomputing | 1992
Haigeng Wang; Alexandru Nicolau
design automation conference | 1993
Haigeng Wang; Nikil D. Dutt; Alexandru Nicolau; Kai-Yeung Siu