Xinan Tang
University of Delaware
Publications
Featured research published by Xinan Tang.
international conference on parallel architectures and compilation techniques | 1996
Laurie J. Hendren; Xinan Tang; Yingchun Zhu; Guang R. Gao; Xun Xue; Haiying Cai; Pierre Ouellet
Multithreaded architectures provide an opportunity for efficiently executing programs with irregular parallelism and/or irregular locality. This paper presents a strategy that makes use of the multithreaded execution model without exposing multithreading to the programmer. Our approach is to design simple extensions to C, and to provide compiler support that automatically translates high-level C programs into lower-level threaded programs. In this paper we present EARTH-C, our extended C language, which contains simple constructs for specifying control parallelism, data locality, shared variables and atomic operations. Based on EARTH-C, we describe compiler techniques that are used for translating to lower-level Threaded-C programs for the EARTH multithreaded architecture. We demonstrate our approach with six benchmark programs. We show that even naive EARTH-C programs can lead to reasonable performance, and that more advanced EARTH-C programs can give performance very close to hand-coded Threaded-C programs.
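As a rough illustration of the kind of lowering the abstract describes, the plain-C sketch below mimics the split-phase, sync-counter style of threaded code that such a compiler targets. The `sync_slot` type and the `remote_read` helper are hypothetical stand-ins for illustration; this is not EARTH-C or Threaded-C syntax.

```c
/* Illustrative sketch only: it mimics split-phase, sync-counter threading in
 * plain sequential C; it is not actual EARTH-C or Threaded-C code. */
#include <stdio.h>

/* A "sync slot": a counter that fires a continuation when it reaches zero. */
typedef struct {
    int count;
    void (*continuation)(int result);
} sync_slot;

static void signal_sync(sync_slot *s, int result) {
    if (--s->count == 0)
        s->continuation(result);
}

/* The continuation thread: runs only after the (conceptually remote) value arrives. */
static void use_value(int v) {
    printf("consumer thread sees %d\n", v);
}

/* A stand-in for a split-phase remote read: it "returns" by signalling the slot. */
static void remote_read(int value, sync_slot *s) {
    signal_sync(s, value);   /* in a real runtime this would complete asynchronously */
}

int main(void) {
    sync_slot slot = { .count = 1, .continuation = use_value };
    remote_read(42, &slot);  /* issue the request; the dependent code is a separate thread */
    return 0;
}
```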
acm symposium on parallel algorithms and architectures | 1997
Xinan Tang; Jing Wang; Kevin B. Theobald; Guang R. Gao
There has been considerable interest in implementing a multithreaded program execution and architecture model on a multiprocessor whose primary processors consist of today's off-the-shelf microprocessors. Unlike some custom-designed multithreaded processor architectures, which can interleave multiple threads concurrently, conventional processors can only execute one thread at a time. This presents a unique and challenging problem to the compiler: partition a program into threads so that it executes both correctly and in minimal time. We present a new heuristic algorithm based on an interesting extension of the classical list scheduling algorithm. Based on a cost model, our algorithm groups instructions into threads by considering the trade-offs among parallelism, latency tolerance, thread switching costs and sequential execution efficiency. The proposed algorithm has been implemented, and its performance measured through experiments on a variety of architecture parameters and a wide range of program parameters. The results show that the proposed algorithm is robust, effective, and efficient.
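A much-simplified sketch of the idea follows, assuming a toy rule in which a thread is broken exactly after each long-latency (remote) operation; the published algorithm weighs parallelism, latency tolerance, switch cost, and sequential efficiency with a far richer cost model.

```c
/* Toy list-scheduling-style thread partitioning over a tiny dependence chain.
 * The "break after a remote operation" rule is an illustrative assumption,
 * not the algorithm as published. */
#include <stdio.h>

#define N 6

typedef struct {
    const char *name;
    int is_remote;   /* long-latency operation? */
    int pred;        /* single predecessor index, or -1 (chain-shaped DAG for brevity) */
} node_t;

int main(void) {
    node_t dag[N] = {
        { "a = x + y",         0, -1 },
        { "b = load_remote p", 1,  0 },
        { "c = b * 2",         0,  1 },
        { "d = c + a",         0,  2 },
        { "e = load_remote q", 1,  3 },
        { "f = e - d",         0,  4 },
    };

    int thread_of[N];
    int current_thread = 0;

    /* Walk nodes in a topological (here: given) order and start a new thread
     * whenever a node consumes the result of a remote operation, so the remote
     * latency can be overlapped instead of stalling the thread. */
    for (int i = 0; i < N; i++) {
        int p = dag[i].pred;
        if (p >= 0 && dag[p].is_remote)
            current_thread++;            /* dependent must live in a successor thread */
        thread_of[i] = current_thread;
    }

    for (int i = 0; i < N; i++)
        printf("thread %d: %s\n", thread_of[i], dag[i].name);
    return 0;
}
```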
international conference on parallel architectures and compilation techniques | 1997
Xinan Tang; Rakesh Ghiya; Laurie J. Hendren; Guang R. Gao
Traditional compiler optimizations such as loop invariant removal and common sub-expression elimination are standard in all optimizing compilers. The purpose of this paper is to present new versions of these optimizations that apply to programs using dynamically allocated data structures, and to show the effect of these optimizations on the performance of multithreaded programs. We show how heap pointer analyses can be used to support better dependence testing, new applications of the above traditional optimizations, and high-quality code generation for multithreaded architectures. We have implemented these analyses and optimizations in the EARTH-C compiler to study their impact on the performance of generated multithreaded code. We provide both static and dynamic measurements showing the effect of the optimizations applied individually and together. We note several general trends, discuss the performance trade-offs, and suggest when specific optimizations are generally beneficial.
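The before/after pair below illustrates the kind of heap-aware transformation the paper targets; the example and names are made up, and the real compiler works on its own intermediate representation rather than source-to-source C.

```c
/* Illustrative example of loop-invariant removal enabled by heap pointer
 * analysis; not taken from the paper. */
#include <stdio.h>

typedef struct vec { int len; int *data; } vec;

/* Before: p->len and p->data are re-read every iteration, because without
 * points-to information the store through q->data might alias them. */
int sum_before(vec *p, vec *q) {
    int s = 0;
    for (int i = 0; i < p->len; i++) {
        q->data[i] = i;        /* possible alias with p->len / p->data? */
        s += p->data[i];
    }
    return s;
}

/* After: if heap analysis proves p and q->data never overlap, the loads of
 * p->len and p->data are loop invariant and can be hoisted. */
int sum_after(vec *p, vec *q) {
    int s = 0;
    int n  = p->len;           /* hoisted loop-invariant load */
    int *d = p->data;          /* hoisted common sub-expression */
    for (int i = 0; i < n; i++) {
        q->data[i] = i;
        s += d[i];
    }
    return s;
}

int main(void) {
    int a[3] = {1, 2, 3}, b[3] = {0, 0, 0};
    vec p = {3, a}, q = {3, b};
    printf("%d %d\n", sum_before(&p, &q), sum_after(&p, &q));
    return 0;
}
```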
Journal of Parallel and Distributed Computing | 1999
Xinan Tang; Guang R. Gao
Fine-grain multithreaded architectures expose an enormous amount of parallelism that can be used to cover latencies. Managing such a degree of parallelism by hand is a demanding task for the programmer, so to use multithreaded architectures efficiently it is essential to have compiler support for automatically partitioning programs into threads. This paper solves a fundamental problem in compiling for multithreaded architectures: automatically partitioning a program into threads. The focus of such partitioning is to overlap the remote communication latency and minimize the total execution time. We first formulate the partitioning problem based on a multithreaded execution cost model. Then we prove that this formulation is NP-hard. Therefore, we propose two heuristic thread-partitioning methods to solve this problem in practice. The advanced partitioning algorithm is a novel extension of list scheduling, and it takes advantage of the cost model to generate near-optimum partitioning results. The remote-path-based partitioning algorithm is a simplified version of the advanced one that is easier to implement in a compiler. The two partitioning algorithms were implemented in a thread-partitioning testbed and a research EARTH-C compiler, respectively. The experimental results show that both partitioning algorithms are effective at generating efficient threaded code, and that code generated by the compiler is comparable to hand-written code.
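Schematically, the formulation described in the abstract can be restated as below; the notation is mine and the cost model is deliberately left abstract.

```latex
% Schematic restatement (notation is mine, not the paper's).  Given the program
% dependence graph G = (V, E) and the set \Pi(G) of partitions of V into threads
% in which every consumer of a remote result lies in a thread that starts only
% after that result has been signalled, the compiler seeks
\[
  P^{*} \;=\; \operatorname*{arg\,min}_{P \,\in\, \Pi(G)} \; T_{\mathrm{model}}(P),
\]
% where T_model(P) charges instruction execution, thread-switch overhead, and any
% remote latency the partition fails to overlap.  Computing P^{*} exactly is
% shown to be NP-hard, which motivates the two heuristics.
```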
acm symposium on parallel algorithms and architectures | 1998
Xinan Tang; Guang R. Gao
Adequate compiler support is essential to take advantage of emerging multithreaded architectures. In this paper, we address two important questions in thread partitioning, which is a key step in compiler design for multithreaded architectures. The questions in which we are interested are: how "hard" is it to partition threads, and how "bad" can a heuristic partitioning algorithm be? We propose a cost model for both multithreaded machines and user programs, and we formulate the thread partitioning problem as an optimization problem. Then, we answer the above two questions by proving that: 1) for the class of programs and architecture models we are interested in, the problem of partitioning threads for minimum execution time is NP-hard; 2) the run length produced by any list-scheduling-based thread partitioning algorithm is at most twice as long as that of an optimal solution.
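The second result can be written compactly; the notation below is mine, not the paper's.

```latex
% T_LS denotes the run length of a partition produced by any list-scheduling-based
% thread partitioning algorithm, T_OPT that of an optimal partition.
% Result 2) of the abstract is the bound
\[
  T_{\mathrm{LS}} \;\le\; 2\, T_{\mathrm{OPT}} .
\]
```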
ieee international conference on high performance computing data and analytics | 2000
José Nelson Amaral; Guang R. Gao; Erturk Dogan Kocalar; Patrick O'Neill; Xinan Tang
The development of fine-grain multi-threaded program execution models has created an interesting challenge: how to partition a program into threads that can exploit machine parallelism, achieve latency tolerance, and maintain reasonable locality of reference? A successful algorithm must produce a thread partition that best utilizes multiple execution units on a single processing node and handles long and unpredictable latencies. In this paper, we introduce a new thread partitioning algorithm that can meet the above challenge for a range of machine architecture models. A quantitative affinity heuristic is introduced to guide the placement of operations into threads. This heuristic addresses the trade-off between exploiting parallelism and preserving locality. The algorithm is surprisingly simple due to the use of a time-ordered event list to account for the activities of the multiple execution units. We have implemented the proposed algorithm and our experiments, performed on a wide range of examples, have demonstrated its efficiency and effectiveness.
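A toy version of affinity-guided placement is sketched below; the scoring rule (a fixed affinity bonus minus a load penalty) is an assumption made purely for illustration, and the time-ordered event-list machinery of the actual algorithm is omitted.

```c
/* Toy affinity-guided placement: not the published heuristic, just the shape
 * of the parallelism-versus-locality trade-off it manages. */
#include <stdio.h>

#define NTHREADS 2
#define NOPS     5

int main(void) {
    /* affinity[i][t]: how many predecessors of operation i already sit in
       thread t (a hand-made toy input) */
    int affinity[NOPS][NTHREADS] = { {0,0}, {1,0}, {0,1}, {2,0}, {0,2} };
    int cost[NOPS]               = {  2,     1,     3,     1,     2   };
    int load[NTHREADS]           = {  0,     0 };

    for (int i = 0; i < NOPS; i++) {
        int best = 0, best_score = -1000;
        for (int t = 0; t < NTHREADS; t++) {
            /* reward locality (affinity), penalize loading an already busy unit */
            int s = 3 * affinity[i][t] - load[t];
            if (s > best_score) { best_score = s; best = t; }
        }
        load[best] += cost[i];
        printf("op %d -> thread %d (load now %d)\n", i, best, load[best]);
    }
    return 0;
}
```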
international conference on supercomputing | 2000
Gary M. Zoppetti; Gagan Agrawal; Lori L. Pollock; José Nelson Amaral; Xinan Tang; Guang R. Gao
Multithreaded architectures are emerging as an important class of parallel machines. By allowing fast context switching between threads on the same processor, these systems hide communication and synchronization latencies and allow scalable parallelism for dynamic and irregular applications. Thread partitioning is the most important task in compiling high-level languages for multithreaded architectures. Non-preemptive multithreaded architectures, which can be built from off-the-shelf components, require that if a thread issues a potentially remote memory request, then any statement that is dependent upon this request must be in a separate thread. When performing thread partitioning on codes that use pointer-based recursive data structures, it is often difficult to extract accurate dependence information. As a result, threads of unnecessarily small granularity get generated, which, because of thread switching costs, leads to increased execution time. In this paper, we present three techniques that lead to improved extraction and representation of dependence information in the presence of structured control flow, references through fields of structures, and pointer-based data structures. The benefit of these techniques is the generation of coarser-grained threads and, therefore, decreased execution time. Our experiments were performed using the EARTH-C compiler and the EARTH multithreaded architecture model emulated on both a cluster of Pentium PCs and a distributed memory multiprocessor. On our set of six pointer-based programs, these techniques reduced the static number of threads by 38%. Reductions in execution times ranged from 16% to 45% on the four programs for which we measured runtime performance.
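The hand-made example below shows why dependence precision changes thread granularity; it is not drawn from the paper's benchmark set.

```c
/* Illustration only: how imprecise dependence information over a pointer-based
 * recursive structure forces fine-grain threads. */
#include <stdio.h>

typedef struct node { struct node *left, *right; int key; } node;

/* count the nodes of a binary tree */
int count(node *t) {
    if (t == NULL) return 0;
    int l = count(t->left);    /* potentially remote traversal */
    int r = count(t->right);   /* potentially remote traversal */
    /* A field-insensitive analysis may assume t->left and t->right can reach
     * the same heap cells, serializing the two calls and pushing the final sum
     * into yet another tiny thread.  With field- and structure-aware dependence
     * information the two traversals are known to be independent, so both
     * remote calls can be issued from one coarser thread and only the sum has
     * to wait for their results. */
    return l + r + 1;
}

int main(void) {
    node leaves[2] = { {0, 0, 1}, {0, 0, 2} };
    node root = { &leaves[0], &leaves[1], 0 };
    printf("%d nodes\n", count(&root));
    return 0;
}
```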
international parallel processing symposium | 1999
Shigeru Kusakabe; Kentaro Inenaga; Makoto Amamiya; Xinan Tang; Andres Marquez; Guang R. Gao
The combination of a language with fine-grain implicit parallelism and a dataflow evaluation scheme is suitable for high-level programming on massively parallel architectures. We are developing a compiler of V, a non-strict functional programming language, for EARTH (Efficient Architecture for Running THreads). Our compiler generates code in Threaded-C, which is a lower-level programming language for EARTH. We have developed translation rules and integrated them into the compiler. Since overhead caused by fine-grain processing may degrade performance for programs with little parallelism, we have adopted a thread merging rule. The preliminary performance results are encouraging. Although further improvement is required for non-strict data structures, some code generated from V programs by our compiler achieved performance comparable to that of hand-written Threaded-C code.
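A toy rendering of a thread-merging pass is sketched below; the merge condition (fuse adjacent fine-grain threads unless the boundary waits on a remote result) is a simplification assumed for illustration, not the compiler's actual rule.

```c
/* Toy thread merging: adjacent fine-grain threads are fused when the boundary
 * between them does not wait on a remote/split-phase result, removing the
 * switch overhead at that boundary. */
#include <stdio.h>

#define N 5

int main(void) {
    /* fine-grain threads in execution order; boundary_is_remote[i] says whether
       thread i ends by waiting on a remote result */
    const char *thread_name[N] = { "t0", "t1", "t2", "t3", "t4" };
    int boundary_is_remote[N]  = {  0,    1,    0,    0,    1   };

    int merged_id[N];
    int id = 0;
    for (int i = 0; i < N; i++) {
        merged_id[i] = id;
        /* keep the boundary (start a new merged thread) only when the previous
           fine-grain thread really has to wait for a remote result */
        if (boundary_is_remote[i])
            id++;
    }

    for (int i = 0; i < N; i++)
        printf("%s -> merged thread %d\n", thread_name[i], merged_id[i]);
    return 0;
}
```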
Innovative Architecture for Future Generation High-Performance Processors and Systems | 1998
Shigeru Kusakabe; Kentaro Inenaga; Makoto Amamiya; Xinan Tang; Andres Marquez; Guang R. Gao
The combination of a language with fine-grain implicit parallelism and a dataflow evaluation scheme is suitable for high-level programming on massively parallel architectures. We are developing a compiler of V, a non-strict functional programming language, for EARTH (Efficient Architecture for Running THreads). Our compiler generates code in Threaded-C, which is a lower-level programming language for EARTH. We have developed translation rules and integrated them into the compiler. While EARTH directly supports fine-grain thread execution, thread-level optimization by the compiler is also effective on EARTH. The preliminary performance results are encouraging, although further improvement is required for non-strict data structures. Some code generated from V programs by our compiler achieved performance comparable to that of hand-written Threaded-C code.
international conference on parallel architectures and compilation techniques | 1995
Herbert H. J. Hum; Olivier Maquelin; Kevin B. Theobald; Xinmin Tian; Xinan Tang; Guang R. Gao; Phil Cupryk; Nasser Elmasri; Laurie J. Hendren; Alberto Jimenez; Shoba Krishnan; Andres Marquez; Shamir Merali; Shashank S. Nemawarkar; Prakash Panangaden; Xun Xue; Yingchun Zhu