Ziang Hu
University of Delaware
Publication
Featured research published by Ziang Hu.
international symposium on computer architecture | 2007
Weirong Zhu; Vugranam C. Sreedhar; Ziang Hu; Guang R. Gao
Efficient fine-grain synchronization is extremely important to effectively harness the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the level of granularity to which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronization between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience with a set of benchmarks with different workload characteristics.
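The core observation, that only a few addresses are in active synchronization at once, can be sketched as a small buffer of per-address lock states. This is an illustrative software model only, with hypothetical names and sizes; the paper's SSB is a hardware structure with richer state than shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the SSB idea: a small buffer tracks
 * synchronization state only for addresses actively being
 * synchronized. Names, sizes, and fields are hypothetical. */
#define SSB_ENTRIES 8

typedef struct {
    uintptr_t addr;  /* memory word being synchronized */
    int locked;      /* 1 if a thread holds the word's lock */
    int in_use;      /* entry currently tracks an address */
} ssb_entry;

static ssb_entry ssb[SSB_ENTRIES];

/* Try to acquire a fine-grain lock on a word. Returns 1 on success,
 * 0 if the word is already locked or the buffer is full. */
int ssb_lock(uintptr_t addr) {
    int free_slot = -1;
    for (int i = 0; i < SSB_ENTRIES; i++) {
        if (ssb[i].in_use && ssb[i].addr == addr)
            return ssb[i].locked ? 0 : (ssb[i].locked = 1);
        if (!ssb[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return 0;  /* buffer full: a real SSB falls back to a software path */
    ssb[free_slot] = (ssb_entry){ addr, 1, 1 };
    return 1;
}

/* Release the lock and free the buffer entry for reuse. */
void ssb_unlock(uintptr_t addr) {
    for (int i = 0; i < SSB_ENTRIES; i++)
        if (ssb[i].in_use && ssb[i].addr == addr)
            ssb[i] = (ssb_entry){ 0, 0, 0 };
}
```

Because entries are recycled on unlock, the buffer stays small even though the program may synchronize on many distinct addresses over its lifetime.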
international parallel and distributed processing symposium | 2005
J. del Cuvillo; Weirong Zhu; Ziang Hu; Guang R. Gao
This paper presents the design and implementation of a thread virtual machine, called TNT (or TiNy-Threads), for the IBM Cyclops64 architecture (the latest Cyclops architecture, which employs a unique multiprocessor-on-a-chip design with a very large number of hardware thread units and embedded memory) as the cornerstone of the C64 system software. We highlight how high efficiency is achieved by mapping (and matching) the TNT thread model directly onto Cyclops ISA features, assisted by a native TNT thread runtime library. Major results of our experimental study demonstrate the good efficiency, scalability and usability of our TNT model and implementation.
international parallel and distributed processing symposium | 2007
Long Chen; Ziang Hu; Junmin Lin; Guang R. Gao
The rapid revolution in microprocessor chip architecture due to multi-core technology is presenting unprecedented challenges to application developers as well as system software designers: how to best exploit the parallelism potential of such multi-core architectures? In this paper, we report an in-depth study of these challenges based on our experience optimizing the fast Fourier transform (FFT) on the IBM Cyclops-64 chip architecture - a large-scale multi-core chip architecture consisting of 160 thread units, associated memory banks and an interconnection network that connects them in a shared memory organization. We demonstrate how multi-core architectures like the C64 can achieve a high-performance implementation of FFT in both the 1D and 2D cases. We analyze the optimization challenges and opportunities, including problem decomposition, load balancing, work distribution, and data reuse, together with the exploitation of C64 architecture features such as the multi-level memory hierarchy and large register files. Furthermore, the lessons learned during the hand-tuning process have provided valuable guidance in our compiler optimization design and implementation. The main contributions of this paper include: 1) our study demonstrates that successful optimization for C64-like large-scale multi-core architectures requires a careful analysis that identifies certain domain-specific features of a target application (e.g. FFT) and matches them well with key multi-core architecture features; 2) our hand-tuned optimizations provide quantitative evidence of the importance of each optimization identified in 1); 3) automatic optimization by our compiler, whose design and implementation is guided by the feedback from 1) and 2), shows excellent results that are often comparable to the results of our time-consuming hand-tuned code.
ieee international conference on high performance computing data and analytics | 2006
Juan del Cuvillo; Weirong Zhu; Ziang Hu; Guang R. Gao
This paper presents the initial design of the Cyclops-64 (C64) system software infrastructure and tools under development as a joint effort between IBM T.J. Watson Research Center, ETI Inc. and the University of Delaware. The C64 system is the latest version of the Cyclops cellular architecture; it consists of a large number of compute nodes, each of which employs a multiprocessor-on-a-chip architecture with 160 hardware thread units. The first version of the C64 system software has been developed and is now under evaluation. The current version of the C64 software infrastructure includes a C64 toolchain (compiler, linker, functionally accurate simulator, runtime thread library, etc.) and other tools for system control (system initialization, diagnostics and recovery, job scheduler, program launching, etc.). This paper focuses on the following aspects of the C64 system software: (1) the C64 software toolchain; (2) the C64 Thread Virtual Machine (C64 TVM), with emphasis on TiNy Threads™, the implementation of the C64 TVM; (3) the system software for host control. In addition, we illustrate, through two case studies, what an application developer can expect from the C64 architecture as well as some advantages of this architecture, in particular, how it provides a cost-effective solution. Depending on the application, a C64 chip performs 5 to 35 times faster than common off-the-shelf microprocessors.
international conference on parallel processing | 2006
Ziang Hu; Juan del Cuvillo; Weirong Zhu; Guang R. Gao
This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area. This paper presents an in-depth case study of a collection of well-known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication.
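The central idea, exploiting the memory hierarchy by keeping a working set in fast on-chip storage, is commonly expressed as blocked (tiled) matrix multiplication. This is a generic sketch, not the paper's tuned kernel: the tile size B is illustrative, standing in for a tile sized to fit C64's on-chip memory.

```c
/* Blocked (tiled) dense matrix multiply, C = A * Bm.
 * Tiling keeps a B x B working set resident in a fast memory level;
 * N and B are illustrative compile-time sizes (B must divide N). */
#define N 8
#define B 4   /* hypothetical tile edge chosen to fit on-chip storage */

void matmul_tiled(double A[N][N], double Bm[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)       /* one tile triple */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            C[i][j] += A[i][k] * Bm[k][j];
}
```

On a cache-based machine the tiles stay resident implicitly; on a scratchpad-based design like C64 the same structure is paired with explicit copies into on-chip memory, which this sketch omits.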
languages and compilers for parallel computing | 2003
Hongbo Yang; R. Govindarajan; Guang R. Gao; Ziang Hu
Recent research results show that conventional hardware-only cache solutions result in unsatisfactory cache utilization for both regular and irregular applications. To overcome this problem, a number of architectures introduce instruction hints to assist cache replacement. For example, the Intel Itanium architecture augments memory-accessing instructions with cache hints to distinguish data that will be referenced in the near future from the rest. With the availability of such methods, the performance of the underlying cache architecture critically depends on the ability of the compiler to generate code with appropriate cache hints. In this paper we formulate this problem – giving cache hints to memory instructions such that the cache miss rate is minimized – as a 0/1 knapsack problem, which can be efficiently solved using a dynamic programming algorithm. The proposed approach has been implemented in our compiler testbed and evaluated on a set of scientific computing benchmarks. Initial results show that our approach is effective in reducing the cache miss rate and improving program performance.
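The 0/1 knapsack formulation can be sketched with the standard dynamic program. The mapping shown in the comments is an assumption for illustration, not the paper's exact model: items stand for memory instructions that may keep their data cached, weights model cache space consumed, and values model expected miss reduction.

```c
#include <string.h>

/* Standard 0/1 knapsack by dynamic programming over capacities.
 * Assumed mapping (illustrative): item i = a memory instruction,
 * weight[i] = cache space its data occupies, value[i] = estimated
 * miss reduction from keeping it cached, cap = cache size. */
#define MAX_CAP 128

int knapsack(int n, const int weight[], const int value[], int cap) {
    int best[MAX_CAP + 1];            /* best[c] = max value within capacity c */
    memset(best, 0, sizeof best);
    for (int i = 0; i < n; i++)
        for (int c = cap; c >= weight[i]; c--) {  /* reverse scan: each item once */
            int v = best[c - weight[i]] + value[i];
            if (v > best[c]) best[c] = v;
        }
    return best[cap];
}
```

Instructions chosen by the optimum would be emitted with a "keep cached" hint and the rest with a bypass-style hint; runtime is O(n * cap), which is cheap at compile time for realistic cache sizes.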
international parallel and distributed processing symposium | 2003
Guang R. Gao; Kevin B. Theobald; R. Govindarajan; Clement Leung; Ziang Hu; Haiping Wu; Jizhu Lu; J. del Cuvillo; Adeline Jacquet; Vincent Janot; Thomas L. Sterling
Future high-end computers which promise very high performance require sophisticated program execution models and languages in order to deal with very high latencies across the memory hierarchy and to exploit massive parallelism. This paper presents our progress in ongoing research toward this goal. Specifically, we develop a suitable program execution model, a high-level programming notation which shields the application developer from the complexities of the architecture, and a compiler and runtime system based on the underlying models. In particular, we propose fine-grain multithreading and thread percolation as key components of our program execution model. We investigate implementing these models and systems on novel architectures such as the HTMT architecture and IBM's Blue Gene. Also, we report early performance prediction of thread percolation and its impact on execution time.
network and parallel computing | 2005
Yanwei Niu; Ziang Hu; Kenneth E. Barner; Guang R. Gao
This paper focuses on the Cyclops64 computer architecture and presents an analytical model and performance simulation results for the preloading and loop unrolling approaches to optimizing the performance of the SVD (Singular Value Decomposition) benchmark. A performance model for dissecting the total execution cycles is presented. Data preloading, using either "memcpy" or hand-optimized "inline" assembly code, and the loop unrolling approach are implemented and compared with each other in terms of the total number of memory access cycles. The key idea is to preload data from off-chip to on-chip memory and store the data back after the computation. These approaches reduce the total memory access cycles and can thus improve the benchmark performance significantly.
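The preload-compute-writeback pattern the abstract describes can be sketched as follows: copy a block from (slow) off-chip memory into a (fast) on-chip scratch buffer with memcpy, run an unrolled compute loop on the copy, then store results back. The buffers and the toy computation are illustrative stand-ins for C64's memory levels and the SVD kernels.

```c
#include <string.h>

/* Preload a block on-chip, compute on it with an unrolled loop,
 * then write it back. BLOCK is illustrative; n must be a multiple
 * of BLOCK for this sketch. */
#define BLOCK 64

void scale_block(double *offchip, int n, double factor) {
    double onchip[BLOCK];  /* stand-in for on-chip scratchpad memory */
    for (int base = 0; base < n; base += BLOCK) {
        memcpy(onchip, offchip + base, BLOCK * sizeof(double));  /* preload */
        for (int i = 0; i < BLOCK; i += 4) {  /* unrolled by 4 */
            onchip[i]     *= factor;
            onchip[i + 1] *= factor;
            onchip[i + 2] *= factor;
            onchip[i + 3] *= factor;
        }
        memcpy(offchip + base, onchip, BLOCK * sizeof(double));  /* writeback */
    }
}
```

The bulk copies replace many scattered off-chip accesses with two sequential block transfers per tile, while the unrolling reduces loop overhead and exposes independent operations to the scheduler, which is where the cycle savings come from.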
international parallel and distributed processing symposium | 2002
Guang R. Gao; Kevin B. Theobald; Ziang Hu; Haiping Wu; Jizhu Lu; Thomas L. Sterling; Kevin Pingali; Paul Stodghill; Rick Stevens; Mark Hereld
Future high-end computers will offer great performance improvements over today's machines, enabling applications of far greater complexity. However, designers must solve the challenge of exploiting massive parallelism efficiently in the face of very high latencies across the memory hierarchy. We believe the key to meeting this challenge is the design and implementation of new models and languages which address the problems of parallelism and latency on such machines. This paper presents an overview of our ongoing research toward this goal. Specifically, we will develop a suitable program execution model, a high-level programming notation, and a compiler and runtime system based on the underlying models. These are based on our previous work in parallel multithreaded systems, but are suitably enhanced to meet the needs of future high-end computers.
international parallel and distributed processing symposium | 2007
Ge Gan; Ziang Hu; J. del Cuvillo; Guang R. Gao
The IBM Cyclops-64 (C64) chip employs a multithreaded architecture that integrates a large number of hardware thread units on a single chip. A cellular supercomputer is being developed based on a 3D-mesh connection of the C64 chips. This paper introduces the Cyclops datagram protocol (CDP) developed for the C64 supercomputer system. CDP is inspired by the TCP/IP protocol, yet simpler and more compact. The implementation of CDP leverages the abundant hardware thread-level parallelism provided by the C64 multithreaded architecture. The main contributions of this paper are: (1) we have completed a design and implementation of CDP that is used as the fundamental communication infrastructure for the C64 supercomputer system; (2) CDP successfully exploits the massive thread-level parallelism provided by the C64 hardware, achieving good performance scalability; (3) CDP is quite efficient: its peak throughput reaches 884 Mbps on gigabit Ethernet, even when running at user level on a single-processor Linux machine; (4) extensive application test cases pass, and no reliability problems have been reported.
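To make "simpler and more compact than TCP/IP" concrete, here is a hypothetical datagram header in the spirit of CDP, with an Internet-style ones'-complement checksum. The field layout and names are assumptions for illustration; the abstract does not specify CDP's actual format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compact datagram header; the real CDP layout is not
 * given in the abstract. Four 16-bit fields, no padding. */
typedef struct {
    uint16_t src_node;   /* source node in the 3D mesh */
    uint16_t dst_node;   /* destination node */
    uint16_t length;     /* payload length in bytes */
    uint16_t checksum;   /* ones'-complement sum, Internet-checksum style */
} cdp_header;

/* Compute the 16-bit ones'-complement checksum over the header,
 * with the checksum field treated as zero (as in the IP checksum). */
uint16_t cdp_checksum(cdp_header h) {
    h.checksum = 0;                       /* operate on a zeroed local copy */
    const uint16_t *w = (const uint16_t *)&h;
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof h / 2; i++)
        sum += w[i];
    while (sum >> 16)                     /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A fixed-size header with a cheap checksum keeps per-packet processing small, which matters when many hardware threads each handle a share of the packet stream.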