Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hongzhang Shan is active.

Publications


Featured research published by Hongzhang Shan.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors

Dongming Jiang; Hongzhang Shan; Jaswinder Pal Singh

The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal or more commodity-based communication architectures. This paper studies this issue of performance portability, with the commodity communication architecture of interest being page-grained shared virtual memory (SVM). We begin with applications that perform well on moderate-scale hardware cache-coherent systems and find that they do not perform nearly as well on SVM systems. Then, we examine whether and how the applications can be improved for SVM systems, through data structuring or algorithmic enhancements, and the nature and difficulty of the optimizations. Finally, we examine the impact of the successful optimizations on hardware-coherent platforms themselves, to see whether they are helpful, harmful or neutral on those platforms. We develop a systematic methodology to explore optimizations in different structured classes. The results, and the difficulty of the optimizations, provide insight not only into performance portability but also into the viability of SVM as a platform for these types of applications.
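To make the kind of data-structuring change such restructuring involves concrete, the sketch below pads per-processor data to page boundaries so that writes by different processors never land on the same virtual-memory page, avoiding page-level false sharing under page-grained SVM. This is a minimal, hypothetical example assuming a 4 KB page size and a GCC-style alignment attribute; it is not code from the paper.

```c
/* Hypothetical illustration: keep each processor's private counter on its
 * own page so that updates by different processors do not invalidate the
 * same SVM page (page-level false sharing). */

#define PAGE_SIZE 4096          /* assumed SVM page size */
#define NPROCS    16

/* Naive layout: all counters packed on one page -> heavy SVM traffic. */
double counters_packed[NPROCS];

/* Restructured layout: one page per processor's counter. */
typedef struct {
    double value;
    char   pad[PAGE_SIZE - sizeof(double)];
} padded_counter_t;

padded_counter_t counters_padded[NPROCS] __attribute__((aligned(PAGE_SIZE)));

void accumulate(int my_pe, double contribution)
{
    /* Each processor touches only its own page-aligned slot. */
    counters_padded[my_pe].value += contribution;
}
```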


International Journal of Parallel Programming | 2001

A Comparison of MPI, SHMEM and Cache-Coherent Shared Address Space Programming Models on Tightly-Coupled Multiprocessors

Hongzhang Shan; Jaswinder Pal Singh

We compare the performance of three major programming models on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular, predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of tightly-coupled multiprocessor platform.
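For a sense of how a common communication pattern is expressed under the message-passing model, the sketch below shows a 1-D nearest-neighbour ghost-cell exchange in MPI. Buffer sizes and names are illustrative assumptions, not taken from the paper's benchmark codes.

```c
/* Illustrative MPI ghost-cell exchange for a 1-D block decomposition. */
#include <mpi.h>

#define N 1024   /* interior cells per rank; halo cells at [0] and [N+1] */

void exchange_ghosts(double *local)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my first interior cell left, receive right neighbour's into halo. */
    MPI_Sendrecv(&local[1], 1, MPI_DOUBLE, left,  0,
                 &local[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my last interior cell right, receive left neighbour's into halo. */
    MPI_Sendrecv(&local[N], 1, MPI_DOUBLE, right, 0,
                 &local[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```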


Archive | 2007

Understanding and Mitigating Multicore Performance Issues on the AMD Opteron Architecture

John M. Levesque; Jeff Larkin; Martyn Foster; Joe Glenski; Garry Geissler; Stephen Whalen; Brian Waldecker; Jonathan Carter; David Skinner; Helen He; Harvey Wasserman; John Shalf; Hongzhang Shan; Erich Strohmaier

Over the past 15 years, microprocessor performance has doubled approximately every 18 months through increased clock rates and processing efficiency. In the past few years, clock frequency growth has stalled, and microprocessor manufacturers such as AMD have moved towards doubling the number of cores every 18 months in order to maintain historical growth rates in chip performance. This document investigates the ramifications of multicore processor technology on the new Cray XT4 systems based on AMD processor technology. We begin by walking through the AMD single-core, dual-core and upcoming quad-core processor architectures. This is followed by a discussion of methods for collecting performance counter data to understand code performance on the Cray XT3 and XT4 systems. We then use the performance counter data to analyze the impact of multicore processors on the performance of microbenchmarks such as STREAM, application kernels such as the NAS Parallel Benchmarks, and full application codes that comprise the NERSC-5 SSP benchmark suite. We explore compiler options and software optimization techniques that can mitigate the memory bandwidth contention that can reduce computing efficiency on multicore processors. The last section provides a case study of applying the dual-core optimizations to the NAS Parallel Benchmarks to dramatically improve their performance.
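The STREAM triad microbenchmark mentioned above is essentially the loop below (shown here with OpenMP threading and an assumed array size as an illustrative sketch). It performs almost no arithmetic per byte moved, so per-core throughput drops as soon as the cores on a socket contend for the shared memory controller.

```c
/* STREAM-triad-like kernel: memory-bandwidth bound, so adding cores on one
 * socket exposes contention for the memory controller rather than extra
 * compute throughput. Array size is an assumption, chosen to defeat caches. */
#include <omp.h>

#define N (1L << 24)

void triad(double *a, const double *b, const double *c, double scalar)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];   /* 2 loads + 1 store per iteration */
}
```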


International Conference on Supercomputing | 1999

A comparison of MPI, SHMEM and cache-coherent shared address space programming models on the SGI Origin2000

Hongzhang Shan; Jaswinder Pal Singh

We compare the performance of three major programming models, namely a load-store cache-coherent shared address space (CC-SAS), message passing (MP) and the segmented SHMEM model, on a modern, 64-processor hardware cache-coherent machine, one of the two major types of platforms upon which high-performance computing is converging. We focus on applications that are either regular and predictable or at least do not require fine-grained dynamic replication of irregularly accessed data. Within this class, we use programs with a range of important communication patterns. We examine whether the basic parallel algorithm and communication structuring approaches needed for best performance are similar or different among the models, whether some models have substantial performance advantages over others as problem size and number of processors change, what the sources of these performance differences are, where the programs spend their time, and whether substantial improvements can be obtained by modifying either the application programming interfaces or the implementations of the programming models on this type of platform.
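The SHMEM model expresses the same kind of boundary exchange with one-sided puts into symmetric memory. The fragment below is a minimal OpenSHMEM-style sketch, complementing the MPI version shown earlier; array names and sizes are assumptions, and initialization via shmem_init() is omitted.

```c
/* Illustrative one-sided SHMEM version of a 1-D ghost-cell exchange.
 * The halo array must be symmetric, i.e. declared identically on every PE. */
#include <shmem.h>

#define N 1024
static double local[N + 2];   /* interior cells in [1..N], halos at [0] and [N+1] */

void exchange_ghosts_shmem(void)
{
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Push my boundary cells directly into the neighbours' halo slots. */
    if (me > 0)
        shmem_double_put(&local[N + 1], &local[1], 1, me - 1);
    if (me < npes - 1)
        shmem_double_put(&local[0], &local[N], 1, me + 1);

    shmem_barrier_all();      /* ensure all puts are globally complete */
}
```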


Lawrence Berkeley National Laboratory | 2004

Job scheduling in a heterogeneous grid environment

Leonid Oliker; Rupak Biswas; Hongzhang Shan; Warren Smith

Computational grids have the potential for solving large-scale scientific problems using heterogeneous and geographically distributed resources. However, a number of major technical hurdles must be overcome before this potential can be realized. One problem that is critical to effective utilization of computational grids is the efficient scheduling of jobs. This work addresses this problem by describing and evaluating a grid scheduling architecture and three job migration algorithms. The architecture is scalable and does not assume control of local site resources. The job migration policies use the availability and performance of computer systems, the network bandwidth available between systems, and the volume of input and output data associated with each job. An extensive performance comparison is presented using real workloads from leading computational centers. The results, based on several key metrics, demonstrate that the performance of our distributed migration algorithms is significantly greater than that of a local scheduling framework and comparable to a non-scalable global scheduling approach.
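A hypothetical sketch of how a communication-aware migration policy can weigh the factors the abstract mentions (site load and speed, network bandwidth, input/output data volume): estimate the turnaround time at each candidate site and pick the minimum. All field names, units and the simple linear cost model are assumptions for illustration, not the paper's actual algorithms.

```c
/* Illustrative migration heuristic: turnaround = queue wait + data transfer
 * + run time at each site; migrate the job to the site with the minimum. */
typedef struct {
    double queue_wait_s;       /* estimated wait in the site's local queue */
    double run_time_s;         /* estimated execution time on that site    */
    double bandwidth_mb_per_s; /* bandwidth between the job's data and site */
} site_estimate_t;

double estimated_turnaround(const site_estimate_t *s,
                            double input_mb, double output_mb)
{
    double transfer_s = (input_mb + output_mb) / s->bandwidth_mb_per_s;
    return s->queue_wait_s + transfer_s + s->run_time_s;
}

int pick_site(const site_estimate_t *sites, int nsites,
              double input_mb, double output_mb)
{
    int best = 0;
    for (int i = 1; i < nsites; i++)
        if (estimated_turnaround(&sites[i], input_mb, output_mb) <
            estimated_turnaround(&sites[best], input_mb, output_mb))
            best = i;
    return best;   /* index of the site with the smallest estimated turnaround */
}
```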


Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing | 1998

Parallel tree building on a range of shared address space multiprocessors: algorithms and application performance

Hongzhang Shan; Jaswinder Pal Singh

Irregular, particle-based applications that use trees, for example hierarchical N-body applications, are important consumers of multiprocessor cycles, and are argued to benefit greatly in programming ease from a coherent shared address space programming model. As more and more supercomputing platforms that can support different programming models become available to users, from tightly-coupled hardware-coherent machines to clusters of workstations or SMPs, it is important, if the shared address space model is to truly deliver on its ease-of-programming advantages, that it not only perform and scale well in the tightly-coupled case but also port well in performance across the range of platforms, as the message-passing model can. For tree-based N-body applications, this is currently not true: while the actual computation of interactions ports well, the parallel tree building phase can become a severe bottleneck on coherent shared address space platforms, in particular on platforms with less aggressive, commodity-oriented communication architectures, even though it takes less than 3 percent of the time in most sequential executions. The authors therefore investigate the performance of five parallel tree building methods in the context of a complete galaxy simulation on four very different platforms that support this programming model.
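To see why tree building is the hard part, the sketch below gives a minimal sequential octree insertion in the Barnes-Hut style (node layout and names are illustrative, not the paper's data structures, and distinct body positions are assumed). Every insertion descends from the root and may split a leaf, so naive concurrent insertion produces fine-grained, highly contended writes near the top of the tree, exactly the traffic that commodity-oriented communication layers handle poorly.

```c
/* Minimal sequential octree insertion (illustrative only). */
#include <stdlib.h>
#include <string.h>

typedef struct node {
    struct node *child[8];   /* NULL when the octant is empty        */
    double body[3];          /* body position if this node is a leaf */
    int has_body;            /* 1 = leaf holding exactly one body    */
} node_t;

static int octant(const double p[3], const double centre[3])
{
    return (p[0] > centre[0]) |
          ((p[1] > centre[1]) << 1) |
          ((p[2] > centre[2]) << 2);
}

/* Insert a body into the subtree rooted at *n, whose cubic cell has the
 * given centre and half-width. Assumes all body positions are distinct. */
static void insert(node_t *n, const double p[3],
                   const double centre[3], double half)
{
    if (n->has_body) {
        /* Occupied leaf: push the resident body one level down first. */
        double old[3];
        memcpy(old, n->body, sizeof old);
        n->has_body = 0;
        insert(n, old, centre, half);
    }
    int oct = octant(p, centre);
    if (n->child[oct] == NULL) {
        node_t *leaf = calloc(1, sizeof *leaf);
        memcpy(leaf->body, p, sizeof leaf->body);
        leaf->has_body = 1;
        n->child[oct] = leaf;
        return;
    }
    /* Descend into the child octant's sub-cell. */
    double c[3];
    for (int d = 0; d < 3; d++)
        c[d] = centre[d] + (((oct >> d) & 1) ? half / 2 : -half / 2);
    insert(n->child[oct], p, c, half / 2);
}
```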


International Parallel and Distributed Processing Symposium | 2001

Message passing vs. shared address space on a cluster of SMPs

Hongzhang Shan; Jaswinder Pal Singh; Leonid Oliker; Rupak Biswas

The emergence of scalable computer architectures using clusters of PCs (or PC-SMPs) with commodity networking has made them attractive platforms for high-end scientific computing. Currently, message passing (MP) and shared address space (SAS) are the two leading programming paradigms for these systems. MP has been standardized with MPI, and is the most common and mature parallel programming approach. However, MP code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and programming effort required for six applications under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications; however on certain classes of problems, SAS performance is competitive with MPI.
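The programming-effort gap the abstract describes is easy to see on an irregular gather: under a shared address space it is a plain indexed load, while under message passing the owner of each element must be computed and the value fetched explicitly. The fragment below is an illustrative sketch only; MPI one-sided calls are used for brevity, and the names and block distribution are assumptions.

```c
#include <mpi.h>

/* Shared address space: x is a globally shared array, so remote and local
 * elements are read the same way. */
double gather_sas(const double *x, const long *idx, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += x[idx[i]];
    return sum;
}

/* Message passing: the same gather needs owner computation plus explicit
 * communication (sketched with an MPI window exposing each rank's block). */
double gather_mp(MPI_Win win, const long *idx, long n, long elems_per_rank)
{
    double sum = 0.0, tmp;
    for (long i = 0; i < n; i++) {
        int owner      = (int)(idx[i] / elems_per_rank);      /* who holds it   */
        MPI_Aint disp  = (MPI_Aint)(idx[i] % elems_per_rank); /* where, locally */
        MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
        MPI_Get(&tmp, 1, MPI_DOUBLE, owner, disp, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(owner, win);
        sum += tmp;
    }
    return sum;
}
```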


Conference on High Performance Computing (Supercomputing) | 1999

Parallel Sorting on Cache-coherent DSM Multiprocessors

Hongzhang Shan; Jaswinder Pal Singh

The performance of parallel sorting is not well understood on hardware cache-coherent shared address space (CC-SAS) multiprocessors, which increasingly dominate the market for tightly-coupled multiprocessing. We study two high-performance parallel sorting algorithms, radix and sample sorting, under three major programming models, a load-store CC-SAS, message passing, and the segmented SHMEM model, on a 64-processor SGI Origin2000. We observe surprisingly good speedups on this demanding application. The performance of radix sort is greatly affected by the programming model and particular implementation used. Sample sort exhibits more uniform performance across programming models on this platform, but for larger data sets it is usually not as good as the best radix sort when each algorithm is allowed to use the programming model that suits it best. The best combination of algorithm and programming model is radix sorting under the SHMEM model for larger data sets and sample sorting under CC-SAS for smaller data sets.
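For reference, one counting pass of a parallel radix sort looks roughly like the sketch below (OpenMP is used here purely for brevity; the paper's implementations target CC-SAS, MPI and SHMEM, and names and digit width are assumptions). Each thread histograms its block of keys by the current digit; the per-thread histograms are then combined into global offsets before the permutation step, which is the communication-heavy part.

```c
/* One histogram pass of radix sort over the digit selected by `shift`. */
#include <string.h>
#include <omp.h>

#define RADIX_BITS 8
#define BUCKETS    (1 << RADIX_BITS)

void count_digit(const unsigned *keys, long n, int shift,
                 long hist[][BUCKETS], int nthreads)
{
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        memset(hist[t], 0, sizeof(long) * BUCKETS);   /* per-thread histogram */

        #pragma omp for schedule(static)
        for (long i = 0; i < n; i++)
            hist[t][(keys[i] >> shift) & (BUCKETS - 1)]++;
    }
    /* A prefix sum over hist[][] then gives each thread's write offsets
     * for the permutation (rank-and-permute) step. */
}
```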


Twelfth International Conference on Advances in Computing & Communications (ADCOM), Ahmedabad, India, December 15-18, 2004 | 2004

Scheduling in Heterogeneous Grid Environments: The Effects of Data Migration

Leonid Oliker; Rupak Biswas; Hongzhang Shan; Warren Smith

Computational grids have the potential for solving large-scale scientific problems using heterogeneous and geographically distributed resources. However, a number of major technical hurdles must be overcome before this goal can be fully realized. One problem critical to the effective utilization of computational grids is efficient job scheduling. Our prior work addressed this challenge by defining a grid scheduling architecture and several job migration strategies. The focus of this study is to explore the impact of data migration under a variety of demanding grid conditions. We evaluate our grid scheduling algorithms by simulating compute servers, various groupings of servers into sites, and inter-server networks, using real workloads obtained from leading supercomputing centers. Several key performance metrics are used to compare the behavior of our algorithms against reference local and centralized scheduling schemes. Results show the tremendous benefits of grid scheduling, even in the presence of input/output data migration, while highlighting the importance of utilizing communication-aware scheduling schemes.


Conference on High Performance Computing (Supercomputing) | 2000

A Comparison of Three Programming Models for Adaptive Applications on the Origin2000

Hongzhang Shan; Jaswinder Pal Singh; Leonid Oliker; Rupak Biswas

Collaboration


Dive into Hongzhang Shan's collaboration.

Top Co-Authors

Leonid Oliker (Lawrence Berkeley National Laboratory)
Rupak Biswas (Lawrence Berkeley National Laboratory)
Erich Strohmaier (Lawrence Berkeley National Laboratory)
Zhengji Zhao (Lawrence Berkeley National Laboratory)
Adetokunbo Adedoyin (Los Alamos National Laboratory)