Publication


Featured research published by Dongming Jiang.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors

Dongming Jiang; Hongzhang Shan; Jaswinder Pal Singh

The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal or more commodity-based communication architectures. This paper studies this issue of performance portability, with the commodity communication architecture of interest being page-grained shared virtual memory. We begin with applications that perform well on moderate-scale hardware cache-coherent systems, and find that they do not do so well on SVM systems. Then, we examine whether and how the applications can be improved for SVM systems, through data structuring or algorithmic enhancements, and the nature and difficulty of the optimizations. Finally, we examine the impact of the successful optimizations on hardware-coherent platforms themselves, to see whether they are helpful, harmful or neutral on those platforms. We develop a systematic methodology to explore optimizations in different structured classes. The results, and the difficulty of the optimizations, lend insight not only into performance portability but also into the viability of SVM as a platform for these types of applications.
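
To make the page-granularity issue concrete, here is a minimal sketch (ours, not code from the paper, with an assumed 4 KB page size) of one common restructuring: padding and aligning each processor's partition of a shared array to page boundaries, so that no two processors ever write the same SVM page.

    /* Sketch: avoiding page-level false sharing under page-grained SVM.
       Hypothetical illustration; the page size and array sizes are
       assumptions, and the padding math assumes the partition is not
       already an exact page multiple. */
    #include <stdio.h>
    #include <stdalign.h>

    #define PAGE_SIZE 4096   /* assumed SVM coherence granularity */
    #define NPROCS    16
    #define ELEMS     1000   /* elements each processor writes */

    /* Problematic layout: adjacent processors' partitions straddle the
       same page, so concurrent writes ping-pong whole pages. */
    double flat[NPROCS * ELEMS];

    /* Restructured layout: each partition padded to a page multiple and
       the array aligned to a page, so every page has a single writer. */
    #define PART_BYTES (((ELEMS * sizeof(double)) + PAGE_SIZE - 1) \
                        / PAGE_SIZE * PAGE_SIZE)
    typedef struct {
        double v[ELEMS];
        char pad[PART_BYTES - ELEMS * sizeof(double)];
    } partition_t;
    static alignas(PAGE_SIZE) partition_t padded[NPROCS];

    int main(void) {
        flat[0] = padded[0].v[0] = 1.0;  /* touch both layouts */
        printf("unpadded partition: %zu bytes (pages shared at seams)\n",
               ELEMS * sizeof(double));
        printf("padded partition:   %zu bytes (exact page multiple)\n",
               sizeof(partition_t));
        return 0;
    }

At cache-line granularity the same padding idea costs far less memory, which is part of why such restructurings can remain neutral or even helpful on hardware-coherent machines.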


International Symposium on Computer Architecture | 1999

Scaling application performance on a cache-coherent multiprocessor

Dongming Jiang; Jaswinder Pal Singh

Hardware-coherent, distributed shared address space systems are increasingly successful at moderate scale. However, it is unclear whether, or with how much difficulty, the performance of a load-store shared address space programming model scales to large processor counts on real applications. We examine this question using an aggressive case-study machine, the SGI Origin2000, at up to 128 processors. We show for the first time that scalable performance can indeed be achieved in this programming model on a wide range of applications, including challenging kernels like FFT. However, this does not come easily, even for applications considered to be already highly optimized, and is very often not simply a matter of increasing problem size. Rather, substantial further application restructuring is often needed, and it is usually quite algorithmic in nature. We examine how these restructurings compare with those needed for performance portability to shared virtual memory on clusters, and we comment on common programming guidelines for performance portability and scalability, as well as on how the programming difficulty compares with that of explicit message passing. We also examine where applications spend their time on this large machine, the impact of special hardware features that the machine provides, and the impact of mapping to the network topology.


Measurement and Modeling of Computer Systems | 1999

Evaluating synchronization on shared address space multiprocessors: methodology and performance

Sanjeev Kumar; Dongming Jiang; Rohit Chandra; Jaswinder Pal Singh

Synchronization is an area that exhibits rich hardware-software interactions in multiprocessors. It was studied extensively using microbenchmarks a decade ago. However, its performance implications are not well understood on modern systems or on real applications. We study the impact of synchronization primitives and algorithms on a modern, 64-processor, hardware-coherent shared address space multiprocessor: the SGI Origin2000. In addition to the actual results on a modern system, we examine the key methodological issues in studying synchronization, for both microbenchmarks and applications. We find that although the efficient hardware support (Fetch&Op) for synchronization provided on our machine usually helps lock and barrier microbenchmarks, it does not improve application performance when compared to good software algorithms that use the processor-provided LL-SC instructions. This is true even in applications that spend a significant amount of time in synchronization operations. More elaborate hardware support is unlikely to have a significant benefit either. From the applications' perspective, it is usually the waiting time due to load imbalance or serialization that dominates synchronization time, not the overhead of the synchronization operations themselves, even in apparently balanced cases where the overhead may be expected to be substantial.
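
For concreteness, the two kinds of primitive compared here can be sketched in portable C11 atomics (our illustration, not the study's code): atomic_fetch_add stands in for the machine's Fetch&Op in a ticket lock, while compare-and-swap serves as the portable analogue of LL-SC in a test-and-test-and-set lock.

    /* Sketch: two lock styles in portable C11 atomics. atomic_fetch_add
       plays the role of a Fetch&Op primitive; compare-and-swap is the
       portable analogue of LL-SC. Illustration only. */
    #include <stdatomic.h>

    /* Ticket lock: one Fetch&Op to take a ticket, then local spinning. */
    typedef struct { atomic_uint next, owner; } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next, 1);   /* Fetch&Op */
        while (atomic_load(&l->owner) != my)
            ;                                          /* spin until served */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->owner, 1);                /* pass the lock on */
    }

    /* Test-and-test-and-set lock built on CAS (LL-SC analogue). */
    typedef struct { atomic_int held; } tts_lock_t;

    void tts_acquire(tts_lock_t *l) {
        for (;;) {
            while (atomic_load(&l->held))
                ;                                      /* read-only spin */
            int expected = 0;                          /* "LL" the value */
            if (atomic_compare_exchange_weak(&l->held, &expected, 1))
                return;                                /* "SC" succeeded */
        }
    }

    void tts_release(tts_lock_t *l) {
        atomic_store(&l->held, 0);
    }

Both acquire paths cost only a few memory operations, which is consistent with the finding that waiting time, not primitive overhead, usually dominates synchronization time in applications.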


Measurement and Modeling of Computer Systems | 1998

A methodology and an evaluation of the SGI Origin2000

Dongming Jiang; Jaswinder Pal Singh

As hardware-coherent, distributed shared memory (DSM) multiprocessing becomes popular commercially, it is important to evaluate modern realizations to understand how they perform and scale for a range of interesting applications and to identify the nature of the key bottlenecks. This paper evaluates the SGI Origin2000---the machine that perhaps has the most aggressive communication architecture of the recent cache-coherent offerings---and, in doing so, articulates a sound methodology for evaluating real systems. We examine data access and synchronization microbenchmarks; speedups for different application classes, problem sizes and scaling models; detailed interactions and time breakdowns using performance tools; and the impact of special hardware support. We find that overall the Origin appears to deliver on the promise of cache-coherent shared address space multiprocessing, at least at the 32-processor scale we examine. The machine is quite easy to program for performance and has fewer organizational problems than previous systems we have examined. However, some important trouble spots are also identified, especially related to contention that is apparently caused by engineering decisions to share resources among processors.
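
Data-access microbenchmarks of the kind used in such evaluations are commonly built on dependent loads, so that each miss exposes the full memory latency. The following pointer-chasing sketch is our own illustration (sizes and iteration counts are arbitrary assumptions), not the paper's code.

    /* Sketch of a pointer-chasing latency microbenchmark: each load
       depends on the previous one, so misses cannot be overlapped.
       Uses POSIX clock_gettime for timing. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N     (1 << 20)     /* elements; sized past the cache under test */
    #define ITERS (1 << 24)

    int main(void) {
        size_t *ring = malloc(N * sizeof *ring);
        size_t *perm = malloc(N * sizeof *perm);
        /* Build a random permutation cycle so hardware prefetchers
           cannot predict the next address. */
        for (size_t i = 0; i < N; i++) perm[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < N; i++) ring[perm[i]] = perm[(i + 1) % N];

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile size_t p = 0;               /* volatile: keep the chase */
        for (long i = 0; i < ITERS; i++) p = ring[p];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("avg load-to-use latency: %.1f ns\n", ns / ITERS);
        free(ring); free(perm);
        return 0;
    }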


International Conference on Supercomputing | 1999

Application scaling under shared virtual memory on a cluster of SMPs

Dongming Jiang; Brian O'Kelley; Xiang Yu; Sanjeev Kumar; Angelos Bilas; Jaswinder Pal Singh

In this paper we examine how application performance scales on a state-of-the-art shared virtual memory (SVM) system on a cluster with 64 processors, comprising 4-way SMPs connected with a fast system area network. The protocol we use is home-based and takes advantage of general-purpose data movement and mutual exclusion support provided by a programmable network interface. We find that while the level of application restructuring needed is quite high compared to applications that perform well on a hardware-coherent system of this scale, and larger problem sizes are needed for good performance, SVM, surprisingly, performs quite well at the 64-processor scale for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end hardware-coherent system and often much more. We explore further application restructurings than those developed earlier for smaller-scale SVM systems, examine the main remaining system and application bottlenecks, and point out directions for future research.
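
Home-based protocols of this kind move data by twinning and diffing: before the first write to a page in an interval, the protocol saves a copy (the twin); at a release, it sends only the words that changed to the page's designated home. A minimal sketch of that step, assuming 4 KB pages and word-granularity comparison (our illustration, not the system's code):

    /* Sketch: computing and applying a diff between a dirty page and its
       twin, as in home-based protocols. Illustration only. */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    /* A diff is a list of (offset, value) pairs for words that changed. */
    typedef struct { uint32_t offset; uint32_t value; } diff_entry_t;

    /* Scan the page against its twin; emit changed words. The caller
       would ship these entries to the page's home node. */
    size_t make_diff(const uint32_t *page, const uint32_t *twin,
                     diff_entry_t *out) {
        size_t n = 0;
        for (uint32_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i])
                out[n++] = (diff_entry_t){ i, page[i] };
        return n;
    }

    /* Home side: apply a received diff to the master copy of the page. */
    void apply_diff(uint32_t *home_page, const diff_entry_t *d, size_t n) {
        for (size_t i = 0; i < n; i++)
            home_page[d[i].offset] = d[i].value;
    }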


ACM Transactions on Computer Systems | 2001

Accelerating shared virtual memory via general-purpose network interface support

Angelos Bilas; Dongming Jiang; Jaswinder Pal Singh

Clusters of symmetric multiprocessors (SMPs) are important platforms for high-performance computing. With the success of hardware cache-coherent distributed shared memory (DSM), a lot of effort has also been made to support the coherent shared-address-space programming model in software on clusters. Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the performance of shared virtual memory (SVM) is still far from that achieved on hardware DSM systems. The goal of this paper is to improve the performance of SVM on system area network clusters by considering communication and protocol layer interactions. We first examine the important communication system bottlenecks that stand in the way of improving the parallel performance of SVM clusters: in particular, which parameters of the communication architecture are most important to improve further relative to processor speed, which ones are already adequate on modern systems for most applications, and how this will change with technology in the future. We find that the most important communication subsystem cost to improve is the overhead of generating and delivering interrupts for asynchronous protocol processing. We then show that, by providing simple and general support for asynchronous message handling in a commodity network interface (NI) and by altering SVM protocols appropriately, protocol activity can be decoupled from asynchronous message handling, and the need for interrupts or polling can be eliminated. The NI mechanisms needed are generic, not SVM-dependent. We prototype these mechanisms and a synchronous home-based LRC protocol, called GeNIMA (GEneral-purpose Network Interface support for shared Memory Abstractions), on a cluster of SMPs with a programmable NI. We find that the performance improvements are substantial, bringing performance on a small-scale SMP cluster much closer to that of hardware-coherent shared memory for many applications, and we show the value of each of the mechanisms in different applications.
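
To illustrate the decoupling idea: if the NI can deposit a message directly into a known slot in the receiver's memory, no interrupt is needed, and the receiver can notice the message at its next synchronization point rather than polling continuously. A shared-memory analogue in C11 atomics (our sketch; GeNIMA's actual NI mechanisms are not reproduced here):

    /* Sketch: interrupt-free message delivery in the style of an NI
       deposit operation. The sender writes the payload and then sets a
       flag with release semantics; the receiver checks the flag only at
       synchronization points. Shared-memory analogue, not GeNIMA code. */
    #include <stdatomic.h>
    #include <string.h>

    #define SLOT_BYTES 256

    typedef struct {
        char payload[SLOT_BYTES];
        atomic_int full;            /* 0 = empty, 1 = message deposited */
    } deposit_slot_t;

    /* "NI deposit": data lands in the receiver's memory, no interrupt. */
    void deposit(deposit_slot_t *slot, const void *msg, size_t len) {
        memcpy(slot->payload, msg, len);
        atomic_store_explicit(&slot->full, 1, memory_order_release);
    }

    /* Receiver checks only when it reaches an acquire point (a lock or
       barrier), so no asynchronous handler is ever invoked. */
    int try_collect(deposit_slot_t *slot, void *buf, size_t len) {
        if (!atomic_load_explicit(&slot->full, memory_order_acquire))
            return 0;               /* nothing deposited yet */
        memcpy(buf, slot->payload, len);
        atomic_store_explicit(&slot->full, 0, memory_order_relaxed);
        return 1;
    }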


High-Performance Computer Architecture | 1999

Limits to the performance of software shared memory: a layered approach

Angelos Bilas; Dongming Jiang; Yuanyuan Zhou; Jaswinder Pal Singh

Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not very good on these systems. Three major layers of software (and hardware) stand between the end user and parallel performance, each with its own functionality and performance characteristics. They include the communication layer, the software protocol layer that supports the programming model, and the application layer. These layers provide a useful framework to identify the key remaining limitations and bottlenecks in software shared memory systems, as well as the areas where optimization efforts might yield the greatest performance improvements. This paper performs such an integrated study, using this layered framework, for two types of software distributed shared memory systems: page-based shared virtual memory (SVM) and fine-grained software systems (FG). For the two system layers (communication and protocol), we focus on the performance costs of basic operations in the layers rather than on their functionality. This is possible because their functionality is now fairly mature. The less mature application layer is treated through application restructuring. We examine the layers individually and in combination, understanding their implications for the two types of protocols and exposing the synergies among layers.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 1997

Improving parallel shear-warp volume rendering on shared address space multiprocessors

Dongming Jiang; Jaswinder Pal Singh

This paper presents a new parallel volume rendering algorithm and implementation, based on shear-warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed performance measurements on real machines and simulators to understand performance bottlenecks. This leads us to a new parallel implementation that substantially outperforms and out-scales the old one on a range of shared address space platforms, from bus-based centralized-memory machines to hardware-coherent distributed-memory machines to networks of computers connected by page-based shared virtual memory. The results demonstrate that real-time volume rendering is promising on general-purpose multiprocessors, and illustrate the utility of tool hierarchies, in conjunction with algorithmic and application knowledge, in understanding memory system interactions and improving parallel algorithms.
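
Shear-warp factors the viewing transform into a shear of the volume slices plus an inexpensive 2-D warp of the composited intermediate image. The inner loop composites sheared slices front to back, stopping early once a pixel is nearly opaque. A minimal sketch for one intermediate-image pixel, assuming non-premultiplied colors (our illustration, not the paper's renderer):

    /* Sketch: front-to-back compositing along sheared slices, the inner
       loop of shear-warp volume rendering. Illustration only. */
    typedef struct { float r, g, b, a; } rgba_t;

    /* Composite one intermediate-image pixel through nslices slices.
       samples[k] is the resampled voxel for this pixel in slice k,
       already offset by that slice's shear. */
    rgba_t composite_pixel(const rgba_t *samples, int nslices) {
        rgba_t out = {0, 0, 0, 0};
        for (int k = 0; k < nslices; k++) {          /* front to back */
            float w = (1.0f - out.a) * samples[k].a; /* remaining opacity */
            out.r += w * samples[k].r;
            out.g += w * samples[k].g;
            out.b += w * samples[k].b;
            out.a += w;
            if (out.a > 0.98f) break;                /* early termination */
        }
        return out;                                  /* 2-D warp follows */
    }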


Journal of Parallel and Distributed Computing | 2003

Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems

Angelos Bilas; Dongming Jiang; Jaswinder Pal Singh

Although the shared memory abstraction is gaining ground as a programming abstraction for parallel computing, the main platforms that support it, small-scale symmetric multiprocessors (SMPs) and hardware cache-coherent distributed shared memory systems (DSMs), seem to lie inherently at the extremes of the cost-performance spectrum for parallel systems. In this paper we examine whether shared virtual memory (SVM) clusters can bridge this gap, by studying how application performance scales on a state-of-the-art SVM cluster. We find that: (i) the level of application restructuring needed is quite high compared to applications that perform well on a DSM system of the same scale, and larger problem sizes are needed for good performance; (ii) surprisingly, however, SVM performs quite well for a fairly wide range of applications, achieving at least half the parallel efficiency of a high-end DSM system at the same scale and often much more.


Archive | 2002

Method and apparatus for focused crawling

Dongming Jiang; Arvind Krishnamurthy; Jaswinder Pal Singh; Randolph Y. Wang
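
The patent's text is not reproduced in this listing. As a generic illustration of what focused crawling means (not the patented method): a focused crawler keeps a frontier of URLs ordered by estimated topical relevance and always expands the most promising one, rather than crawling breadth-first. A sketch of such a best-first frontier, with hypothetical URLs and scores:

    /* Sketch: the best-first frontier at the core of a generic focused
       crawler. Generic illustration, not the patented method. URLs are
       scored by estimated topical relevance; the crawler always expands
       the highest-scoring unvisited URL next. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { double score; char url[256]; } entry_t;
    typedef struct { entry_t heap[1024]; int n; } frontier_t;

    /* Push a URL with its relevance score (max-heap sift-up). */
    void frontier_push(frontier_t *f, const char *url, double score) {
        int i = f->n++;
        f->heap[i].score = score;
        strncpy(f->heap[i].url, url, sizeof f->heap[i].url - 1);
        f->heap[i].url[sizeof f->heap[i].url - 1] = '\0';
        while (i > 0 && f->heap[(i - 1) / 2].score < f->heap[i].score) {
            entry_t t = f->heap[i];
            f->heap[i] = f->heap[(i - 1) / 2];
            f->heap[(i - 1) / 2] = t;
            i = (i - 1) / 2;
        }
    }

    /* Pop the most relevant URL (sift-down). */
    entry_t frontier_pop(frontier_t *f) {
        entry_t top = f->heap[0];
        f->heap[0] = f->heap[--f->n];
        int i = 0;
        for (;;) {
            int l = 2 * i + 1, r = l + 1, m = i;
            if (l < f->n && f->heap[l].score > f->heap[m].score) m = l;
            if (r < f->n && f->heap[r].score > f->heap[m].score) m = r;
            if (m == i) break;
            entry_t t = f->heap[i]; f->heap[i] = f->heap[m]; f->heap[m] = t;
            i = m;
        }
        return top;
    }

    int main(void) {
        frontier_t f = { .n = 0 };
        frontier_push(&f, "http://example.com/parallel", 0.90);
        frontier_push(&f, "http://example.com/cooking", 0.10);
        frontier_push(&f, "http://example.com/smp", 0.75);
        while (f.n > 0) {
            entry_t e = frontier_pop(&f);
            printf("crawl %s (score %.2f)\n", e.url, e.score);
            /* A real crawler would fetch e.url, score its out-links,
               and push them back onto the frontier. */
        }
        return 0;
    }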

Collaboration


Dive into Dongming Jiang's collaboration.

Top co-authors:

Xiang Yu (Princeton University)
Yuanyuan Zhou (University of California)