Publication


Featured research published by Yongpeng Zhang.


Symposium on Code Generation and Optimization | 2012

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Yongpeng Zhang; Frank Mueller

This paper develops and evaluates search and optimization techniques for auto-tuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a concise specification of stencil behavior, supplied by the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration, and emits code with optimal parameter settings for different GPUs. This auto-tuning approach ensures performance that adapts across GPU generations while greatly enhancing programmer productivity. Experimental results show that the delivered floating-point performance is very close to previous handcrafted work and outperforms other auto-tuned stencil codes by a large margin.


IEEE Transactions on Parallel and Distributed Systems | 2013

Autogeneration and Autotuning of 3D Stencil Codes on Homogeneous and Heterogeneous GPU Clusters

Yongpeng Zhang; Frank Mueller

This paper develops and evaluates search and optimization techniques for autotuning 3D stencil (nearest-neighbor) computations on GPUs. Observations indicate that parameter tuning is necessary for heterogeneous GPUs to achieve optimal performance with respect to a search space. Our proposed framework takes a concise specification of stencil behavior, supplied by the user as a single formula, autogenerates tunable code from it, systematically searches for the best configuration, and emits code with optimal parameter settings for different GPUs. This autotuning approach ensures performance that adapts across GPU generations while greatly enhancing programmer productivity. Experimental results show that the delivered floating-point performance is very close to previous handcrafted work and outperforms other autotuned stencil codes by a large margin. Furthermore, heterogeneous GPU clusters are shown to achieve their highest performance with dissimilar tuning parameters, leveraging partitioning proportional to single-GPU performance.
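The "single formula" the framework consumes is a nearest-neighbor update rule, and the tuner picks among generated code variants by timing them on the target hardware. As a rough illustration of both ideas, here is a minimal NumPy sketch; the function names and the timing-based search loop are illustrative assumptions, not the paper's actual framework or its CUDA output.

```python
import time
import numpy as np

def jacobi7(grid):
    """One sweep of a 7-point 3D stencil: each interior cell becomes
    the average of its six nearest neighbors."""
    out = grid.copy()
    out[1:-1, 1:-1, 1:-1] = (
        grid[:-2, 1:-1, 1:-1] + grid[2:, 1:-1, 1:-1] +
        grid[1:-1, :-2, 1:-1] + grid[1:-1, 2:, 1:-1] +
        grid[1:-1, 1:-1, :-2] + grid[1:-1, 1:-1, 2:]
    ) / 6.0
    return out

def autotune(variants, grid, reps=3):
    """Empirical search: time each candidate code variant on the target
    hardware and keep the fastest, mimicking an auto-tuning loop."""
    best, best_time = None, float("inf")
    for name, fn in variants.items():
        start = time.perf_counter()
        for _ in range(reps):
            fn(grid)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = name, elapsed
    return best
```

On a real GPU the variants would differ in tile shape, thread-block geometry, and shared-memory usage rather than being a single NumPy kernel, but the select-by-measurement structure is the same.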


International Conference on Parallel Processing | 2011

GStream: A General-Purpose Data Streaming Framework on GPU Clusters

Yongpeng Zhang; Frank Mueller

Emerging accelerator architectures, such as GPUs, have proved successful in providing significant performance gains to various application domains. However, their viability for general streaming data remains unclear. In this paper, we propose GStream, a general-purpose, scalable data streaming framework on GPUs. The contributions of GStream are as follows: (1) We provide powerful, yet concise language abstractions suitable for describing conventional algorithms as streaming problems. (2) We project these abstractions onto GPUs to fully exploit their inherent massive data parallelism. (3) We demonstrate the viability of streaming on accelerators. Experiments show that the proposed framework provides flexibility, programmability and performance gains for various benchmarks from a collection of domains, including but not limited to data streaming, data-parallel problems and numerical codes.
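GStream itself targets GPUs from C++; purely as an analogy for abstraction (1), describing a conventional algorithm as a chain of stream operators, here is a hypothetical Python sketch using lazy generators. The operator names are illustrative, not GStream's API.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")

def stream_map(fn: Callable[[T], U], src: Iterable[T]) -> Iterator[U]:
    """Apply fn to every element flowing through the stream."""
    for item in src:
        yield fn(item)

def stream_filter(pred: Callable[[T], bool], src: Iterable[T]) -> Iterator[T]:
    """Pass through only the elements satisfying pred."""
    for item in src:
        if pred(item):
            yield item

# A pipeline: squares of the even numbers in 0..9, evaluated lazily.
pipeline = stream_map(lambda x: x * x,
                      stream_filter(lambda x: x % 2 == 0, range(10)))
```

In a GPU streaming framework, each stage would be mapped onto a kernel so the stream is processed in massively parallel batches rather than element by element.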


Symposium on Code Generation and Optimization | 2013

HiDP: A Hierarchical Data Parallel Language

Frank Mueller; Yongpeng Zhang

Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes. Integrating them with application code is often an unnecessarily daunting task, especially when these routines need to be closely coupled with user code to achieve better performance. This paper contributes HiDP, a high-level hierarchical data parallel language. The purpose of HiDP is to improve the coding productivity of integrating hierarchical data parallelism without significant loss of performance. HiDP is a source-to-source compiler that converts a very concise data parallel language into CUDA C++ source code. Internally, it performs necessary analysis to compose user code with efficient and architecture-aware code snippets. This paper discusses various aspects of HiDP systematically: the language, the compiler and the run-time system with built-in tuning capabilities. They enable HiDP users to express algorithms in less code than low-level SDKs require for native platforms. HiDP also exposes abundant computing resources of modern parallel architectures. Improved coding productivity tends to come with a sacrifice in performance. Yet, experimental results show that the generated code delivers performance very close to handcrafted native GPU code.
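To make "hierarchical data parallelism" concrete: an outer parallel level (for instance, one GPU block per row of a problem) contains an inner parallel level (one thread per element, or an inner reduction). The following is a hypothetical Python sketch of that nesting, not HiDP syntax, which compiles to CUDA C++.

```python
def hier_map(outer_fn, data):
    """Outer parallel level: apply outer_fn independently to each row
    (on a GPU, e.g. one thread block per row)."""
    return [outer_fn(row) for row in data]

def row_normalize(row):
    """Inner parallel level: a reduction (sum of squares) followed by
    an element-wise map, both parallelizable across the row's elements."""
    norm = sum(x * x for x in row) ** 0.5
    return [x / norm for x in row] if norm else list(row)

normalized = hier_map(row_normalize, [[3.0, 4.0], [0.0, 2.0]])
```

The point of a hierarchical language is that both levels are expressed uniformly, letting the compiler decide how to map outer and inner parallelism onto blocks and threads for the target architecture.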


International Parallel and Distributed Processing Symposium | 2010

Large-scale multi-dimensional document clustering on GPU clusters

Yongpeng Zhang; Frank Mueller; Xiaohui Cui; Thomas E. Potok

Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales to over one million documents processed simultaneously on a moderate sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrate the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
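The quadratic cost comes from each document (acting as a boid) comparing itself against every other document in every iteration. A hypothetical 2D sketch of one such flocking step follows; the names, thresholds, and update rule are illustrative, not the paper's algorithm.

```python
import math

def flock_step(positions, similar, radius=1.0, rate=0.1):
    """One flocking iteration: each document drifts toward nearby,
    similar documents. The nested all-pairs loop is the O(n^2) cost
    that the GPU cluster is used to absorb."""
    updated = []
    for i, (xi, yi) in enumerate(positions):
        dx = dy = 0.0
        for j, (xj, yj) in enumerate(positions):
            if i != j and math.hypot(xi - xj, yi - yj) < radius and similar(i, j):
                dx += xj - xi   # attraction toward a similar neighbor
                dy += yj - yi
        updated.append((xi + rate * dx, yi + rate * dy))
    return updated
```

Over many iterations, similar documents coalesce into spatial clusters regardless of the initial layout, which is the insensitivity to the initial state noted above.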


Archive | 2009

GPU-Accelerated Text Mining

Yongpeng Zhang; Frank Mueller; Xiaohui Cui; Thomas E. Potok


International Conference on Parallel Processing | 2012

CuNesl: Compiling Nested Data-Parallel Languages for SIMT Architectures

Yongpeng Zhang; Frank Mueller


Archive | 2009

A Programming Model for Massive Data Parallelism with Data Dependencies

Xiaohui Cui; Frank Mueller; Thomas E. Potok; Yongpeng Zhang


Journal of Parallel and Distributed Computing | 2010

Data-Intensive Document Clustering on GPU Clusters

Yongpeng Zhang; Frank Mueller; Xiaohui Cui; Thomas E. Potok


Archive | 2012

Exploiting Data-Parallelism in GPUs

Frank Mueller; Yongpeng Zhang

Collaboration


Dive into Yongpeng Zhang's collaborations.

Top Co-Authors


Frank Mueller

North Carolina State University


Thomas E. Potok

Oak Ridge National Laboratory


Xiaohui Cui

Oak Ridge National Laboratory
