Publication


Featured research published by Weiqun Zhang.


International Supercomputing Conference | 2013

Software Design Space Exploration for Exascale Combustion Co-design

Cy P. Chan; Didem Unat; Michael J. Lijewski; Weiqun Zhang; John B. Bell; John Shalf

The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.
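As a rough illustration of why loop fusion changes memory traffic, the following toy count compares two separate loops with one fused loop. It is not ExaSAT and does not reproduce the paper's model; the function name `traffic_words` and the assumption that no array stays in cache between loops are hypothetical simplifications.

```python
def traffic_words(n: int, fused: bool) -> int:
    """Count array words moved to or from memory for b = g(f(a)) over n elements.

    Assumption (illustrative only): arrays are too large to stay in cache
    between loops, so every array access in a separate loop reaches memory.
    """
    if fused:
        # One loop: read a[i], write b[i]; the intermediate value stays in a register.
        return 2 * n
    # Two loops: read a and write tmp, then read tmp and write b.
    return 4 * n


n = 1_000_000
unfused = traffic_words(n, fused=False)
fused = traffic_words(n, fused=True)
reduction = 1 - fused / unfused
print(f"unfused={unfused} fused={fused} reduction={reduction:.0%}")
```

In this toy case fusion halves the traffic. The 90% figure in the abstract comes from the authors' tuned configurations, not from this sketch.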


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

ExaSAT: An exascale co-design tool for performance modeling

Didem Unat; Cy P. Chan; Weiqun Zhang; Samuel Williams; John Bachan; John B. Bell; John Shalf

One of the emerging challenges to designing HPC systems is understanding and projecting the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytic models are essential because they can provide fast feedback to the co-design centers and chip designers without costly simulations. However, current attempts to analytically model program performance typically rely on the user manually specifying a performance model. We introduce the ExaSAT framework that automates the extraction of parameterized performance models directly from source code using compiler analysis. The parameterized analytic model enables quantitative evaluation of a broad range of hardware design trade-offs and software optimizations on a variety of different performance metrics, with a primary focus on data movement as a metric. We demonstrate the ExaSAT framework’s ability to perform deep code analysis of a proxy application from the Department of Energy Combustion Co-design Center to illustrate its value to the exascale co-design process. ExaSAT analysis provides insights into the hardware and software trade-offs and lays the groundwork for exploring a more targeted set of design points using cycle-accurate architectural simulators.


Combustion Theory and Modelling | 2014

High-order algorithms for compressible reacting flow with complex chemistry

Matthew Emmett; Weiqun Zhang; John B. Bell

In this paper we describe a numerical algorithm for integrating the multicomponent, reacting, compressible Navier–Stokes equations, targeted for direct numerical simulation of combustion phenomena. The algorithm addresses two shortcomings of previous methods. First, it incorporates an eighth-order narrow stencil approximation of diffusive terms that reduces the communication compared to existing methods and removes the need to use a filtering algorithm to remove Nyquist frequency oscillations that are not damped with traditional approaches. The methodology also incorporates a multirate temporal integration strategy that provides an efficient mechanism for treating chemical mechanisms that are stiff relative to fluid dynamical time-scales. The overall methodology is eighth order in space with options for fourth order to eighth order in time. The implementation uses a hybrid programming model designed for effective utilisation of many-core architectures. We present numerical results demonstrating the convergence properties of the algorithm with realistic chemical kinetics and illustrating its performance characteristics. We also present a validation example showing that the algorithm matches detailed results obtained with an established low Mach number solver.


SIAM Journal on Scientific Computing | 2016

BoxLib with Tiling: An Adaptive Mesh Refinement Software Framework

Weiqun Zhang; Ann S. Almgren; Marcus S. Day; Tan Nguyen; John Shalf; Didem Unat

In this paper we introduce a block-structured adaptive mesh refinement software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next-generation architectures is essential. With the expectation of many more cores per node on next-generation architectures, the ability to effectively utilize threads within a node is essential, and the current model for parallelization will not be sufficient. We describe a new version of BoxLib in which the tiling constructs are embedded so that BoxLib-based applications can easily realize expected performance gains without extra effort on the part of the application developer. We also discuss a path forward to enable future versions of BoxLib to take advantage of NUMA-aware optimizations using the TiDA portable library.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

TiDA: High-Level Programming Abstractions for Data Locality Management

Didem Unat; Tan Nguyen; Weiqun Zhang; Muhammed Nufail Farooqi; Burak Bastem; George Michelogiannakis; Ann S. Almgren; John Shalf

The high energy cost of data movement compared to computation gives paramount importance to data locality management in programs. Managing data locality manually is not a trivial task and also complicates programming. Tiling is a well-known approach that provides both data locality and parallelism in an application. However, there is no standard programming construct to express tiling at the application level. We have developed a multicore programming model, TiDA, based on tiling and implemented the model as C++ and Fortran libraries. The proposed programming model has three high-level abstractions: tiles, regions, and a tile iterator. These abstractions in the library hide the details of data decomposition, cache locality optimizations, and memory affinity management in the application. In this paper we unveil the internals of the library and demonstrate the performance and programmability advantages of the model on five applications on multiple NUMA nodes. The library achieves up to 2.10x speedup over OpenMP in a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion proxy application (SMC) on 24 cores. The MPI+TiDA implementation of geometric multigrid demonstrates a 30.9% performance improvement over MPI+OpenMP when scaling to 3072 cores (excluding MPI communication overheads; 8.5% otherwise).
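The idea of a tile iterator can be sketched in a few lines. This is a minimal illustration of iterating over a 2D region tile by tile, not TiDA's actual API: the names `tiles` and `scale_tiled`, the tuple layout, and the choice of Python are all assumptions made for the example.

```python
from typing import Iterator, List, Tuple

# Hypothetical tile descriptor: (i_lo, i_hi, j_lo, j_hi), upper bounds exclusive.
Tile = Tuple[int, int, int, int]


def tiles(ni: int, nj: int, ti: int, tj: int) -> Iterator[Tile]:
    """Yield tiles of at most ti x tj cells that cover an ni x nj region."""
    for i_lo in range(0, ni, ti):
        for j_lo in range(0, nj, tj):
            # Clip the last tile in each direction to the region bounds.
            yield (i_lo, min(i_lo + ti, ni), j_lo, min(j_lo + tj, nj))


def scale_tiled(grid: List[List[float]], factor: float, ti: int, tj: int) -> List[List[float]]:
    """Apply out = factor * grid one tile at a time."""
    ni, nj = len(grid), len(grid[0])
    out = [[0.0] * nj for _ in range(ni)]
    for i_lo, i_hi, j_lo, j_hi in tiles(ni, nj, ti, tj):
        # Each tile is an independent unit of work that could go to one thread.
        for i in range(i_lo, i_hi):
            for j in range(j_lo, j_hi):
                out[i][j] = factor * grid[i][j]
    return out


grid = [[float(i * 7 + j) for j in range(7)] for i in range(5)]
result = scale_tiled(grid, 2.0, ti=2, tj=3)
```

The kernel body never refers to the tile size, so the decomposition can be tuned without touching the computation, which is the separation the abstract attributes to the library's abstractions.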


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Perilla: metadata-based optimizations of an asynchronous runtime for adaptive mesh refinement

Tan Nguyen; Didem Unat; Weiqun Zhang; Ann S. Almgren; Nufail Farooqi; John Shalf

Hardware architecture is increasingly complex, urging the development of asynchronous runtime systems with advanced resource and locality management support. However, this support may come at the cost of complicating the user interface, while programming remains one of the major constraints to wide adoption of asynchronous runtimes in practice. In this paper, we propose a solution that leverages application metadata to enable challenging optimizations as well as to facilitate the task of transforming legacy code to an asynchronous representation. We develop Perilla, a task graph-based runtime system that requires only modest programming effort. Perilla utilizes metadata of an AMR software framework to enable various optimizations at the communication layer without complicating its API. Experimental results with different applications on up to 24K processor cores show that Perilla can realize up to 1.44x speedup over the synchronous code variant. The metadata-enabled optimizations account for 25% to 100% of the performance improvement.


International Conference on Parallel Processing | 2017

Overlapping Data Transfers with Computation on GPU with Tiles

Burak Bastem; Didem Unat; Weiqun Zhang; Ann S. Almgren; John Shalf

GPUs are employed to accelerate scientific applications; however, they require much more programming effort, particularly because of the disjoint address spaces between the host and the device. OpenACC and OpenMP 4.0 provide directive-based programming solutions to alleviate the programming burden; however, synchronous data movement can create a performance bottleneck that prevents taking full advantage of GPUs. We propose a tiling-based programming model and its library that simplifies the development of GPU programs and overlaps the data movement with computation. The programming model decomposes the data and computation into tiles and treats them as the main data transfer and execution units, which enables pipelining the transfers to hide the transfer latency. Moreover, partitioning application data into tiles allows the programmer to still take advantage of the GPU even when the application data cannot fit into the device memory. The library leverages C++ lambda functions, OpenACC directives, CUDA streams, and the tiling API from TiDA to support both productivity and performance. We show the performance of the library on a data transfer-intensive kernel and a compute-intensive kernel and compare its speedup against OpenACC and CUDA. The results indicate that the library can hide the transfer latency, handle the cases where there is insufficient device memory, and achieve reasonable performance.
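The benefit of pipelining tile transfers can be estimated with a simple timing model. The sketch below is an idealized two-stage pipeline (host-to-device transfer, then compute) with fixed per-tile costs; it ignores copy-back to the host, launch overheads, and buffer limits, and the function names and timings are hypothetical rather than taken from the paper.

```python
def serial_time(n_tiles: int, transfer: float, compute: float) -> float:
    """Synchronous execution: each tile is transferred, then computed, in turn."""
    return n_tiles * (transfer + compute)


def pipelined_time(n_tiles: int, transfer: float, compute: float) -> float:
    """Idealized overlap: the transfer of tile k+1 runs during the compute of tile k."""
    if n_tiles == 0:
        return 0.0
    # Fill the pipeline with the first transfer, advance at the pace of the
    # slower stage, then drain with the last tile's compute.
    return transfer + (n_tiles - 1) * max(transfer, compute) + compute


n = 8
t_serial = serial_time(n, transfer=2.0, compute=3.0)
t_pipe = pipelined_time(n, transfer=2.0, compute=3.0)
print(f"serial={t_serial} pipelined={t_pipe} speedup={t_serial / t_pipe:.2f}x")
```

Under these assumptions the cost of the faster stage is almost entirely hidden, and the total time approaches the number of tiles times the cost of the slower stage.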


European Conference on Parallel Processing | 2017

Nonintrusive AMR Asynchrony for Communication Optimization

Muhammed Nufail Farooqi; Didem Unat; Tan Nguyen; Weiqun Zhang; Ann S. Almgren; John Shalf

Adaptive Mesh Refinement (AMR) is a well-known method for efficiently solving partial differential equations. A straightforward AMR algorithm typically exhibits many synchronization points even during a single time step, where costly communication often degrades the performance. This problem will be even more pronounced on future supercomputers containing billion-way parallelism, which will raise the communication cost further. Re-designing AMR algorithms to avoid synchronization is not a viable solution due to the large code size and complex control structures. We present a nonintrusive asynchronous approach to hiding the effects of communication in an AMR application. Specifically, our approach reasons about data dependencies automatically using domain knowledge about AMR applications, allowing asynchrony to be discovered with only a modest amount of code modification. Using this approach, we optimize the synchronous AMR algorithm in the BoxLib software framework without severely affecting the productivity of the application programmer. We observe around 27–31% performance improvement for an advection solver on the Hazel Hen supercomputer using 12288 cores.


Proceedings of the First Workshop on PGAS Applications | 2016

Experiences of applying one-sided communication to nearest-neighbor communication

Hongzhang Shan; Samuel Williams; Yili Zheng; Weiqun Zhang; Bei Wang; Stephane Ethier; Zhengji Zhao

Nearest-neighbor communication is one of the most important communication patterns appearing in many scientific applications. In this paper, we discuss the results of applying UPC++, a library-based partitioned global address space (PGAS) programming extension to C++, to an adaptive mesh framework (BoxLib) and a full scientific application, GTC-P, whose communications are dominated by nearest-neighbor communication. The results on a Cray XC40 system show that, compared with the highly tuned MPI two-sided implementations, UPC++ improves the communication performance by up to 60% and 90% for BoxLib and GTC-P, respectively. We also implement the nearest-neighbor communication using MPI one-sided messages. The performance comparison demonstrates that the MPI one-sided implementation can also improve the communication performance over the two-sided version, but not as significantly as UPC++ does.


Archive | 2013

Tiling as a Durable Abstraction for Parallelism and Data Locality

Didem Unat; Cy P. Chan; Weiqun Zhang; John B. Bell; John Shalf

Collaboration


Dive into Weiqun Zhang's collaboration.

Top Co-Authors

John Shalf

Lawrence Berkeley National Laboratory

Ann S. Almgren

Lawrence Berkeley National Laboratory

Tan Nguyen

Lawrence Berkeley National Laboratory

John B. Bell

Lawrence Berkeley National Laboratory

Cy P. Chan

Lawrence Berkeley National Laboratory

Samuel Williams

Lawrence Berkeley National Laboratory

George Michelogiannakis

Lawrence Berkeley National Laboratory
