Bei Wang
Princeton University
Publications
Featured research published by Bei Wang.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Bei Wang; Stephane Ethier; William Tang; Timothy J. Williams; Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Leonid Oliker
Reliable predictive simulation capability addressing confinement properties in magnetically confined fusion plasmas is critically important for ITER, a 20 billion dollar international burning plasma device under construction in France. The complex study of kinetic turbulence, which can severely limit the energy confinement and impact the economic viability of fusion systems, requires simulations at extreme scale for such an unprecedented device size. Our newly optimized, global, ab initio particle-in-cell code solving the nonlinear equations underlying gyrokinetic theory achieves excellent performance with respect to “time to solution” at the full capacity of the IBM Blue Gene/Q on the 786,432 cores of Mira at ALCF and, more recently, on the 1,572,864 cores of Sequoia at LLNL. Recent multithreading and domain decomposition optimizations in the new GTC-P code represent critically important software advances for modern, low-memory-per-core systems by enabling routine simulations at unprecedented size (130 million grid points at ITER scale) and resolution (65 billion particles).
IEEE International Conference on High Performance Computing, Data, and Analytics | 2013
Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Bei Wang; Stephane Ethier; Leonid Oliker
The Gyrokinetic Toroidal Code (GTC) uses the particle-in-cell method to efficiently simulate plasma microturbulence. This work presents novel analysis and optimization techniques to enhance the performance of GTC on large-scale machines. We introduce cell access analysis to better manage locality vs. synchronization tradeoffs on CPU and GPU-based architectures. Our optimized hybrid parallel implementation of GTC uses MPI, OpenMP, and NVIDIA CUDA, achieves up to a 2× speedup over the reference Fortran version on multiple parallel systems, and scales efficiently to tens of thousands of cores.
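The locality vs. synchronization tradeoff mentioned in this abstract arises in the charge deposition (scatter) step of any particle-in-cell code. The following is a minimal, serial 1-D sketch of that step, not GTC's actual implementation; the function name `deposit_charge`, its parameters, and the periodic linear-weighting scheme are assumptions chosen for illustration.

```python
import math


def deposit_charge(positions, charge, num_cells, dx):
    """Scatter particle charge onto a periodic 1-D grid using linear weights.

    Hypothetical illustration only: real gyrokinetic codes deposit onto
    multi-dimensional grids with gyro-averaging.
    """
    grid = [0.0] * num_cells
    for x in positions:
        s = x / dx
        cell = math.floor(s)          # grid point to the left of the particle
        frac = s - cell               # fractional distance past that point
        left = cell % num_cells       # periodic wrap-around
        right = (cell + 1) % num_cells
        # Two particles in the same cell update the same grid entries.
        # In a threaded version these updates would race, so they need
        # atomic operations or per-thread private grid copies.
        grid[left] += charge * (1.0 - frac)
        grid[right] += charge * frac
    return grid


grid = deposit_charge([0.25, 1.5, 3.75], charge=1.0, num_cells=4, dx=1.0)
print(grid)
```

Each particle's charge is split between its two neighboring grid points, so the total deposited charge equals the number of particles times `charge`. Sorting particles by cell improves locality of the grid updates, while private grid copies trade extra memory for fewer synchronized writes.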
International Journal of High Performance Computing Applications | 2017
Bei Wang; Stephane Ethier; William Tang; Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Leonid Oliker
The gyrokinetic toroidal code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5-D Vlasov–Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P’s multiple levels of parallelism, including internode 2-D domain decomposition and particle decomposition, as well as intranode shared memory partition and vectorization, have enabled pushing the scalability of the PIC method to extreme computational scales. In this article, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) coprocessors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of ion–temperature–gradient driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects, and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Computing in Science and Engineering | 2014
W. M. Tang; Bei Wang; Stephane Ethier
The primary goal of the extreme-scale plasma turbulence studies described in this article is to gain new insights on confinement scaling in magnetic fusion systems by using powerful, world-class supercomputers to carry out simulations with unprecedented resolution and temporal duration. New insights have been gained on the key question of how the turbulent transport of thermal energy and particles in the plasma and the associated confinement scale from present-generation devices to much larger ITER-size plasmas. In particular, new results from large-scale simulation studies have demonstrated that improvement in confinement as devices grow larger is far more gradual, with significantly lower loss rates than less-powerful computer simulations have indicated in research carried out over the past decade.
Parallel Computing | 2016
Keisuke Tsugane; Taisuke Boku; Hitoshi Murai; Mitsuhisa Sato; W. M. Tang; Bei Wang
We propose the hybrid-view programming approach in the PGAS language XcalableMP. We port Gyrokinetic Toroidal Code - Princeton (GTC-P) to XcalableMP. The comparison of the performance and productivity with XMP and MPI implementations. Hybrid-view implementation increases the readability of the code.
Recently, the Partitioned Global Address Space (PGAS) parallel programming model has emerged as a usable distributed memory programming model. XcalableMP (XMP) is a PGAS parallel programming language that extends base languages such as C and Fortran with directives in OpenMP-like style. XMP supports a global-view model that allows programmers to define global data and to map them to a set of processors, which execute the distributed global data as a single thread. In XMP, the concept of a coarray is also employed for local-view programming. In this study, we port Gyrokinetic Toroidal Code - Princeton (GTC-P), which is a three-dimensional gyrokinetic PIC code developed at Princeton University to study the microturbulence phenomenon in magnetically confined fusion plasmas, to XMP as an example of hybrid memory model coding with the global-view and local-view programming models. In local-view programming, the coarray notation is simple and intuitive compared with Message Passing Interface (MPI) programming while the performance is comparable to that of the MPI version. Thus, because the global-view programming model is suitable for expressing the data parallelism for a field of grid space data, we implement a hybrid-view version using a global-view programming model to compute the field and a local-view programming model to compute the movement of particles. The performance is degraded by 20% compared with the original MPI version, but the hybrid-view version facilitates more natural data expression for static grid space data (in the global-view model) and dynamic particle data (in the local-view model), and it also increases the readability of the code for higher productivity.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2016
William Tang; Bei Wang; Stephane Ethier; Grzegorz Kwasniewski; Torsten Hoefler; Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Leonid Oliker; Carlos Rosales-Fernandez; Timothy J. Williams
The goal of the extreme scale plasma turbulence studies described in this paper is to expedite the delivery of reliable predictions on confinement physics in large magnetic fusion systems by using world-class supercomputers to carry out simulations with unprecedented resolution and temporal duration. This has involved architecture-dependent optimizations of performance scaling and addressing code portability and energy issues, with the metrics for multi-platform comparisons being “time-to-solution” and “energy-to-solution”. Realistic results addressing how confinement losses caused by plasma turbulence scale from present-day devices to the much larger $25 billion international ITER fusion facility have been enabled by innovative advances in the GTC-P code including (i) implementation of one-sided communication from the MPI 3.0 standard; (ii) creative optimization techniques on Xeon Phi processors; and (iii) development of a novel performance model for the key kernels of the PIC code. Results show that modeling data movement is sufficient to predict performance on modern supercomputer platforms.
Parallel and Distributed Computing: Applications and Technologies | 2016
Yueming Wei; Yichao Wang; Linjin Cai; William Tang; Bei Wang; Stephane Ethier; Simon See; James Lin
Accelerator-based heterogeneous computing is of paramount importance to High Performance Computing. The increasing complexity of the cluster architectures requires more generic, high-level programming models. OpenACC is a directive-based parallel programming model, which provides performance on and portability across a wide variety of platforms, including GPU, multicore CPU, and many-core processors. GTC-P is a discovery-science-capable real-world application code based on the Particle-In-Cell (PIC) algorithm that is well-established in the HPC area. Several native versions of GTC-P have been developed for supercomputers on TOP500 with different architectures, including Titan, Mira, etc. Motivated by the state-of-the-art portability, we implemented the first OpenACC version of GTC-P and evaluated its performance portability across NVIDIA GPUs, Intel x86 and OpenPOWER CPUs. In this paper, we also proposed two key optimization methods for OpenACC implementation of the PIC algorithm on multicore CPU and GPU, including removing atomic operations and taking advantage of shared memory. OpenACC shows both impressive productivity and performance from the perspective of portability and scalability. The OpenACC version achieves more than 90% performance compared with the native versions with only about 300 LOC.
Journal of Medicinal Chemistry | 2015
Chunjian Liu; James Lin; John Hynes; Hong Wu; Stephen T. Wrobleski; Shuqun Lin; T. G. Murali Dhar; Jung-Hui Sun; Sam T. Chao; Rulin Zhao; Bei Wang; Bang-Chi Chen; Gerry Everlof; Christoph Gesenberg; Hongjian Zhang; Punit Marathe; Kim W. McIntyre; Tracy L. Taylor; Kathleen M. Gillooly; David J. Shuster; Murray McKinnon; John H. Dodd; Joel C. Barrish; Gary L. Schieven; Katerina Leftheris
In a search for prodrugs to address the issue of pH-dependent solubility and exposure associated with 1 (BMS-582949), a previously disclosed phase II clinical p38α MAP kinase inhibitor, a structurally novel clinical prodrug, 2 (BMS-751324), featuring a carbamoylmethylene linked promoiety containing hydroxyphenyl acetic acid (HPA) derived ester and phosphate functionalities, was identified. Prodrug 2 was not only stable but also water-soluble under both acidic and neutral conditions. It was effectively bioconverted into parent drug 1 in vivo by alkaline phosphatase and esterase in a stepwise manner, providing higher exposure of 1 compared to its direct administration, especially within higher dose ranges. In a rat LPS-induced TNFα pharmacodynamic model and a rat adjuvant arthritis model, 2 demonstrated similar efficacy to 1. Most importantly, it was shown in clinical studies that prodrug 2 was indeed effective in addressing the pH-dependent absorption issue associated with 1.
Proceedings of the First Workshop on PGAS Applications | 2016
Hongzhang Shan; Samuel Williams; Yili Zheng; Weiqun Zhang; Bei Wang; Stephane Ethier; Zhengji Zhao
Nearest-neighbor communication is one of the most important communication patterns appearing in many scientific applications. In this paper, we discuss the results of applying UPC++, a library-based partitioned global address space (PGAS) programming extension to C++, to an adaptive mesh framework (BoxLib), and a full scientific application GTC-P, whose communications are dominated by the nearest-neighbor communication. The results on a Cray XC40 system show that compared with the highly-tuned MPI two-sided implementations, UPC++ improves the communication performance up to 60% and 90% for BoxLib and GTC-P, respectively. We also implement the nearest-neighbor communication using MPI one-sided messages. The performance comparison demonstrates that the MPI one-sided implementation can also improve the communication performance over the two-sided version, but not as significantly as UPC++ does.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Bei Wang; Stéphane Ethier; William Tang; Khaled Z. Ibrahim; Kamesh Madduri; Samuel Williams; Leonid Oliker; Timothy J. Williams
