Xinliang Wang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Xinliang Wang is active.

Explore More

Publication

Featured researches published by Xinliang Wang.

IEEE Transactions on Computers | 2015

Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2

Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Junfeng Liao; Lin Gan; Yutong Lu; Rajiv Ranjan; Lizhe Wang

In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesocale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balance utilization of different computing units. Proper communication-computation overlap and concurrent data transfer methods are utilized to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak scaling efficiency, and achieve over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community to accelerating atmospheric models on increasingly dominated heterogeneous supercomputers.

international parallel and distributed processing symposium | 2014

Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2

Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Lin Gan; Yutong Lu; Xiaoqian Zhu

This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the worlds current top-ranked supercomputer developed by Chinas National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partition between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between CPU and accelerators. The acceleration of our global atmospheric model is conducted to improve the use of the Intel MIC architecture. For the single-node test on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we can achieve 2.07×, 3.18×, and 4.35× speedups when using one, two, and three Intel Xeon Phi accelerators respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5× (out of the 8× theoretical case). Based on successful computation-communication overlapping, large-scale tests indicate that a nearly ideal weak-scaling efficiency of 93.5% is obtained when we gradually increase the number of nodes from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 5,664 × 6 mesh with 77.6 billion unknowns.

ieee international conference on high performance computing data and analytics | 2016

10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics

Chao Yang; Wei Xue; Haohuan Fu; Hongtao You; Xinliang Wang; Yulong Ao; Fangfang Liu; Lin Gan; Ping Xu; Lanning Wang; Guangwen Yang; Weimin Zheng

An ultra-scalable fully-implicit solver is developed for stiff time-dependent problems arising from the hyperbolic conservation laws in nonhydrostatic atmospheric dynamics. In the solver, we propose a highly efficient hybrid domain-decomposed multigrid preconditioner that can greatly accelerate the convergence rate at the extreme scale. For solving the overlapped subdomain problems, a geometry-based pipelined incomplete LU factorization method is designed to further exploit the on-chip fine-grained concurrency. We perform systematic optimizations on different hardware levels to achieve best utilization of the heterogeneous computing units and substantial reduction of data movement cost. The fully-implicit solver successfully scales to the entire system of the Sunway TaihuLight supercomputer with over 10.5M heterogeneous cores, sustaining an aggregate performance of 7.95 PFLOPS in double-precision, and enables fast and accurate atmospheric simulations at the 488-m horizontal resolution (over 770 billion unknowns) with 0.07 simulated-years-per-day. This is, to our knowledge, the largest fully-implicit simulation to date.

ieee international conference on high performance computing data and analytics | 2016

Refactoring and optimizing the community atmosphere model (CAM) on the sunway taihulight supercomputer

Haohuan Fu; Junfeng Liao; Wei Xue; Lanning Wang; Dexun Chen; Long Gu; Jinxiu Xu; Nan Ding; Xinliang Wang; Conghui He; Shizhen Xu; Yishuang Liang; Jiarui Fang; Yuanchao Xu; Weijie Zheng; Jingheng Xu; Zhen Zheng; Wanjing Wei; Xu Ji; He Zhang; Bingwei Chen; Kaiwei Li; Xiaomeng Huang; Wenguang Chen; Guangwen Yang

This paper reports our efforts on refactoring and optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight supercomputer, which uses a many-core processor that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To map the large code base of CAM to the millions of cores on the Sunway system, we take OpenACC-based refactoring as the major approach, and apply source-to-source translator tools to exploit the most suitable parallelism for the CPE cluster, and to fit the intermediate variable into the limited on-chip fast buffer. For individual kernels, when comparing the original ported version using only MPEs and the refactored version using both the MPE and CPE clusters, we achieve up to 22× speedup for the compute-intensive kernels. For the 25km resolution CAM global model, we manage to scale to 24,000 MPEs, and 1,536,000 CPEs, and achieve a simulation speed of 2.81 model years per day.

international conference on parallel and distributed systems | 2014

Scaling and analyzing the stencil performance on multi-core and many-core architectures

Lin Gan; Haohuan Fu; Wei Xue; Yangtong Xu; Chao Yang; Xinliang Wang; Zihong Lv; Yang You; Guangwen Yang; Kaijian Ou

Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compilers inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.

Biochemistry | 2002

Subunit interaction slows the unfolding of the N-terminal domain of creatine kinase in urea.

Zhenhua Guo; Zidong Wang; Xinliang Wang

Fluorescence emission intensity changes with two different excitation wavelengths were used to measure the unfolding rate constants of different domains of muscle type creatine kinase (CK-MM) according to the heterogeneity of aromatic amino acid distributions in the crystal structure of CK-MM. The results were compared with those of brain type creatine kinase (CK-BB) and dithio-bis(succinimidyl propionate) cross-linked CK-MM. CK-BB differed greatly in its distribution of aromatic amino acids in each domain and the unfolding process of cross-linked CK-MM was not accompanied by the dissociation of the dimer. The N-terminal domain of CK-MM was shown to be well protected by subunit interaction during the unfolding of CK-MM in 4 M urea. Dissociating the CK dimer in high urea concentration (≥6 M) eliminated the subunit protection. Subunit interactions are also important in preserving secondary structure and forming contracted conformation at low urea concentration.

international parallel and distributed processing symposium | 2017

26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight

Yulong Ao; Chao Yang; Xinliang Wang; Wei Xue; Haohuan Fu; Fangfang Liu; Lin Gan; Ping Xu; Wenjing Ma

Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to the memory bound nature, it is a challenging task to opti- mize stencil computation kernels on modern supercomputers with relatively high computing throughput whilst relatively low data-moving capability. This work serves as a demon- stration on the details of the algorithms, implementations and optimizations of a real-world stencil computation in 3D nonhydrostatic atmospheric modeling on the newly announced Sunway TaihuLight supercomputer. At the algorithm level, we present a computation-communication overlapping technique to reduce the inter-process communication overhead, a locality- aware blocking method to fully exploit on-chip parallelism with enhanced data locality, and a collaborative data accessing scheme for sharing data among different threads. In addition, a variety of effective hardware specific implementation and optimization strategies on both the process- and thread-level, from the fine-grained data management to the data layout transformation, are developed to further improve the per- formance. Our experiments demonstrate that a single-process many-core speedup of as high as 170x can be achieved by using the proposed algorithm and optimization strategies. The code scales well to millions of cores in terms of strong scalability. And for the weak-scaling tests, the code can scale in a nearly ideal way to the full system scale of more than 10 million cores, sustaining 25.96 PFLOPS in double precision, which is 20% of the peak performance.

international parallel and distributed processing symposium | 2016

A Fast Tridiagonal Solver for Intel MIC Architecture

Xinliang Wang; Wei Xue; Jidong Zhai; Yangtong Xu; Weimin Zheng; Hai-Xiang Lin

The tridiagonal solver is an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms have been exploited recently, challenges still remain when solving tridiagonal systems on many-core architectures. In this paper, quantitative analysis is conducted to guide the selection of algorithms on different architectures, and a fast register-PCR-pThomas algorithm for Intel MIC architecture is presented to achieve the best utilization of in-core vectorization and registers. By choosing the most suitable number of PCR reductions, we further propose an improved algorithm, named register-PCR-half-pThomas algorithm, which minimizes the computational cost and the number of registers for use. The optimizations of manual prefetching and strength reduction are also discussed for the new algorithm. Evaluation on Intel Xeon Phi 7110P shows that our register-PCR-pThomas can outperform the MKL solver by 7.7 times in single precision and 3.4 times in double precision. Moreover, our optimized register-PCR-half-pThomas can further boost the performance by another 43.9% in single precision and 20.4% in double precision. Our best algorithm can outperform the latest official GPU library (cuSPARSE) on Nvidia K40 by 3.7 times and 2.4 times in single and double precision, respectively.

cluster computing and the grid | 2016

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods

Jingheng Xu; Haohuan Fu; Lin Gan; Chao Yang; Wei Xue; Shizhen Xu; Wenlai Zhao; Xinliang Wang; Bingwei Chen; Guangwen Yang

Scientific HPC applications are increasingly ported to GPUs to benefit from both the high throughput and the powerful computing capacity. Many of these applications, such as atmospheric modeling and hydraulic erosion simulation, are adopting the finite volume method (FVM) as the solver algorithm. However, the communication components inside these applications generally lead to a low flop-to-byte ratio and an inefficient utilization of GPU resources. This paper aims at optimizing FVM solver based on the structured mesh. Besides a high-level overview of the finite-volume method as well as its basic optimizations on modern GPU platforms, we further present two generalized tuning techniques including an explicit cache mechanism as well as an inner-thread rescheduling method that tries to achieve a suitable mapping between the algorithm feature and the platform architecture. To the end, we demonstrate the impact of our generalized optimization methods in two typical atmospheric dynamic kernels (Euler and SWE) based on four mainstream GPU platforms. According to the experimental results of Tesla K80, speedups of 24.4x for SWE and 31.5x for Euler could be achieved over a 12-core Intel E5-2697 CPU, which is a great promotion compared with its original speedup (18x and 15.47x) without applying these two methods.

Biochemistry | 2002

p-Chloromercuribenzoate-induced inactivation and partial unfolding of porcine heart lactate dehydrogenase.

Yan-bin Zheng; Bao-Yu Chen; Xinliang Wang

Purified porcine heart lactate dehydrogenase was inactivated and partially unfolded with p-chloromercuribenzoate (pCMB). With the increase of pCMB/enzyme ratio the enzyme was gradually inhibited till almost completely inactivated at the pCMB/enzyme ratio of 20 : 1. Native polyacrylamide gel electrophoresis showed that with the increase of pCMB/enzyme ratio the bands of native enzyme decreased till completely vanished. Meanwhile inactive multiple bands emerged and became thicker, which implied that lactate dehydrogenase became loose. The conformational changes of the enzyme molecule modified with pCMB were followed using fluorescence emission, ultraviolet difference, and circular dichroism (CD) spectra. Increasing pCMB concentration resulted in the decrease of fluorescence emission intensity. The ultraviolet difference spectra of the enzyme modified with pCMB exhibited an increasing absorbance in the vicinity of 240 nm with the increasing concentration of the inhibitor. The changes of the fluorescence and ultraviolet difference spectra reflected the conformational changes of the enzyme. The CD spectrum changes of the enzyme showed that its secondary structure changed as well. These results suggest that pCMB not only inhibits this enzyme but also influences its conformation (partial unfolding).

Explore More