Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Qiang Wu is active.

Publication


Featured research published by Qiang Wu.


Journal of Parallel and Distributed Computing | 2013

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

Qiang Wu; Canqun Yang; Tao Tang; Liquan Xiao

Heterogeneous systems with nodes containing more than one type of computation unit, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extensions in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as intra-node communication hiding and memory optimizations in both CPUs and GPUs, we have implemented and evaluated an MD simulation on TH-1A, a petascale heterogeneous supercomputer. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations.
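The first level of such a scheme is inter-node spatial decomposition: particles are assigned to sub-domains by position, one sub-domain per node. A minimal sketch of that step (function and parameter names are illustrative, not the paper's implementation):

```python
import random

def decompose(particles, box, nodes_per_dim):
    """Assign each particle to a sub-domain (one per node) by position.

    Illustrates the inter-node level of a TLPS-style hierarchy; each
    sub-domain would then be handled by one compute node.
    """
    cell = box / nodes_per_dim
    domains = {}
    for p in particles:
        key = tuple(min(int(c // cell), nodes_per_dim - 1) for c in p)
        domains.setdefault(key, []).append(p)
    return domains

random.seed(0)
parts = [tuple(random.uniform(0.0, 10.0) for _ in range(3)) for _ in range(8)]
doms = decompose(parts, 10.0, 2)
# Every particle lands in exactly one of the 2x2x2 sub-domains.
assert sum(len(v) for v in doms.values()) == len(parts)
```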


Journal of Computer Science and Technology | 2014

OpenMC: Towards Simplifying Programming for TianHe Supercomputers

XiangKe Liao; Can-Qun Yang; Tao Tang; Huizhan Yi; Feng Wang; Qiang Wu; Jingling Xue

Modern petascale and future exascale systems are massively heterogeneous architectures. Developing productive intra-node programming models is crucial to addressing their programming challenges. We introduce a directive-based intra-node programming model, OpenMC, and show that this new model can achieve ease of programming, high performance, and the degree of portability desired for heterogeneous nodes, especially those in TianHe supercomputers. While existing models are geared towards offloading computations to accelerators (typically one), OpenMC aims to more uniformly and adequately exploit the potential offered by multiple CPUs and accelerators in a compute node. OpenMC achieves this by providing a unified abstraction of hardware resources as workers and facilitating the exploitation of asynchronous task parallelism on the workers. We present an overview of OpenMC, a prototype implementation, and results from some initial comparisons with OpenMP and hand-written code in developing six applications on two types of nodes from TianHe supercomputers.
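The worker abstraction (CPUs and accelerators exposed uniformly as task-running workers) can be loosely mimicked in plain Python as a sketch; this is an analogy with invented names, not OpenMC's actual directive syntax:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical uniform "worker" view: each worker just runs tasks
# asynchronously; in OpenMC a worker may be a CPU or an accelerator.
workers = ThreadPoolExecutor(max_workers=4)

def task(chunk):
    return sum(x * x for x in chunk)

data = list(range(100))
chunks = [data[i::4] for i in range(4)]       # partition work across workers
futures = [workers.submit(task, c) for c in chunks]  # asynchronous launch
total = sum(f.result() for f in futures)      # synchronize and combine
assert total == sum(x * x for x in data)
```

The key point mirrored here is that the caller sees only workers and tasks, not the underlying device types.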


Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores | 2013

MIC acceleration of short-range molecular dynamics simulations

Qiang Wu; Canqun Yang; Tao Tang; Liquan Xiao

Heterogeneous systems containing accelerators such as GPUs or co-processors such as Intel MIC are becoming more prevalent due to their ability to exploit large-scale parallelism in applications. In this paper, we have developed a hierarchical parallelization scheme for molecular dynamics simulations on CPU-MIC heterogeneous systems. The scheme exploits multi-level parallelism combining (1) task-level parallelism using a tightly-coupled division method, (2) thread-level parallelism employing spatial decomposition through dynamically scheduled multi-threading, and (3) data-level parallelism via SIMD technology. By employing a hierarchy of parallelism with several optimization methods such as memory latency hiding and data pre-fetching, our MD code running on a CPU-MIC heterogeneous system (one 2.60 GHz eight-core Intel Xeon E5-2670 CPU and one 57-core Intel Knights Corner co-processor) achieves (1) multi-thread parallel efficiency of 72.4% for 57 threads on the co-processor with up to 7.62 times SIMD speedup on each core for the force computation task, and (2) up to 2.25 times speedup on the CPU-MIC system over the pure CPU system, which outperforms our previous work on a CPU-GPU (one NVIDIA Tesla M2050) platform. Our work shows that MD simulations can benefit enormously from CPU-MIC heterogeneous platforms.
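The reported efficiency number can be cross-checked with the standard definition: parallel efficiency is speedup divided by thread count, so 72.4% efficiency on 57 threads corresponds to a speedup of roughly 41×:

```python
# Parallel efficiency = speedup / threads, so speedup = efficiency * threads.
threads = 57
efficiency = 0.724
speedup = efficiency * threads
assert round(speedup, 1) == 41.3  # about 41x over one co-processor thread
```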


International Parallel and Distributed Processing Symposium | 2012

A Fast Parallel Implementation of Molecular Dynamics with the Morse Potential on a Heterogeneous Petascale Supercomputer

Qiang Wu; Canqun Yang; Feng Wang; Jingling Xue

Molecular Dynamics (MD) simulations have been widely used in the study of macromolecules. To ensure an acceptable level of statistical accuracy, a relatively large number of particles is needed, which calls for high-performance implementations of MD. These days, heterogeneous systems, with their high performance potential, low power consumption, and high price-performance ratio, offer a viable alternative for running MD simulations. In this paper, we introduce a fast parallel implementation of MD simulation with the Morse potential on Tianhe-1A, a petascale heterogeneous supercomputer. Our code achieves a speedup of 3.6× on one NVIDIA Tesla M2050 GPU (containing 14 Streaming Multiprocessors) compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, our code runs faster on 1024 compute nodes (with two CPUs and one GPU inside a node) than on 4096 GPU-excluded nodes, effectively rendering one GPU more efficient than six six-core CPUs. Our work shows that large-scale MD simulations can benefit enormously from GPU acceleration in petascale supercomputing platforms. Our performance results are achieved by using (1) a patch-cell design to exploit parallelism across the simulation domain, (2) a new GPU kernel developed by taking advantage of Newton's Third Law to reduce redundant force computation on GPUs, and (3) two optimization methods: a dynamic load balancing strategy that adjusts the workload, and a communication overlapping method to overlap the communications between CPUs and GPUs.
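The Morse potential itself is standard: V(r) = D_e (1 - e^{-a(r - r_e)})^2 with pair force F = -dV/dr. A plain reference implementation (parameter values chosen arbitrarily for illustration, not taken from the paper):

```python
import math

def morse_energy(r, d_e=1.0, a=1.0, r_e=1.0):
    """Morse pair potential V(r) = D_e * (1 - exp(-a*(r - r_e)))**2."""
    return d_e * (1.0 - math.exp(-a * (r - r_e))) ** 2

def morse_force(r, d_e=1.0, a=1.0, r_e=1.0):
    """F(r) = -dV/dr = -2*a*D_e*(1 - exp(-a*(r - r_e)))*exp(-a*(r - r_e))."""
    e = math.exp(-a * (r - r_e))
    return -2.0 * a * d_e * (1.0 - e) * e

# The potential minimum sits at r = r_e, where both V and F vanish.
assert abs(morse_energy(1.0)) < 1e-12
assert abs(morse_force(1.0)) < 1e-12
```

On the GPU this inner function is evaluated for every interacting pair, which is why halving the pair list via Newton's Third Law pays off.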


International Conference on Computer Science and Education | 2012

Constant memory optimizations in MD5 Crypt cracking algorithm on GPU-accelerated supercomputer using CUDA

Feng Wang; Canqun Yang; Qiang Wu; Zhicai Shi

MD5 Crypt is a cryptographic algorithm commonly used in UNIX systems for authentication. The additional randomization of the salt and the complexity of the scheme render traditional password cracking techniques ineffective on common computing systems, guaranteeing the security of the system. With the recent rise of petaflops heterogeneous supercomputer systems such as Tianhe-1A, the security of MD5 Crypt again faces the threat of brute force attack. Much work has been done on GPU-accelerated platforms to improve the performance of MD5 Crypt cracking. However, little improvement has been achieved using the constant memory of the CUDA architecture. This paper explores the problem and achieves a 44.6% improvement by allocating constant memory to the padding array. It also presents a highly scalable implementation of a brute force attack on MD5 Crypt on Tianhe-1A, the fastest heterogeneous supercomputer in the world. The experimental results show that 326 thousand MD5 hashes can be checked per second on a single compute node, outperforming the CPU version by 5.7×. On multiple nodes, the implementation also shows great scalability. Consequently, this poses a new challenge to the security of MD5 Crypt for authentication.
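The reported per-node rate gives a feel for why scalability matters. A back-of-the-envelope estimate (keyspace and node count here are illustrative assumptions, not figures from the paper):

```python
rate_per_node = 326_000       # MD5 Crypt hashes checked per second (reported)
keyspace = 26 ** 6            # assume passwords of 6 lowercase letters
nodes = 1024                  # hypothetical node count on Tianhe-1A
seconds = keyspace / (rate_per_node * nodes)
# At scale, the whole 6-letter space falls in under a second.
assert seconds < 1.0
```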


Tsinghua Science & Technology | 2012

Fast parallel cutoff pair interactions for molecular dynamics on heterogeneous systems

Qiang Wu; Canqun Yang; Tao Tang; Kai Lu

Heterogeneous systems with both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-ranged Molecular Dynamics (MD) simulations. The most time-consuming task in short-ranged MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the number of distance examination operations, previous works have tiled interactions by employing the spatial attribute, which increases the memory accesses and GPU computations, hence decreasing performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which has poor scalability. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to the spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. Therefore, the algorithm reduces the distance examination operations and limits additional memory accesses and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16× on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41× faster on 256 compute nodes of Tianhe-1A (with two CPUs and one GPU inside a node) than on 256 GPU-excluded nodes.
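The voxel idea can be sketched compactly: bin particles into cells of side equal to the cutoff, then examine only pairs from neighboring cells instead of all-versus-all. A simplified 2D sketch (not the paper's GPU implementation):

```python
from itertools import product

def cutoff_pairs(particles, cutoff):
    """Find pairs within `cutoff` by binning into voxels of side `cutoff`
    and checking only neighboring voxels, instead of all-versus-all."""
    voxels = {}
    for i, (x, y) in enumerate(particles):
        voxels.setdefault((int(x // cutoff), int(y // cutoff)), []).append(i)
    pairs = set()
    for (vx, vy), members in voxels.items():
        for dx, dy in product((-1, 0, 1), repeat=2):   # 3x3 neighborhood
            for j in voxels.get((vx + dx, vy + dy), []):
                for i in members:
                    if i < j:
                        xi, yi = particles[i]
                        xj, yj = particles[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= cutoff ** 2:
                            pairs.add((i, j))
    return pairs

pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
# Only the close pair survives; the distant particle is never examined
# against particles outside its voxel neighborhood.
assert cutoff_pairs(pts, 1.0) == {(0, 1)}
```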


Unconventional High Performance Computing | 2009

Accelerating PQMRCGSTAB algorithm on GPU

Canqun Yang; Zhen Ge; Juan Chen; Feng Wang; Qiang Wu

General-purpose computation on GPUs is becoming more and more popular because of the GPU's powerful computing ability. In this paper, we focus on using the GPU to accelerate a sparse linear system solver, preconditioned QMRCGSTAB (PQMRCGSTAB for short). We implemented a GPU-accelerated PQMRCGSTAB algorithm on an NVIDIA Tesla C870. Three optimization methods are used to improve the performance of the GPU-accelerated PQMRCGSTAB algorithm: reorganizing data by data packing and matrix mending to obtain higher memory bandwidth; using texture memory instead of shared memory; and merging kernels. The experimental results show that the GPU-accelerated PQMRCGSTAB algorithm achieves a peak performance of 17.7 GFLOPS on the C870. Compared with an MPI version of the PQMRCGSTAB algorithm executed on an Intel Xeon quad-core CPU, the GPU-accelerated PQMRCGSTAB reaches a speedup of five for a 640K×640K matrix.
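The dominant kernel in Krylov solvers of this family is the sparse matrix-vector product, which is exactly where data packing and matrix mending pay off. A minimal CSR SpMV as a plain-Python correctness reference (not the GPU data layout):

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in Compressed Sparse Row form.

    Reorganizing these three arrays (packing, padding short rows) is
    the kind of memory-bandwidth optimization applied on the GPU; this
    is only a correctness reference.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
vals, cols, rows = [2.0, 1.0, 3.0, 4.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5]
assert csr_matvec(vals, cols, rows, [1.0, 1.0, 1.0]) == [3.0, 3.0, 9.0]
```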


Advanced Materials Research | 2013

Accelerating PQMRCGSTAB Algorithm on Xeon Phi

Cheng Chen; Can Qun Yang; Wen Ke Yao; Jin Qi; Qiang Wu

Utilizing iterative methods to solve large sparse linear systems is the key to many practical mathematical and physical problems. Recently, Intel released Xeon Phi, a many-core processor of Intel's Many Integrated Core (MIC) architecture, which comprises 60 cores and supports 512-bit SIMD operations. In this work, we aim at accelerating an iterative algorithm for large sparse linear systems, named PQMRCGSTAB, by using both Xeon Phi's 8-way vector operations and its dense threads. We propose three optimizations to improve the performance: data prefetching to hide the data latency, vector register reuse, and SIMD-friendly reduction. Our experimental evaluation on Xeon Phi delivers a speedup of close to a factor of 6 compared to an Intel Xeon E5-2670 octal-core CPU running the same problem.
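A SIMD-friendly reduction keeps partial sums in independent vector lanes and combines them only at the end, rather than accumulating serially. A lane-wise sketch of the idea (8 lanes, matching 512-bit vectors of 64-bit doubles, since 512 / 64 = 8):

```python
def lane_reduce(data, lanes=8):
    """Sum `data` the way a SIMD reduction would: strided partial sums
    in `lanes` independent accumulators, then one horizontal combine."""
    acc = [0.0] * lanes
    for i, x in enumerate(data):
        acc[i % lanes] += x       # each lane accumulates independently
    return sum(acc)               # final horizontal reduction

assert lane_reduce([float(i) for i in range(100)]) == 4950.0
```

Breaking the serial dependence chain this way is what lets the loop body vectorize.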


CCF National Conference on Computer Engineering and Technology | 2013

OpenACC to Intel Offload: Automatic Translation and Optimization

Cheng Chen; Canqun Yang; Tao Tang; Qiang Wu; Pengfei Zhang

Heterogeneous architectures with both conventional CPUs and coprocessors have become popular in the design of High Performance Computing systems. The programming problems on such architectures are frequently studied. The OpenACC standard was proposed to tackle the problem by employing directive-based high-level programming for coprocessors. In this paper, we take advantage of OpenACC to program the new Intel MIC coprocessor. We achieve this by automatically translating OpenACC source code to Intel Offload code. Two optimizations, communication optimization and SIMD optimization, are employed. Two kernels, matrix multiplication and JACOBI, are studied on the MIC-based platform (one Knights Corner card) and the GPU-based platform (one NVIDIA Tesla K20c card). Performance evaluation shows that both kernels deliver a speedup of approximately 3 on one Knights Corner card over one Intel Xeon E5-2670 octal-core CPU. Moreover, the two kernels achieve better performance on the MIC-based platform than on the GPU-based one.
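The translation idea can be sketched as a toy source-to-source pass: map an OpenACC compute directive to an Intel Offload directive plus a host-side parallel loop. Real translation must also handle clauses, data movement, and SIMD hints; this sketch only shows the directive mapping:

```python
import re

def translate(src):
    """Toy OpenACC-to-Offload pass: rewrite the compute directive.

    Maps `#pragma acc parallel loop` to an Intel `#pragma offload`
    region with an OpenMP loop inside. Illustrative only.
    """
    return re.sub(
        r'#pragma\s+acc\s+parallel\s+loop',
        '#pragma offload target(mic)\n#pragma omp parallel for',
        src,
    )

code = "#pragma acc parallel loop\nfor (i = 0; i < n; i++) y[i] = a*x[i] + y[i];"
out = translate(code)
assert "#pragma offload target(mic)" in out
assert "#pragma acc" not in out
```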


Advanced Materials Research | 2013

Accelerating IDCT Algorithm on Xeon Phi Coprocessor

Jin Qi; Can Qun Yang; Cheng Chen; Qiang Wu; Tao Tang

Inverse Discrete Cosine Transform (IDCT) is an important operation for image and video decompression. How to accelerate the IDCT algorithm has been frequently studied. Recently, Intel proposed the Xeon Phi coprocessor based on the Many Integrated Core (MIC) architecture. Xeon Phi integrates 61 cores with a 512-bit SIMD extension within each core, thus providing very high performance. In this paper, we employ the Knights Corner (a beta version of Xeon Phi) to accelerate the IDCT algorithm. By employing the 512-bit SIMD instructions and data pre-fetching optimization, our implementation achieves (1) an average speedup of 5.82× over the non-SIMD version, (2) an average performance benefit of 27.3% from the data pre-fetching optimization, and (3) an average speedup of 1.53× on one Knights Corner coprocessor over the implementation on one octal-core Intel Xeon E5-2670 CPU.
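The underlying transform pair is standard: the IDCT is the DCT-III, the inverse of the DCT-II used in compression. A naive 1D reference for an 8-point block (the paper's gains come from vectorizing and prefetching loops like these, not from this formulation):

```python
import math

def dct(x):
    """DCT-II with orthonormal scaling."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        c = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(c * s)
    return out

def idct(X):
    """DCT-III, the inverse of the orthonormal DCT-II above."""
    n = len(X)
    out = []
    for i in range(n):
        s = X[0] / math.sqrt(n)
        s += sum(math.sqrt(2.0 / n) * X[k] * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

block = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]  # sample 8-point row
recon = idct(dct(block))
assert all(abs(a - b) < 1e-9 for a, b in zip(block, recon))  # exact roundtrip
```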

Collaboration


Dive into Qiang Wu's collaborations.

Top Co-Authors

Canqun Yang (National University of Defense Technology)
Tao Tang (National University of Defense Technology)
Feng Wang (National University of Defense Technology)
Juan Chen (National University of Defense Technology)
Zhicai Shi (National University of Defense Technology)
Cheng Chen (National University of Defense Technology)
Huizhan Yi (National University of Defense Technology)
Jin Qi (National University of Defense Technology)
Liquan Xiao (National University of Defense Technology)