Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Canqun Yang is active.

Publication


Featured research published by Canqun Yang.


International Conference on Cluster Computing | 2010

Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing

Canqun Yang; Feng Wang; Yunfei Du; Juan Chen; Jie Liu; Huizhan Yi; Kai Lu

In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer system and the largest GPU-accelerated system attempted at the time. An adaptive optimization framework is presented to balance the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation we optimized using the adaptive optimization framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
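The adaptive balancing idea can be pictured as a simple feedback loop: time each step on the CPUs and GPUs, then nudge the work split toward the observed throughput ratio. The sketch below is a minimal illustration of that loop, not the authors' framework; the initial split, the damping factor, and the struct layout are assumptions.

```cpp
// Hypothetical runtime re-balancer (illustrative only): after each step of
// the factorization, move the CPU/GPU work split toward the throughput
// ratio measured on that step.
struct AdaptiveSplit {
    double split = 0.8;      // assumed initial guess: GPU takes 80% of the work
    double damping = 0.5;    // smoothing to avoid oscillating between steps

    // t_cpu, t_gpu: seconds each side spent on the previous step.
    void update(double t_cpu, double t_gpu) {
        double gpu_rate = split / t_gpu;            // work per second, GPU side
        double cpu_rate = (1.0 - split) / t_cpu;    // work per second, CPU side
        double balanced = gpu_rate / (gpu_rate + cpu_rate); // split equalizing times
        split = damping * split + (1.0 - damping) * balanced;
    }
};
```

With such a loop the partition tracks the machine's actual behaviour at run time, which is what lets an adaptive scheme beat a static or train-once partition when the relative CPU/GPU speed drifts with problem size.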


Journal of Computer Science and Technology | 2011

Optimizing Linpack benchmark on GPU-accelerated petascale supercomputer

Feng Wang; Canqun Yang; Yunfei Du; Juan Chen; Huizhan Yi; Weixia Xu

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted at the time. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using a two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of CPU-GPU communication, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack implementation we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
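The software pipelining mentioned above boils down to double buffering: while chunk i is being computed, chunk i+1 is already in flight across the bus. A minimal host-side sketch of that overlap follows; upload_chunk and compute_chunk are stand-ins for the real transfer and update routines, not the paper's code.

```cpp
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// Stand-ins (assumptions): "upload" plays the role of the host-to-GPU copy,
// "compute" the role of the GPU-side update on the staged data.
void upload_chunk(int i, std::vector<double>& staging) {
    staging.assign(1 << 20, double(i));                    // pretend to stage data
}
double compute_chunk(const std::vector<double>& staging) {
    return std::accumulate(staging.begin(), staging.end(), 0.0);
}

int main() {
    const int n_chunks = 8;
    std::vector<double> buf[2];                            // double buffer
    double checksum = 0.0;
    auto inflight = std::async(std::launch::async, upload_chunk, 0, std::ref(buf[0]));
    for (int i = 0; i < n_chunks; ++i) {
        inflight.wait();                                   // chunk i is now resident
        if (i + 1 < n_chunks)                              // start next transfer early
            inflight = std::async(std::launch::async, upload_chunk,
                                  i + 1, std::ref(buf[(i + 1) % 2]));
        checksum += compute_chunk(buf[i % 2]);             // overlaps the transfer above
    }
    std::printf("checksum %.0f\n", checksum);
}
```

The same structure applies whether the transfer runs on a helper thread, as here, or on an asynchronous DMA stream; what matters is that compute and copy never wait on each other for the same buffer.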


Journal of Parallel and Distributed Computing | 2013

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

Qiang Wu; Canqun Yang; Tao Tang; Liquan Xiao

Heterogeneous systems with nodes containing more than one type of computation unit, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we develop a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short-vector extensions on CPUs and multiple CUDA threads on GPUs. Using this hierarchy of parallelism together with optimizations such as intra-node communication hiding and memory optimizations on both CPUs and GPUs, we implemented and evaluated an MD simulation on the petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and benefit from the optimizations.
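The two intra-node levels of such a scheme can be sketched with standard constructs: cells of the spatial decomposition are handed to threads under dynamic scheduling, and the per-cell particle loop is left to the compiler's short-vector units. The code below is only a shape for that idea (the data layout, the toy force law, and the restriction to within-cell pairs are assumptions, not the TLPS implementation); the inter-node MPI level is omitted.

```cpp
#include <vector>

// One cell of the spatial decomposition, stored as structure-of-arrays so
// the inner loop vectorizes cleanly.
struct Cell { std::vector<double> x, y, z, fx, fy, fz; };

void compute_forces(std::vector<Cell>& cells, double cutoff2) {
    #pragma omp parallel for schedule(dynamic)        // level 2: threads over cells
    for (long c = 0; c < (long)cells.size(); ++c) {
        Cell& cell = cells[c];
        const int n = (int)cell.x.size();
        for (int i = 0; i < n; ++i) {
            double fx = 0.0, fy = 0.0, fz = 0.0;
            #pragma omp simd reduction(+:fx,fy,fz)    // level 3: SIMD over partners
            for (int j = 0; j < n; ++j) {
                double dx = cell.x[i] - cell.x[j];
                double dy = cell.y[i] - cell.y[j];
                double dz = cell.z[i] - cell.z[j];
                double r2 = dx * dx + dy * dy + dz * dz;
                if (j != i && r2 < cutoff2) {         // cutoff pair interaction
                    double s = 1.0 / r2;              // toy repulsive force law
                    fx += s * dx; fy += s * dy; fz += s * dz;
                }
            }
            cell.fx[i] += fx; cell.fy[i] += fy; cell.fz[i] += fz;
        }
    }
}
```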


Parallel Computing | 2012

Parallelizing SOR for GPGPUs using alternate loop tiling

Peng Di; Hui Wu; Jingling Xue; Feng Wang; Canqun Yang

Gauss-Seidel and SOR, which are widely used smoothers in multigrid methods, are difficult to parallelize, particularly on GPGPUs due to the existence of DOACROSS data dependences. In this paper, we present a new parallel SOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained non-conventionally, by starting from a K-layer SOR method and then parallelizing it by applying a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a loop tiling technique called alternate tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by making a better balance between data reuse and parallelism and by trading off convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain experts, compiler optimizations and performance tuning in maximizing the performance of PDE-like DOACROSS loops on GPGPUs.
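For contrast, the standard data-parallel formulation the paper measures against is red-black SOR: the grid is coloured like a checkerboard so that all points of one colour depend only on points of the other colour and can be updated simultaneously. Below is a minimal red-black sweep for the 2D Poisson problem Delta u = f on a uniform grid (a textbook sketch, not the authors' kernel).

```cpp
#include <vector>

// One red-black SOR sweep on a 5-point stencil. All points of a colour are
// independent of each other, so each colour's loop nest is data-parallel.
void rb_sor_sweep(std::vector<std::vector<double>>& u,
                  const std::vector<std::vector<double>>& f,
                  double h, double omega) {
    const int n = (int)u.size();
    for (int colour = 0; colour < 2; ++colour)        // red points, then black
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j) {
                if ((i + j) % 2 != colour) continue;
                double gs = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                    u[i][j - 1] + u[i][j + 1] - h * h * f[i][j]);
                u[i][j] += omega * (gs - u[i][j]);    // over-relaxed update
            }
}
```

The paper's method is reported to outperform this baseline by balancing data reuse against parallelism and by trading some convergence rate for SIMD parallelism.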


Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores | 2013

MIC acceleration of short-range molecular dynamics simulations

Qiang Wu; Canqun Yang; Tao Tang; Liquan Xiao

Heterogeneous systems containing accelerators such as GPUs or co-processors such as the Intel MIC are becoming more prevalent due to their ability to exploit large-scale parallelism in applications. In this paper, we develop a hierarchical parallelization scheme for molecular dynamics simulations on CPU-MIC heterogeneous systems. The scheme exploits multi-level parallelism combining (1) task-level parallelism using a tightly-coupled division method, (2) thread-level parallelism employing spatial decomposition through dynamically scheduled multi-threading, and (3) data-level parallelism via SIMD technology. By employing this hierarchy of parallelism with several optimization methods such as memory latency hiding and data pre-fetching, our MD code running on a CPU-MIC heterogeneous system (one 2.60 GHz eight-core Intel Xeon E5-2670 CPU and one 57-core Intel Knights Corner co-processor) achieves (1) a multi-thread parallel efficiency of 72.4% for 57 threads on the co-processor with up to 7.62 times SIMD speedup on each core for the force computation task, and (2) up to 2.25 times speedup on the CPU-MIC system over the pure CPU system, which outperforms our previous work on a CPU-GPU (one NVIDIA Tesla M2050) platform. Our work shows that MD simulations can benefit enormously from CPU-MIC heterogeneous platforms.
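The data pre-fetching ingredient can be illustrated in a few lines: while the contribution of neighbour j is being computed, the coordinates of the next neighbour are requested into cache. The snippet below is only illustrative (the Lennard-Jones energy term, the structure-of-arrays layout, and the neighbour-list interface are assumptions, not the paper's kernel; a fuller version would also prefetch the y and z arrays).

```cpp
#include <vector>
#include <xmmintrin.h>   // _mm_prefetch

struct Particles { std::vector<float> x, y, z; };

// Accumulate a pair energy for particle i over its neighbour list, issuing a
// software prefetch for the next neighbour's coordinates each iteration.
float pair_energy(const Particles& p, const std::vector<int>& nbrs, int i) {
    float e = 0.0f;
    for (std::size_t k = 0; k < nbrs.size(); ++k) {
        if (k + 1 < nbrs.size())
            _mm_prefetch(reinterpret_cast<const char*>(&p.x[nbrs[k + 1]]),
                         _MM_HINT_T0);                  // hide part of the memory latency
        const int j = nbrs[k];
        const float dx = p.x[i] - p.x[j];
        const float dy = p.y[i] - p.y[j];
        const float dz = p.z[i] - p.z[j];
        const float r2 = dx * dx + dy * dy + dz * dz;
        const float inv_r6 = 1.0f / (r2 * r2 * r2);
        e += 4.0f * (inv_r6 * inv_r6 - inv_r6);         // Lennard-Jones 12-6 term
    }
    return e;
}
```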


International Parallel and Distributed Processing Symposium | 2012

A Fast Parallel Implementation of Molecular Dynamics with the Morse Potential on a Heterogeneous Petascale Supercomputer

Qiang Wu; Canqun Yang; Feng Wang; Jingling Xue

Molecular Dynamics (MD) simulations have been widely used in the study of macromolecules. To ensure an acceptable level of statistical accuracy, a relatively large number of particles is needed, which calls for high-performance implementations of MD. Heterogeneous systems, with their high performance potential, low power consumption, and high price-performance ratio, offer a viable alternative for running MD simulations. In this paper we introduce a fast parallel implementation of MD simulation with the Morse potential on Tianhe-1A, a petascale heterogeneous supercomputer. Our code achieves a speedup of 3.6× on one NVIDIA Tesla M2050 GPU (containing 14 Streaming Multiprocessors) compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, our code runs faster on 1024 compute nodes (with two CPUs and one GPU inside a node) than on 4096 GPU-excluded nodes, effectively rendering one GPU more efficient than six six-core CPUs. Our work shows that large-scale MD simulations can benefit enormously from GPU acceleration on petascale supercomputing platforms. Our performance results are achieved by using (1) a patch-cell design to exploit parallelism across the simulation domain, (2) a new GPU kernel that takes advantage of Newton's Third Law to reduce redundant force computation on GPUs, and (3) two optimization methods: a dynamic load balancing strategy that adjusts the workload, and a communication overlapping method that overlaps the communications between CPUs and GPUs.
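Point (2) above, halving the force work with Newton's Third Law, is simple to see on the CPU: evaluate each i<j pair once and scatter the result to both atoms. The sketch below does this for the Morse potential V(r) = D(1 - e^{-a(r - r0)})^2 in its naive all-pairs form; the data layout and parameters are assumptions, and the GPU kernel in the paper additionally has to manage the concurrent writes this scatter creates.

```cpp
#include <cmath>
#include <vector>

struct Atoms { std::vector<double> x, y, z, fx, fy, fz; };

// All-pairs Morse forces with Newton's Third Law: each pair is evaluated
// once and contributes equal and opposite forces to both atoms.
void morse_forces(Atoms& a, double D, double alpha, double r0) {
    const int n = (int)a.x.size();
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {                  // each pair visited once
            const double dx = a.x[i] - a.x[j];
            const double dy = a.y[i] - a.y[j];
            const double dz = a.z[i] - a.z[j];
            const double r  = std::sqrt(dx * dx + dy * dy + dz * dz);
            const double e  = std::exp(-alpha * (r - r0));
            const double dVdr = 2.0 * D * alpha * e * (1.0 - e); // d/dr of D(1-e)^2
            const double s = -dVdr / r;                    // vector force scale on atom i
            a.fx[i] += s * dx; a.fy[i] += s * dy; a.fz[i] += s * dz;
            a.fx[j] -= s * dx; a.fy[j] -= s * dy; a.fz[j] -= s * dz; // reaction force
        }
}
```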


International Conference on Computer Science and Education | 2012

Constant memory optimizations in MD5 Crypt cracking algorithm on GPU-accelerated supercomputer using CUDA

Feng Wang; Canqun Yang; Qiang Wu; Zhicai Shi

MD5 Crypt is a cryptographic algorithm commonly used in UNIX systems for authentication. Through the additional randomization of the salt and the complexity of the scheme, it renders traditional password-cracking techniques impractical on common computing systems, thereby securing the system. With the recent rise of petaflops heterogeneous supercomputer systems such as Tianhe-1A, however, the security of MD5 Crypt once again faces the threat of brute-force attack. Much work has been done on GPU-accelerated platforms to improve the performance of MD5 Crypt cracking, but little improvement has been achieved by using the constant memory of the CUDA architecture. This paper explores this problem and achieves a 44.6% improvement by allocating the padding array in constant memory. It also presents a highly scalable implementation of a brute-force attack on MD5 Crypt on Tianhe-1A, the fastest heterogeneous supercomputer in the world at the time. The experimental results show that 326 thousand MD5 hashes can be checked per second on a single compute node, outperforming the CPU version by 5.7×. On multiple nodes, the implementation also shows great scalability. Consequently, this work poses a new challenge to the security of MD5 Crypt for authentication.
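The constant-memory optimization referred to above exploits the fact that the MD5 padding table is small, read-only, and read identically by many threads, which is exactly the access pattern CUDA's constant cache serves efficiently. The following is a minimal CUDA C++ sketch of that placement (the symbol names and the toy kernel are assumptions, not the paper's cracking code); it needs nvcc to build.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Read-only padding table resident in constant memory, so a warp reading the
// same entry is served by a broadcast from the constant cache instead of
// global-memory traffic.
__constant__ unsigned char c_padding[64];

// Toy kernel: append padding bytes to a 64-byte message block.
__global__ void pad_block(unsigned char* block, int msg_len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pad_len = 64 - msg_len;
    if (i < pad_len)
        block[msg_len + i] = c_padding[i];
}

int main() {
    unsigned char h_padding[64];
    std::memset(h_padding, 0, sizeof(h_padding));
    h_padding[0] = 0x80;                     // MD5-style padding starts with 0x80,
                                             // then zeros (length field omitted here)
    cudaMemcpyToSymbol(c_padding, h_padding, sizeof(h_padding));

    unsigned char* d_block = nullptr;
    cudaMalloc(&d_block, 64);
    cudaMemset(d_block, 0, 64);
    pad_block<<<1, 64>>>(d_block, 20);       // pad a 20-byte candidate block
    cudaDeviceSynchronize();
    cudaFree(d_block);
    return 0;
}
```

The same structure applies to any small lookup table that a cracking kernel reads uniformly across threads.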


Tsinghua Science & Technology | 2012

Fast parallel cutoff pair interactions for molecular dynamics on heterogeneous systems

Qiang Wu; Canqun Yang; Tao Tang; Kai Lu

Heterogeneous systems with both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-range Molecular Dynamics (MD) simulations. The most time-consuming task in short-range MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the number of distance checks, previous work tiles interactions by exploiting this spatial attribute, which increases memory accesses and GPU computations and hence decreases performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which has poor scalability. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to their spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrices. Only the sub-matrices between neighboring voxels are computed on the GPU. The algorithm therefore reduces the distance-check operations while limiting the additional memory accesses and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multiple nodes of Tianhe-1A. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16× on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41× faster on 256 compute nodes of Tianhe-1A (with two CPUs and one GPU inside a node) than on 256 GPU-excluded nodes.
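The voxel idea above is essentially a cell list: particles are binned by position into voxels whose edge is at least the cutoff, and pair search is restricted to a voxel and its neighbouring voxels, so most distance checks between far-apart particles are never issued. A minimal host-side sketch follows (the data layout and interface are assumptions, not the paper's GPU data structures).

```cpp
#include <vector>

// Bin particles into voxels and enumerate voxel/neighbour-voxel tiles.
struct VoxelGrid {
    int nx, ny, nz;                        // voxels per dimension
    double edge;                           // voxel edge length, >= cutoff
    std::vector<std::vector<int>> bins;    // particle indices per voxel

    VoxelGrid(int nx, int ny, int nz, double edge)
        : nx(nx), ny(ny), nz(nz), edge(edge), bins((size_t)nx * ny * nz) {}

    int index(double x, double y, double z) const {
        int ix = (int)(x / edge), iy = (int)(y / edge), iz = (int)(z / edge);
        return (iz * ny + iy) * nx + ix;
    }
    void insert(int particle, double x, double y, double z) {
        bins[index(x, y, z)].push_back(particle);
    }

    // Call f(cell, neighbour) for each voxel paired with each of its up to
    // 27 neighbours (itself included); only these tiles need distance checks.
    template <class F>
    void for_each_tile(F&& f) const {
        for (int iz = 0; iz < nz; ++iz)
        for (int iy = 0; iy < ny; ++iy)
        for (int ix = 0; ix < nx; ++ix)
            for (int dz = -1; dz <= 1; ++dz)
            for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int jx = ix + dx, jy = iy + dy, jz = iz + dz;
                if (jx < 0 || jy < 0 || jz < 0 || jx >= nx || jy >= ny || jz >= nz)
                    continue;
                f(bins[(iz * ny + iy) * nx + ix], bins[(jz * ny + jy) * nx + jx]);
            }
    }
};
```

On the GPU, each such voxel-versus-voxel tile becomes a small dense sub-matrix of interactions, which keeps memory access regular while still skipping the vast majority of out-of-range pairs.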


Neurocomputing | 2018

A hybrid deep learning CNN–ELM for age and gender classification

Mingxing Duan; Kenli Li; Canqun Yang; Keqin Li

Automatic age and gender classification has been widely used in a large number of applications, particularly in human-computer interaction, biometrics, visual surveillance, electronic customer, and commercial applications. In this paper, we introduce a hybrid structure that includes a Convolutional Neural Network (CNN) and an Extreme Learning Machine (ELM), integrating the synergy of the two classifiers to deal with age and gender classification. The hybrid architecture makes the most of their respective advantages: the CNN is used to extract features from the input images while the ELM classifies the intermediate results. We not only give a detailed deployment of our structure, including the design of parameters and layers, an analysis of the hybrid architecture, and the derivation of back-propagation in this system during the iterations, but also adopt several measures to limit the risk of overfitting. Two popular datasets, MORPH-II and the Adience benchmark, are then used to verify our hybrid structure. Experimental results show that our hybrid architecture outperforms other studies on the same datasets, exhibiting significant improvements in accuracy and efficiency.
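An Extreme Learning Machine read-out, on which such a hybrid builds, is cheap to fit: features are pushed through a fixed random hidden layer and only the output weights are solved in closed form, beta = (H^T H + lambda*I)^{-1} H^T T. The sketch below shows that standard ELM solve on plain arrays; the hidden size, activation, ridge term, and treating the CNN features as given are all assumptions, and it is not the paper's training procedure, which also derives back-propagation through the combined system.

```cpp
#include <cmath>
#include <random>
#include <vector>

using Mat = std::vector<std::vector<double>>;

Mat matmul(const Mat& A, const Mat& B) {
    Mat C(A.size(), std::vector<double>(B[0].size(), 0.0));
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t k = 0; k < B.size(); ++k)
            for (size_t j = 0; j < B[0].size(); ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Mat transpose(const Mat& A) {
    Mat T(A[0].size(), std::vector<double>(A.size()));
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < A[0].size(); ++j) T[j][i] = A[i][j];
    return T;
}

// Gauss-Jordan solve of A X = B for the small ridge system (no pivoting
// needed: the ridge term keeps the pivots positive).
Mat solve(Mat A, Mat B) {
    const size_t n = A.size(), m = B[0].size();
    for (size_t k = 0; k < n; ++k) {
        const double piv = A[k][k];
        for (size_t j = 0; j < n; ++j) A[k][j] /= piv;
        for (size_t j = 0; j < m; ++j) B[k][j] /= piv;
        for (size_t i = 0; i < n; ++i) {
            if (i == k) continue;
            const double f = A[i][k];
            for (size_t j = 0; j < n; ++j) A[i][j] -= f * A[k][j];
            for (size_t j = 0; j < m; ++j) B[i][j] -= f * B[k][j];
        }
    }
    return B;
}

// X: n x d feature vectors (here, assumed to come from the CNN),
// T: n x c one-hot labels. Returns the hidden-to-output weights beta.
Mat elm_train(const Mat& X, const Mat& T, int hidden, double lambda) {
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    Mat W(X[0].size(), std::vector<double>(hidden));
    for (auto& row : W)
        for (auto& w : row) w = gauss(rng);                // fixed, never trained
    Mat H = matmul(X, W);
    for (auto& row : H)
        for (auto& h : row) h = std::tanh(h);              // hidden activations
    Mat Ht = transpose(H);
    Mat A = matmul(Ht, H);                                  // H^T H + lambda I
    for (size_t i = 0; i < A.size(); ++i) A[i][i] += lambda;
    return solve(A, matmul(Ht, T));                         // output weights beta
}
```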


Computing | 2017

LU factorization on heterogeneous systems: an energy-efficient approach towards high performance

Cheng Chen; Jianbin Fang; Tao Tang; Canqun Yang

Dense lower-upper (LU) factorization (hereafter referred to as LU) is a critical kernel widely used to solve dense linear algebra problems. Hybrid LU algorithms have been well designed to exploit the full capacity of heterogeneous systems. However, existing heterogeneous implementations are typically CPU-centric: they rely heavily on CPU cores and suffer from a large volume of data transfers over the PCIe bus, which reduces the overall energy efficiency of the entire computer system. In this paper, we provide a coprocessor-resident implementation of LU for a heterogeneous platform that improves energy efficiency by relieving the CPUs of heavy computations and avoiding excessive data transfers over PCIe. To maintain performance, we apply optimizations that pipeline the CPU computation, coprocessor computation, MPI communication, and PCIe transfers between the CPUs and coprocessors. Experiments on the Tianhe-2 supercomputer show that our LU implementation can compete with the highly optimized Intel MKL implementation in performance while overcoming its limitations in energy efficiency.
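For reference, the computational shape of LU is simple: at step k a column of multipliers is formed (the panel) and then the trailing sub-matrix is updated, and it is this trailing update, holding almost all of the FLOPs, that hybrid schemes place on the accelerator or coprocessor. Below is a minimal in-place right-looking LU with partial pivoting, just to fix that structure; it is a textbook sketch, not the coprocessor-resident implementation.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// In-place LU with partial pivoting: on return A holds L (unit lower, below
// the diagonal) and U (upper), and the returned vector is the row permutation.
std::vector<int> lu_factor(Matrix& A) {
    const int n = (int)A.size();
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    for (int k = 0; k < n; ++k) {
        int p = k;                                        // pick the largest pivot
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i][k]) > std::fabs(A[p][k])) p = i;
        std::swap(A[k], A[p]);
        std::swap(perm[k], perm[p]);
        for (int i = k + 1; i < n; ++i) {
            A[i][k] /= A[k][k];                           // panel: column of L
            for (int j = k + 1; j < n; ++j)
                A[i][j] -= A[i][k] * A[k][j];             // trailing-matrix update:
        }                                                 // the part worth offloading
    }
    return perm;
}
```

In the coprocessor-resident setting described above, keeping this trailing update and its data on the coprocessor is what avoids the heavy PCIe traffic of CPU-centric schemes.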

Collaboration


Dive into Canqun Yang's collaborations.

Top Co-Authors

Feng Wang, National University of Defense Technology
Tao Tang, National University of Defense Technology
Yunfei Du, National University of Defense Technology
Juan Chen, National University of Defense Technology
Jianbin Fang, National University of Defense Technology
Chun Huang, National University of Defense Technology
Huizhan Yi, National University of Defense Technology
Cheng Chen, National University of Defense Technology
Kejia Zhao, National University of Defense Technology
Qiang Wu, National University of Defense Technology