Changjun Hu
University of Science and Technology Beijing
Publications
Featured research published by Changjun Hu.
Computer Physics Communications | 2017
Changjun Hu; Xianmeng Wang; Jianjiang Li; Xinfu He; Shigang Li; Yangde Feng; Shaofeng Yang; He Bai
To optimize short-range force computations in Molecular Dynamics (MD) simulations, this paper presents multi-threading and SIMD optimizations. For multi-threading, a Partition-and-Separate-Calculation (PSC) method is designed to avoid the write conflicts caused by applying Newton's third law; serial bottlenecks are eliminated with no additional memory usage. The method is implemented using the OpenMP model and is further employed on Intel Xeon Phi coprocessors in both native and offload modes. We also evaluate the performance of the PSC method under different thread affinities on the MIC architecture. For the SIMD execution, we analyze how the "if-clause" of the cutoff-radius check affects performance in the PSC method. Experimental results show that the PSC method is more efficient than traditional methods; in double precision, our 256-bit SIMD implementation is about 3 times faster than the scalar version.
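The abstract does not give the PSC algorithm itself, so the following is only a minimal sketch of one standard way to get conflict-free force accumulation while keeping Newton's third law: particles are grouped into blocks at least as wide as the cutoff, and same-colored blocks (three colors in 1-D) are processed in separate parallel phases so that no two concurrently running threads write to the same block. All names and parameters here are illustrative, not from the paper.

```cpp
// Sketch: phase-based (colored) partitioning for conflict-free Newton's-
// third-law force accumulation in a 1-D toy MD system. Not the paper's
// exact PSC scheme; it only illustrates the write-conflict problem PSC solves.
#include <vector>
#include <cstdio>
#include <omp.h>

int main() {
    const int nBlocks = 12, blockSize = 64, n = nBlocks * blockSize;
    const double cutoff = 1.0, spacing = cutoff / blockSize; // block width == cutoff
    std::vector<double> x(n), f(n, 0.0);
    for (int i = 0; i < n; ++i) x[i] = i * spacing;

    auto pairForce = [](double r) { return 1.0 / (r * r); }; // toy potential

    // A thread working on block b writes only into blocks b and b+1, so
    // blocks of the same color (b % 3) can safely run in parallel.
    for (int color = 0; color < 3; ++color) {
        #pragma omp parallel for schedule(dynamic)
        for (int b = color; b < nBlocks; b += 3) {
            for (int i = b * blockSize; i < (b + 1) * blockSize; ++i) {
                for (int j = i + 1; j < n && x[j] - x[i] < cutoff; ++j) {
                    double fij = pairForce(x[j] - x[i]);
                    f[i] += fij;   // Newton's third law: each pair is
                    f[j] -= fij;   // evaluated once, applied to both atoms
                }
            }
        }
    }
    printf("f[0] = %f\n", f[0]);
}
```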
Science in China Series F: Information Sciences | 2015
Shigang Li; Changjun Hu; JunChao Zhang; YunQuan Zhang
To achieve good performance and scalability, parallel applications must be carefully optimized to exploit intra-node parallelism and reduce inter-node communication on multicore clusters. This paper investigates the automatic tuning of the sparse matrix-vector multiplication (SpMV) kernel implemented in a partitioned global address space language, which supports a hybrid thread- and process-based communication layer for multicore systems. One-sided communication is used for inter-node data exchange, while intra-node communication uses a mix of process-shared memory and multithreading. We develop performance models to facilitate selecting the best configuration of thread/process hybridization as well as the best communication pattern for SpMV. As a result, our tuned SpMV in the hybrid runtime environment consumes less memory and reduces inter-node communication volume without sacrificing data locality. Experiments are conducted on 12 real sparse matrices. On 16-node Xeon and 8-node Opteron clusters, our tuned SpMV kernel achieves on average 1.4X and 1.5X performance improvements, respectively, over a well-optimized process-based message-passing implementation.
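The intra-node building block that such tuning applies to is an ordinary threaded CSR SpMV kernel; the paper's contribution lies in choosing the thread/process mix and communication pattern around it. A minimal sketch of that node-local kernel only (the UPC runtime and one-sided communication are omitted, and all names are illustrative):

```cpp
// Node-local CSR SpMV: y = A*x with rows split across OpenMP threads.
#include <vector>
#include <cstdio>
#include <omp.h>

void spmv_csr(int nRows, const std::vector<int>& rowPtr,
              const std::vector<int>& colIdx,
              const std::vector<double>& val,
              const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nRows; ++i) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colIdx[k]];   // gather through column indices
        y[i] = sum;
    }
}

int main() {
    // 2x2 example: [[4, 1], [0, 3]] in CSR form
    std::vector<int> rowPtr = {0, 2, 3}, colIdx = {0, 1, 1};
    std::vector<double> val = {4, 1, 3}, x = {1, 2}, y(2);
    spmv_csr(2, rowPtr, colIdx, val, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]);  // expected: y = [6, 6]
}
```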
Science in China Series F: Information Sciences | 2010
Jue Wang; Changjun Hu; Jilin Zhang; Jianjiang Li
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in its ease of use and incremental programming, message passing is still the most widely used programming model for distributed memory architectures today. How to effectively extend OpenMP to distributed memory architectures has been a hot research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the "partially replicating shared arrays" memory model, we propose an algorithm for shared-array recognition based on inter-procedural analysis, an optimization technique based on producer/consumer relationships, and a communication-generation technique for nonlinear references. We evaluate the performance on nine benchmarks covering computational fluid dynamics, integer sorting, molecular dynamics, earthquake simulation, and computational chemistry. The average scalability achieved by the KLCoMP versions is close to that achieved by the MPI versions. We compare the performance of our translated programs with that of versions generated for Omni+SCASH, LLCoMP, and OpenMP(Purdue), and find that parallel applications (especially irregular applications) translated by KLCoMP achieve better performance than the other versions.
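As a rough illustration of what such an OpenMP-to-message-passing translation might emit (not KLCoMP's actual output; the process layout, ghost-cell width, and loop body are all assumed here), consider a 1-D stencil loop under the "partially replicating shared arrays" model: each process owns a block of the shared array and replicates only the boundary elements it reads, and the producer/consumer analysis determines where they must be refreshed:

```cpp
// Hedged illustration of translator-style output for a 1-D stencil loop.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nLocal = 1000;                 // owned block of the shared array
    std::vector<double> a(nLocal + 2, 1.0);  // +2 partially replicated cells
    std::vector<double> b(nLocal + 2, 0.0);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Communication the producer/consumer analysis would generate:
    // refresh the replicated boundary elements before they are read.
    MPI_Sendrecv(&a[nLocal], 1, MPI_DOUBLE, right, 0,
                 &a[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&a[1],          1, MPI_DOUBLE, left,  1,
                 &a[nLocal + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int i = 1; i <= nLocal; ++i)        // owner-computes loop body
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    if (rank == 0) printf("b[1] = %f\n", b[1]);
    MPI_Finalize();
}
```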
network and parallel computing | 2008
Changjun Hu; Yewei Shao; Jue Wang; Jianjiang Li
Message passing is the predominant programming paradigm for distributed memory systems. RDMA networks like InfiniBand and Myrinet reduce communication overhead by overlapping communication with computation. To make this overlap more effective, we propose a source-to-source transformation scheme that automatically restructures message-passing codes. Our extensions to the control-flow graph allow the message-passing program to be analyzed accurately and data-flow analysis to be performed effectively. This analysis identifies the minimal region between producer and consumer that contains message-passing function calls. Using inter-procedural data-flow analysis, the transformation scheme enables the overlap of communication with computation. Experiments on the well-known NAS Parallel Benchmarks show that, on distributed memory systems, versions employing communication-computation overlap are faster than the original programs.
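The transformation the paper automates can be shown by hand on a small MPI fragment: blocking calls are replaced with nonblocking ones, and computation that does not touch the message buffers is moved between the post and the wait. A minimal sketch (buffer sizes and the pairing of processes are illustrative; run with an even number of ranks):

```cpp
// Communication-computation overlap: post early, compute, wait late.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;                        // pair ranks 0<->1, 2<->3, ...

    std::vector<double> sendBuf(1 << 20, rank), recvBuf(1 << 20);
    MPI_Request reqs[2];

    // Post communication early (the "producer" point of the region)...
    MPI_Irecv(recvBuf.data(), (int)recvBuf.size(), MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendBuf.data(), (int)sendBuf.size(), MPI_DOUBLE, peer, 0,
              MPI_COMM_WORLD, &reqs[1]);

    // ...and overlap it with computation that does not touch the buffers.
    double local = 0.0;
    for (int i = 0; i < 1000000; ++i) local += 1e-6 * i;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  // the "consumer" point
    if (rank == 0) printf("local=%f recv[0]=%f\n", local, recvBuf[0]);
    MPI_Finalize();
}
```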
parallel computing in electrical engineering | 2006
Changjun Hu; Jing Li; Jue Wang; Yonghong Li; Liang Ding; Jianjiang Li
Irregular computations significantly influence the performance of large-scale parallel applications. Efficiently generating local memory access sequences and communication sets for irregular parallel applications is an important issue in compiling a data-parallel language into single program multiple data (SPMD) code for distributed-memory machines. In this paper, we propose a hybrid approach that combines the advantages of the algebraic method and the integer-lattice method. Our approach derives an algebraic solution for communication-set enumeration at compile time for irregular array references in nested loops. Based on the integer lattice, we develop a method for global-to-local and local-to-global index translation in the case of alignment and cyclic(k) distribution. We then present an algorithm for the corresponding SPMD code generation, which adopts several communication optimization techniques. In our method, when the parameters are known, the communication-set generation, the global-to-local and local-to-global index translations, and the SPMD code generation can all be completed at compile time, without a run-time inspector phase.
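The index translations the method builds on can be written in closed form. A minimal sketch of global-to-local and local-to-global translation under a cyclic(k) distribution over P processes (the standard block-cyclic formulation, not code from the paper):

```cpp
// cyclic(k) over P processes: consecutive blocks of k elements go to
// processes 0, 1, ..., P-1, 0, 1, ...
#include <cstdio>

struct Owner { int proc, local; };

Owner globalToLocal(int g, int k, int P) {
    int block = g / k;                       // which k-element block g is in
    return { block % P, (block / P) * k + g % k };
}

int localToGlobal(int proc, int l, int k, int P) {
    return ((l / k) * P + proc) * k + l % k; // exact inverse of the above
}

int main() {
    const int k = 4, P = 3;
    for (int g = 0; g < 24; ++g) {           // round-trip consistency check
        Owner o = globalToLocal(g, k, P);
        if (localToGlobal(o.proc, o.local, k, P) != g) printf("bug at %d\n", g);
    }
    Owner o = globalToLocal(17, k, P);
    printf("g=17 -> proc %d, local %d\n", o.proc, o.local); // proc 1, local 5
}
```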
Computer Physics Communications | 2017
Changjun Hu; He Bai; Xinfu He; Boyao Zhang; Ningming Nie; Xianmeng Wang; Yingwen Ren
Material irradiation effects are among the key issues in the use of nuclear power. However, the lack of high-throughput irradiation facilities and of knowledge about the evolution process has limited understanding of these issues. With the help of high-performance computing, we can gain further insight into materials at the micro level. In this paper, a new data structure is proposed for the massively parallel simulation of the evolution of metal materials in an irradiation environment. Based on the proposed data structure, we developed a new molecular dynamics code named Crystal MD. In our test cases, simulations with Crystal MD achieved over 90% parallel efficiency, and on multi-core clusters Crystal MD uses more than 25% less memory than LAMMPS and IMD, two popular molecular dynamics simulation packages. Using Crystal MD, a two-trillion-particle simulation has been performed on the Tianhe-2 cluster.
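The abstract does not describe the data structure itself, so the following is only a guess at the general idea behind lattice-indexed storage for near-perfect crystals: atoms are addressed by lattice-site index and neighbors are found by index arithmetic rather than through explicit per-atom neighbor lists, which is one way such a memory saving over neighbor-list codes could arise. Every name and field here is an assumption for illustration:

```cpp
// Guessed illustration: site-indexed atom storage with implicit neighbors.
#include <vector>
#include <cstdio>

struct Atom { float dx, dy, dz; short type; }; // displacement from own site

int main() {
    const int nx = 100, ny = 100, nz = 100;    // lattice dimensions
    std::vector<Atom> lattice(nx * ny * nz);   // site index encodes position

    // Neighbors come from index arithmetic, not from stored lists:
    auto site = [=](int i, int j, int k) { return (i * ny + j) * nz + k; };
    int center   = site(50, 50, 50);
    int neighbor = site(51, 50, 50);           // +x neighbor, no list lookup
    lattice[center].type = 1;                  // e.g., mark a defect atom
    printf("center=%d neighbor=%d bytes/atom=%zu\n",
           center, neighbor, sizeof(Atom));
}
```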
Future Generation Computer Systems | 2010
Jue Wang; Changjun Hu; Jilin Zhang; Jianjiang Li
For many parallel applications on distributed memory systems, array re-decomposition is usually required to enhance data locality and reduce communication overhead. How to effectively schedule messages to improve the performance of array re-decomposition has received much attention in recent years. This paper develops efficient scheduling algorithms that use the compile-time information provided by array distribution patterns, array alignment patterns, and the periodic property of array accesses. Our algorithms not only avoid inter-processor contention but also reduce the real communication cost and the communication generation time. The experimental results show that the performance of array re-decomposition can be significantly improved using our algorithms.
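One classic contention-free schedule, shown only as background (the paper's algorithms additionally exploit distribution, alignment, and periodicity information), has every process exchange with partner (rank + phase) % P in each phase, so each process sends and receives exactly one message per phase:

```cpp
// Print a contention-free all-to-all message schedule for P processes:
// in every phase the mapping rank -> (rank + phase) % P is a permutation,
// so no two senders target the same receiver.
#include <cstdio>

int main() {
    const int P = 4;                          // number of processes
    for (int phase = 1; phase < P; ++phase) {
        printf("phase %d:", phase);
        for (int rank = 0; rank < P; ++rank)
            printf("  %d->%d", rank, (rank + phase) % P);
        printf("\n");
    }
}
```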
international conference on parallel architectures and compilation techniques | 2007
Changjun Hu; Jilin Zhang; Jue Wang; Jianjiang Li; Liang Ding
To take advantage of supercomputing resources with multiple processors, several parallel versions of the Gauss-Seidel (SOR) method have been proposed. In the present study, a new parallel Gauss-Seidel algorithm is developed, based on domain decomposition and alternate tiling of the convergence iteration space, for solving the systems of linear equations that arise from finite-difference discretizations of partial differential equations. The goal of this method is to improve three performance aspects: inter-iteration data locality, intra-iteration data locality, and parallelism. Intra-iteration locality refers to cache locality from data reuse within a convergence iteration, and inter-iteration locality refers to cache locality from data reuse between convergence iterations.
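For background, a standard way to expose parallelism in Gauss-Seidel is red-black ordering, sketched below for a 1-D Poisson problem; the paper's alternate-tiling scheme goes further by also optimizing cache reuse within and across convergence iterations. This sketch is illustrative, not the paper's algorithm:

```cpp
// Red-black Gauss-Seidel for u'' = 0 on a 1-D grid with fixed boundaries.
#include <vector>
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1000, iters = 500;
    std::vector<double> u(n + 2, 0.0);
    u[0] = 1.0; u[n + 1] = 2.0;                   // boundary conditions

    for (int it = 0; it < iters; ++it) {
        for (int color = 0; color < 2; ++color) { // red sweep, then black
            #pragma omp parallel for
            for (int i = 1 + color; i <= n; i += 2)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]); // same-color updates
        }                                           // are independent
    }
    printf("u[n/2] = %f\n", u[n / 2]);
}
```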
international conference on move to meaningful internet systems | 2007
Changjun Hu; Guangli Yao; Jue Wang; Jianjiang Li
In adaptive irregular out-of-core applications, communication and bulk disk I/O operations occupy a large portion of the overall execution time. This paper presents a program transformation scheme that enables the overlap of communication, computation, and disk I/O in this kind of application. We take programs in the inspector-executor model as the starting point and transform them into a pipelined form. By decomposing the inspector phase and reordering iterations, more overlap opportunities are exploited. In our experiments, the techniques are applied to two important applications, a partial differential equation solver and a molecular dynamics problem. For these applications, versions employing our techniques are almost 30% faster than the inspector-executor versions.
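A minimal sketch of the pipelining idea (the decomposition of the inspector phase and the iteration reordering are not shown): while chunk i is being computed, the inspector work and disk reads for chunk i+1 run asynchronously, with std::async standing in here for the overlapped I/O and communication:

```cpp
// Software pipeline: prefetch chunk i+1 while computing on chunk i.
#include <future>
#include <vector>
#include <numeric>
#include <cstdio>

std::vector<double> loadChunk(int c) {           // stand-in for disk I/O +
    return std::vector<double>(1000, c + 1.0);   // inspector phase of chunk c
}

int main() {
    const int nChunks = 8;
    double total = 0.0;
    auto next = std::async(std::launch::async, loadChunk, 0);
    for (int c = 0; c < nChunks; ++c) {
        std::vector<double> data = next.get();   // wait for prefetched chunk
        if (c + 1 < nChunks)                     // overlap the next load
            next = std::async(std::launch::async, loadChunk, c + 1);
        total += std::accumulate(data.begin(), data.end(), 0.0); // compute
    }
    printf("total = %f\n", total);
}
```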
international workshop on openmp | 2007
Jue Wang; Changjun Hu; Jilin Zhang; Jianjiang Li
Many researchers have focused on developing techniques for the situation where data arrays are indexed through indirection arrays. However, these techniques may be ineffective for nonlinear indexing. In this paper, we propose extensions to OpenMP directives aimed at executing irregular OpenMP codes, including those with nonlinear indexing, efficiently in parallel. Furthermore, we present several optimization techniques for irregular computing, including the generation of communication sets and SPMD code, a communication scheduling strategy, and a low-overhead locality transformation scheme. Finally, experimental results are presented to validate our extensions and optimization techniques.
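The abstract does not give the extended directive syntax, so the sketch below only shows the kind of loop the extensions target: a data array indexed through an indirection array filled by a nonlinear expression, which plain OpenMP must guard with atomics (the nonlinear index function is an assumption for illustration):

```cpp
// Baseline irregular reduction: writes collide through idx[], so a plain
// parallel-for needs atomic updates; this is the pattern directive
// extensions and locality transformations aim to handle more efficiently.
#include <vector>
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 16, m = 1 << 10;
    std::vector<int> idx(n);
    std::vector<double> x(n, 1.0), y(m, 0.0);
    for (int i = 0; i < n; ++i)
        idx[i] = (int)((1LL * i * i) % m);  // nonlinear indexing (64-bit safe)

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        y[idx[i]] += x[i];                  // conflicting irregular writes
    }
    printf("y[0] = %f\n", y[0]);
}
```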