Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yutong Lu is active.

Publication


Featured research published by Yutong Lu.


Frontiers of Computer Science in China | 2014

MilkyWay-2 supercomputer: system and application

Xiangke Liao; Liquan Xiao; Canqun Yang; Yutong Lu

On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes that integrate commodity off-the-shelf processors and accelerators sharing a similar instruction set architecture, a powerful network that employs proprietary interconnect chips to support massively parallel message-passing communication, a proprietary 16-core processor designed for scientific computing, and an efficient software stack that provides a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed on the system.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2013

A peta-scalable CPU-GPU algorithm for global atmospheric simulations

Chao Yang; Wei Xue; Haohuan Fu; Lin Gan; Linfeng Li; Yangtong Xu; Yutong Lu; Jiachang Sun; Guangwen Yang; Weimin Zheng

Developing highly scalable algorithms for global atmospheric modeling is becoming increasingly important as scientists seek to understand the behavior of the global atmosphere at extreme scales. Heterogeneous architectures based on both processors and accelerators are becoming an important solution for large-scale computing. However, large-scale simulation of the global atmosphere poses a severe challenge to the development of highly scalable algorithms that fit well into state-of-the-art heterogeneous systems. Although successes have been achieved with GPU-accelerated computing in some top-level applications, studies that fully exploit heterogeneous architectures in global atmospheric modeling are still rare, due in large part to both the computational difficulties of the mathematical models and the requirement of high accuracy for long-term simulations. In this paper, we propose a peta-scalable hybrid algorithm that is successfully applied in a cubed-sphere shallow-water model in global atmospheric simulations. We employ an adjustable partition between CPUs and GPUs to achieve a balanced utilization of the entire hybrid system, and present a pipe-flow scheme to conduct conflict-free inter-node communication on the cubed-sphere geometry and to maximize communication-computation overlap. Systematic optimizations for multithreading on both the GPU and CPU sides are performed to enhance computing throughput and improve memory efficiency. Our experiments demonstrate nearly ideal strong and weak scalability on up to 3,750 nodes of Tianhe-1A. The largest run sustains a performance of 0.8 Pflops in double precision (32% of the peak performance), using 45,000 CPU cores and 3,750 GPUs.
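The adjustable CPU-GPU partition described in the abstract can be reduced to a proportional split of the subdomain by measured throughput. A minimal sketch, assuming illustrative placeholder rates rather than Tianhe-1A measurements:

```python
def split_subdomain(n_cells, cpu_rate, gpu_rate):
    """Assign grid cells to CPU and GPU in proportion to measured throughput."""
    gpu_cells = round(n_cells * gpu_rate / (cpu_rate + gpu_rate))
    return n_cells - gpu_cells, gpu_cells

# Illustrative rates (cells per ms); in practice they would be measured per step.
cpu_part, gpu_part = split_subdomain(1024, cpu_rate=120.0, gpu_rate=520.0)
print(cpu_part, gpu_part)  # 192 832
```

In the spirit of the paper's adjustable partition, the ratio would be re-tuned after timing each step until both sides finish together.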


IEEE Transactions on Computers | 2015

Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2

Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Junfeng Liao; Lin Gan; Yutong Lu; Rajiv Ranjan; Lizhe Wang

In this work, an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of the different computing units. Proper communication-computation overlap and concurrent data transfer methods are utilized to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak-scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on increasingly dominant heterogeneous supercomputers.


International Parallel and Distributed Processing Symposium | 2014

Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2

Wei Xue; Chao Yang; Haohuan Fu; Xinliang Wang; Yangtong Xu; Lin Gan; Yutong Lu; Xiaoqian Zhu

This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer, developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partitioning between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between the CPU and accelerators. The acceleration of our global atmospheric model is further tuned to improve its use of the Intel MIC architecture. For the single-node test on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we achieve 2.07×, 3.18×, and 4.35× speedups when using one, two, and three Intel Xeon Phi accelerators, respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5× (out of the 8× theoretical case). Based on successful computation-communication overlapping, large-scale tests indicate that a nearly ideal weak-scaling efficiency of 93.5% is obtained when we gradually increase the number of nodes from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 5,664 × 6 mesh with 77.6 billion unknowns.


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices

Jongsoo Park; Mikhail Smelyanskiy; Karthikeyan Vaidyanathan; Alexander Heinecke; Dhiraj D. Kalamkar; Xing Liu; Md. Mostofa Ali Patwary; Yutong Lu; Pradeep Dubey

A new sparse high-performance conjugate gradient benchmark (HPCG) has recently been released to address challenges in the design of sparse linear solvers for the next generation of extreme-scale computing systems. The key computation, data access, and communication patterns in HPCG represent building blocks commonly found in today's HPC applications. While efficiently parallelizing the Gauss-Seidel smoother, the most time-consuming kernel in HPCG, is a well-known challenge, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of the Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on the Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations benefit not only the original HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
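The parallelism that multi-color reordering exposes can be illustrated with the simplest case, a red-black Gauss-Seidel sweep on a 1-D Poisson stencil. This is a hedged sketch of the idea only, not the paper's block multi-color Xeon Phi kernel:

```python
import numpy as np

def redblack_gs(b, x, sweeps):
    """Gauss-Seidel for -x[i-1] + 2x[i] - x[i+1] = b[i], colored odd/even."""
    n = len(b)
    for _ in range(sweeps):
        for color in (0, 1):                    # same-color rows are independent,
            i = np.arange(1 + color, n - 1, 2)  # so each color sweep can be
            x[i] = 0.5 * (x[i - 1] + x[i + 1] + b[i])  # vectorized or threaded
    return x

n = 16
x = np.ones(n); x[0] = x[-1] = 0.0   # zero Dirichlet boundaries
x = redblack_gs(np.zeros(n), x, sweeps=500)
# With b = 0 the iterate decays toward the exact solution x = 0.
```

Within one color, no updated value reads another value of the same color, which is exactly the dependency-breaking property that makes the smoother parallel.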


IEEE International Conference on High Performance Computing, Data and Analytics | 2014

Using the Intel Many Integrated Core to accelerate graph traversal

Tao Gao; Yutong Lu; Baida Zhang; Guang Suo

Data-intensive applications have drawn more and more attention in the last few years. The basic graph traversal algorithm, breadth-first search (BFS), is a typical data-intensive application; it is widely used, and the Graph 500 benchmark uses it to rank the performance of supercomputers. The Intel Many Integrated Core (MIC) architecture, which is designed for highly parallel computing, has not been fully evaluated for graph traversal. In this paper, we discuss how to use the MIC to accelerate BFS. We present some optimizations for native BFS algorithms and develop a heterogeneous BFS algorithm. For the native BFS algorithm, we mainly discuss how to exploit the many cores and wide vector processing units. The performance of our optimized native BFS implementation is 5.3 times that of the highest published performance for graphics processing units (GPUs). For the heterogeneous BFS algorithm, cooperative CPU-MIC computing achieves a speedup of approximately 1.4 times over a CPU alone for graphs with 2M vertices. This work is valuable for using a MIC to accelerate BFS, and it also provides general guidance for applying the MIC to data-intensive applications.
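At its core, the algorithm being accelerated is level-synchronous BFS; the frontier arrays in this minimal serial sketch are what the paper maps onto the MIC's many cores and wide vector units:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: expand one frontier per iteration."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:   # first visit fixes the vertex's level
                    level[v] = depth
                    nxt.append(v)
        frontier = nxt               # swap frontiers; each sweep parallelizes
    return level

# Toy graph (adjacency lists); vertex 3 is two hops from vertex 0.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(g, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}
```

Graph 500 scores such a traversal in traversed edges per second (TEPS); the vertex-visit loop inside one level is the natural unit for vectorization.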


International Conference on Algorithms and Architectures for Parallel Processing | 2014

Optimizing and Scaling HPCG on Tianhe-2: Early Experience

Xianyi Zhang; Chao Yang; Fangfang Liu; Yiqun Liu; Yutong Lu

In this paper, we make a first attempt at optimizing and scaling HPCG on the world's largest supercomputer, Tianhe-2. This early work focuses on the optimization of the CPU code without using the Intel Xeon Phi coprocessors. We reformulate the basic CG algorithm to minimize the cost of collective communication and employ several optimization techniques, such as SIMDization, loop unrolling, forward-and-backward sweep fusion, and OpenMP parallelization, to further enhance the performance of kernels such as the sparse matrix-vector multiplication, the symmetric Gauss–Seidel relaxation, and the geometric multigrid v-cycle. We successfully scale the HPCG code from 256 up to 6,144 nodes (147,456 CPU cores) on Tianhe-2, with nearly ideal weak scalability and an aggregate performance of 79.83 Tflops, which is 6.38× higher than the reference implementation.
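For context, the baseline CG iteration contains two dot products per step, each of which becomes an MPI_Allreduce at scale; that is what the reformulation above targets. A hedged sketch with the collectives marked (a dense toy matrix stands in for HPCG's sparse kernels):

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=200):
    """Plain conjugate gradient; each dot product is a collective at scale."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r                        # dot product #1 (Allreduce in MPI runs)
    for _ in range(max_iter):
        Ap = A @ p                    # SpMV in HPCG
        alpha = rs / (p @ Ap)         # dot product #2 (Allreduce in MPI runs)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # small SPD example
b = np.array([1.0, 2.0])
x = cg(A, b)
```

Rearranging the recurrences so that the two reductions can be merged or overlapped with the SpMV is the standard way such a reformulation reduces collective cost.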


IEEE International Conference on High Performance Computing, Data and Analytics | 2016

623 Tflop/s HPCG run on Tianhe-2

Yiqun Liu; Chao Yang; Fangfang Liu; Xianyi Zhang; Yutong Lu; Yunfei Du; Canqun Yang; Min Xie; Xiangke Liao

In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as Tianhe-2. Based on an inner–outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement, from both MPI communication and PCI-E transfer, can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world's largest one, Tianhe-2. On the small system, a thorough comparison and analysis is presented to select among different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregate performance of 623 Tflop/s and a parallel efficiency of 81.2%.
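The inner-outer partitioning idea can be sketched as a cell count: the host keeps an outer shell of adjustable thickness (it also owns the MPI halo exchange), and the interior is split among the devices. Grid sizes and shell thickness here are illustrative, not Tianhe-2 settings:

```python
def inner_outer_cells(nx, ny, nz, shell, n_acc):
    """Cell counts for the host's outer shell and each device's inner share."""
    inner = (nx - 2 * shell) * (ny - 2 * shell) * (nz - 2 * shell)
    outer = nx * ny * nz - inner
    return outer, inner // n_acc

outer, per_acc = inner_outer_cells(64, 64, 64, shell=4, n_acc=3)
print(outer, per_acc)  # 86528 58538
```

A thicker shell shifts work toward the host; in the spirit of the adaptive balancing described above, the thickness would be tuned until host and device per-step times match.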


Journal of Computer Science and Technology | 2015

High Performance Interconnect Network for Tianhe System

Xiangke Liao; Zhengbin Pang; Kefei Wang; Yutong Lu; Min Xie; Jun Xia; Dezun Dong; Guang Suo

In this paper, we present the Tianhe-2 interconnect network and its message-passing services. We describe the architecture of the router and network interface chips, and highlight a set of hardware and software features that effectively support high-performance communication, including remote direct memory access, collective optimization, hardware-enabled reliable end-to-end communication, and user-level message-passing services. Measured hardware performance results are also presented.


International Conference on Parallel and Distributed Systems | 2014

Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm

Yiqun Liu; Xianyi Zhang; Chao Yang; Fangfang Liu; Yutong Lu

In this paper, we propose a hybrid algorithm to enable and accelerate the High Performance Conjugate Gradient (HPCG) benchmark on a heterogeneous node with an arbitrary number of accelerators. In the hybrid algorithm, each subdomain is assigned to a node after a three-dimensional domain decomposition. The subdomain is further divided into several regular inner blocks and an outer part with a flexible inner-outer partitioning strategy. Each inner task is assigned to a MIC device, and its size is adjustable to match the accelerator's computational power. The outer part is assigned to the CPU, and its boundary thickness is also adjustable to maintain load balance between the CPU and the MICs. By properly fusing computational kernels with preceding ones, we present an asynchronous data transfer scheme to better overlap local computation with PCI-Express data transfer. All basic HPCG kernels, especially the time-consuming sparse matrix-vector multiplication (SpMV) and the symmetric Gauss-Seidel relaxation (SymGS), are extensively optimized for both the CPU and the MIC, at both the algorithmic and architectural levels. On a single node of Tianhe-2, which is composed of an Intel Xeon processor and three Intel Xeon Phi coprocessors, we obtain an aggregate performance of 50.2 Gflops, which is around 1.5% of the peak performance.

Collaboration


Dive into Yutong Lu's collaborations.

Top Co-Authors

Xiangke Liao, National University of Defense Technology
Shaoliang Peng, National University of Defense Technology
Chao Yang, Chinese Academy of Sciences
Canqun Yang, National University of Defense Technology
Chengkun Wu, National University of Defense Technology
Fangfang Liu, Chinese Academy of Sciences
Jie Liu, National University of Defense Technology
Bingqiang Wang, Beijing Genomics Institute
Xiaoqian Zhu, National University of Defense Technology
Yingbo Cui, National University of Defense Technology