Leiming Yu
Northeastern University
Publications
Featured research published by Leiming Yu.
IEEE International Symposium on Workload Characterization | 2016
Yifan Sun; Xiang Gong; Amir Kavyan Ziabari; Leiming Yu; Xiangyu Li; Saoni Mukherjee; Carter McCardwell; Alejandro Villegas; David R. Kaeli
Graphics Processing Units (GPUs) can easily outperform CPUs in processing large-scale data parallel workloads, but are considered weak in processing serialized tasks and communicating with other devices. Pursuing a CPU-GPU collaborative computing model which takes advantage of both devices could provide an important breakthrough in realizing the full performance potential of heterogeneous computing. In recent years, platform vendors and runtime systems have added new features such as unified memory space and dynamic parallelism, providing a path to CPU-GPU coordination and the necessary programming infrastructure to support future heterogeneous applications. As the rate of adoption of CPU-GPU collaborative computing continues to increase, it becomes increasingly important to formalize CPU-GPU collaborative programming paradigms and understand the impact of this emerging model on overall application performance. We propose Hetero-Mark, a benchmark suite, to help heterogeneous system programmers understand CPU-GPU collaborative computing and to provide guidance to computer architects in enhancing the design of the runtime and the driver. We summarize seven common CPU-GPU collaborative computing programming patterns and include at least one benchmark for each pattern in the suite. We also characterize different workloads in Hetero-Mark to analyze execution metrics specific to CPU-GPU collaborative computing, including CPU and GPU performance, CPU-GPU communication latency and memory transfer latency.
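Hetero-Mark ships its own implementations of each pattern; purely as illustration of the simplest collaborative pattern it targets — a GPU stage and a CPU stage cooperating through a unified memory space — a minimal CUDA sketch might look like the following (the kernel name and sizes are ours, not taken from the suite):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// GPU stage: data-parallel square of every element.
__global__ void gpuStage(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    const int n = 1 << 20;
    float *data;
    // Unified memory: CPU and GPU dereference the same pointer,
    // which is what enables the collaborative patterns discussed above.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f * i;

    gpuStage<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // CPU stage: a serial reduction, the kind of task CPUs handle well.
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += data[i];
    printf("sum = %f\n", sum);

    cudaFree(data);
    return 0;
}
```

The real benchmarks interleave CPU and GPU work far more tightly; this sketch only shows the shared-pointer mechanism that makes such interleaving practical.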
International Conference on Performance Engineering | 2015
Yash Ukidave; Fanny Nina Paravecino; Leiming Yu; Charu Kalra; Amir Momeni; Zhongliang Chen; Nick Materise; Brett Daley; Perhaad Mistry; David R. Kaeli
Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all introduced new interfaces to enable developers to utilize new features on these platforms. In emerging applications, performance optimization is not limited to effectively exploiting data-level parallelism, but also includes leveraging new degrees of concurrency and parallelism to accelerate the entire application. To aid hardware architects and application developers in effectively tuning performance on GPUs, we have developed the NUPAR benchmark suite. The NUPAR applications belong to a number of different scientific and commercial computing domains. These benchmarks exhibit a range of GPU computing characteristics that consider memory-bandwidth limitations, device occupancy and resource utilization, synchronization latency and device-specific compute optimizations. The NUPAR applications are specifically designed to stress new hardware and software features that include: nested parallelism, concurrent kernel execution, shared host-device memory and new instructions for precise computation and data movement. In this paper, we focus our discussion on applications developed in CUDA and OpenCL, targeting high-end server-class GPUs. We describe these benchmarks and evaluate their interaction with different architectural features on a GPU. Our evaluation examines the behavior of the advanced hardware features on recently released GPU architectures.
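As an illustration of one feature NUPAR stresses — nested (dynamic) parallelism, where a kernel launches child grids directly from the device — here is a minimal CUDA sketch, not taken from the suite; it requires compute capability 3.5+ and relocatable device code (nvcc -rdc=true):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void childKernel(float *row, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) row[i] *= 2.0f;
}

// Parent kernel launches one child grid per matrix row -- the "nested
// parallelism" feature that NUPAR is designed to exercise.
__global__ void parentKernel(float *matrix, int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows)
        childKernel<<<(cols + 127) / 128, 128>>>(matrix + r * cols, cols);
}

int main() {
    const int rows = 64, cols = 1024;
    float *m;
    cudaMallocManaged(&m, rows * cols * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) m[i] = 1.0f;

    parentKernel<<<(rows + 63) / 64, 64>>>(m, rows, cols);
    cudaDeviceSynchronize();  // waits for parent and all child grids
    printf("m[0] = %f\n", m[0]);  // expect 2.0
    cudaFree(m);
    return 0;
}
```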
International Workshop on OpenCL | 2015
Saoni Mukherjee; Xiang Gong; Leiming Yu; Carter McCardwell; Yash Ukidave; Tuan Dao; Fanny Nina Paravecino; David R. Kaeli
The growth in demand for heterogeneous accelerators has stimulated the development of cutting-edge features in newer accelerators. Heterogeneous programming frameworks such as OpenCL have matured over the years and have introduced new software features for developers. We explore one of these programming frameworks, OpenCL 2.0. To drive our study, we consider a number of new features in OpenCL 2.0 using four popular applications from a range of computing domains including signal processing, cybersecurity and machine learning. These applications include: 1) the AES-128 encryption standard, 2) Finite Impulse Response filtering, 3) Infinite Impulse Response filtering, and 4) the Hidden Markov model. In this work, we introduce the latest runtime features enabled in OpenCL 2.0, and discuss how well our sample applications can benefit from some of these features.
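The paper's implementations are written in OpenCL 2.0; to give a flavor of the FIR filtering workload it studies, here is a minimal CUDA analogue (the tap count and coefficients are arbitrary placeholders, not the paper's configuration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_TAPS 32

__constant__ float d_coeff[NUM_TAPS];  // filter coefficients, read-only

// Each thread computes one output sample: y[i] = sum_k h[k] * x[i-k].
__global__ void fir(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < NUM_TAPS; ++k)
        if (i - k >= 0) acc += d_coeff[k] * x[i - k];
    y[i] = acc;
}

int main() {
    const int n = 1 << 16;
    float h[NUM_TAPS];
    for (int k = 0; k < NUM_TAPS; ++k) h[k] = 1.0f / NUM_TAPS;  // moving average
    cudaMemcpyToSymbol(d_coeff, h, sizeof(h));

    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    fir<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaDeviceSynchronize();
    printf("y[n-1] = %f\n", y[n - 1]);  // expect 1.0 once the filter fills
    cudaFree(x); cudaFree(y);
    return 0;
}
```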
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2017
Leiming Yu; Abraham Goldsmith; Stefano Di Cairano
GPU applications have traditionally run on PCs or in larger-scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control (MPC). MPC relies on the real-time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there had never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two different gradient-based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC control implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on the TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero-copy mechanisms are investigated. For ADMM, our implementation achieves a 46.6x speedup over the single-threaded CPU version and a 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve a 41.2x speedup over the single-threaded CPU version and a 4.2x speedup over the OpenBLAS version.
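The paper's tuned SGEMV kernels are not reproduced here; as a baseline sketch of the operation being optimized, a warp-per-row CUDA SGEMV for the small row-major matrices typical of embedded MPC might look like this (sizes illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y = A * x for row-major A (m x n). One warp per row: lanes stride
// across the columns, then reduce their partial sums with shuffles.
__global__ void sgemv(const float *A, const float *x, float *y, int m, int n) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= m) return;
    float acc = 0.0f;
    for (int j = lane; j < n; j += 32)
        acc += A[row * n + j] * x[j];
    for (int off = 16; off > 0; off >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, off);
    if (lane == 0) y[row] = acc;
}

int main() {
    const int m = 256, n = 256;  // "small matrices" common in embedded MPC
    float *A, *x, *y;
    cudaMallocManaged(&A, m * n * sizeof(float));
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, m * sizeof(float));
    for (int i = 0; i < m * n; ++i) A[i] = 1.0f;
    for (int j = 0; j < n; ++j) x[j] = 1.0f;

    sgemv<<<(m + 3) / 4, 128>>>(A, x, y, m, n);  // 4 warps -> 4 rows per block
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);  // expect n = 256
    return 0;
}
```

A warp per row keeps the reads of A coalesced across lanes; the paper goes further with kernel fusion and zero-copy, which this sketch omits.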
Journal of Applied Crystallography | 2016
Yan Zhang; Hideyo Inouye; Michael F. Crowley; Leiming Yu; David R. Kaeli; Lee Makowski
Intensity simulation of X-ray scattering from large twisted cellulose molecular fibrils is important in understanding the impact of chemical or physical treatments on structural properties such as twisting or coiling. This paper describes a highly efficient method for the simulation of X-ray diffraction patterns from complex fibrils using atom-type-specific pair-distance quantization. Pair distances are sorted into arrays which are labelled by atom type. Histograms of pair distances in each array are computed and binned, and the resulting population distributions are used to represent the whole pair-distance data set. These quantized pair-distance arrays are used with a modified and vectorized Debye formula to simulate diffraction patterns. This approach utilizes fewer pair distances in each iteration, and atomic scattering factors are moved outside the iteration since the arrays are labelled by atom type. This algorithm significantly reduces the computation time while maintaining the accuracy of diffraction pattern simulation, making possible the simulation of diffraction patterns from large twisted fibrils in a relatively short period of time, as is required for model testing and refinement.
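In conventional notation (ours, not the paper's), the approach replaces the standard Debye double sum over atom pairs with binned sums per atom-type pair, roughly:

```latex
% Standard Debye formula: O(N^2) pair terms per scattering vector q
I(q) \;=\; \sum_{i=1}^{N}\sum_{j=1}^{N} f_i(q)\, f_j(q)\,
           \frac{\sin(q\, r_{ij})}{q\, r_{ij}}

% Quantized form: pair distances for each atom-type pair (a,b) are binned
% into a histogram, with n_{ab}(d_k) counting pairs whose distance falls
% in bin d_k. The scattering factors f_a, f_b factor out of the inner sum
% because each array holds a single atom-type pair.
I(q) \;\approx\; \sum_{a}\sum_{b} f_a(q)\, f_b(q)
           \sum_{k} n_{ab}(d_k)\, \frac{\sin(q\, d_k)}{q\, d_k}
```

The saving comes from the inner sum running over histogram bins rather than all pair distances, with accuracy controlled by the bin width.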
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2015
Leiming Yu; Yan Zhang; Xiang Gong; Nilay Roy; Lee Makowski; David R. Kaeli
Cellulose is one of the most promising untapped energy resources. Harvesting energy from cellulose requires decoding its atomic structure. Some structural information can be exposed by modeling data produced by X-ray scattering. Forward simulation can be used to explore structural parameters of cellulose, including the diameter, twist and coiling, but modeling fiber scattering is computationally challenging. In this paper, we explore how to accelerate a molecular scattering algorithm by leveraging a modern high-end Graphics Processing Unit (GPU). A step-wise optimization approach is described in this work that considers memory utilization, math intrinsics, concurrent kernel execution and workload partitioning. Different caching strategies to manage the state of the atom volume in memory are taken into account. We have developed optimized cluster solutions for both CPUs and GPUs, investigating different workload distribution schemes and concurrent execution approaches for each. Leveraging accelerators hosted on a cluster, we have reduced days or weeks of intensive simulation to just a few minutes or seconds of parallel execution. Our GPU-integrated cluster solution can potentially support concurrent modeling of hundreds of cellulose fibril structures, opening up new avenues for energy research.
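The paper details its own step-wise optimization sequence; as a sketch of two of the techniques named above — fast math intrinsics and partitioning the scattering-vector grid across concurrent streams — consider the following CUDA fragment (the bin layout and sizes are illustrative assumptions):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per scattering vector q: accumulate
// sum_k count(d_k) * sin(q d_k) / (q d_k).
// __sinf is the fast hardware intrinsic, trading a little accuracy for speed.
__global__ void debyeSum(const float *dist, const float *count, int bins,
                         const float *q, float *I, int nq) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nq) return;
    float qi = q[i], acc = 0.0f;
    for (int k = 0; k < bins; ++k) {
        float x = qi * dist[k];
        acc += count[k] * __sinf(x) / x;
    }
    I[i] = acc;
}

int main() {
    const int bins = 4096, nq = 2048, nStreams = 4;
    float *dist, *count, *q, *I;
    cudaMallocManaged(&dist, bins * sizeof(float));
    cudaMallocManaged(&count, bins * sizeof(float));
    cudaMallocManaged(&q, nq * sizeof(float));
    cudaMallocManaged(&I, nq * sizeof(float));
    for (int k = 0; k < bins; ++k) { dist[k] = 0.1f * (k + 1); count[k] = 1.0f; }
    for (int i = 0; i < nq; ++i) q[i] = 0.01f * (i + 1);

    // Partition the q-grid across streams so chunks can execute concurrently.
    cudaStream_t s[nStreams];
    int chunk = nq / nStreams;
    for (int c = 0; c < nStreams; ++c) {
        cudaStreamCreate(&s[c]);
        debyeSum<<<(chunk + 127) / 128, 128, 0, s[c]>>>(
            dist, count, bins, q + c * chunk, I + c * chunk, chunk);
    }
    cudaDeviceSynchronize();
    printf("I[0] = %f\n", I[0]);
    for (int c = 0; c < nStreams; ++c) cudaStreamDestroy(s[c]);
    return 0;
}
```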
Northeast Bioengineering Conference | 2014
Yan Zhang; Leiming Yu; David R. Kaeli; Lee Makowski
Cellulose in maize is an ideal candidate for a sustainable and renewable energy source because of its global abundance. X-ray scattering has been used as an experimental technique to study cellulose fibrils during processing. In order to understand the impact of chemical or physical treatments on structural parameters such as diameter, twist and coiling, we are simulating X-ray scattering from large bundles of cellulose. Because the structure of these fibrils is non-periodic, these calculations are computationally intensive. In this paper, we describe a GPU-based molecular scattering algorithm developed to calculate X-ray scattering intensity from molecular models of cellulose fibrils. The implementation on Graphics Processing Units (GPUs) greatly accelerates the calculation compared to an optimized CPU program, enabling the assessment of hundreds of models for the cellulose fibril structure.
Journal of Biomedical Optics | 2018
Leiming Yu; Fanny Nina-Paravecino; David R. Kaeli; Qianqian Fang
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs.
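The platform's actual implementation is OpenCL and its scheduler is more sophisticated; a minimal sketch of the device-level load-balancing idea — splitting a photon budget across GPUs in proportion to per-device weights — could look like this in CUDA (the weights and the stub kernel are hypothetical):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stub standing in for the real MC transport loop: each launch would
// trace nPhotons photons through the simulation volume.
__global__ void tracePhotons(unsigned long long nPhotons) { /* ... */ }

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev > 8) nDev = 8;

    // Hypothetical per-device weights; a real implementation would
    // calibrate these by measuring photons/second on each device.
    float weight[8] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
    float total = 0.0f;
    for (int d = 0; d < nDev; ++d) total += weight[d];

    const unsigned long long budget = 100000000ULL;  // total photon count
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        unsigned long long share =
            (unsigned long long)(budget * (weight[d] / total));
        tracePhotons<<<1024, 256>>>(share);  // async: devices run concurrently
        printf("device %d traces %llu photons\n", d, share);
    }
    for (int d = 0; d < nDev; ++d) { cudaSetDevice(d); cudaDeviceSynchronize(); }
    return 0;
}
```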
IEEE International Symposium on Workload Characterization | 2017
Leiming Yu; Xun Gong; Yifan Sun; Qianqian Fang; Norm Rubin; David R. Kaeli
GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device vendors have started to support Concurrent Kernel Execution (CKE). However, the application throughput provided by CKE is subject to a number of factors, including the kernel configuration attributes, the dynamic behavior of each kernel (e.g., compute-intensive vs. memory-intensive), the kernel launch order and inter-kernel dependencies. Minor changes in any of these factors can have a large impact on the effectiveness of CKE. In this paper, we present Moka, an empirical model for tuning concurrent kernel performance. Moka allows us to accurately predict the resulting performance and scalability of multi-kernel applications when using CKE. We consider both static and dynamic workload characteristics that impact the utility of CKE, and leverage these metrics to drive kernel scheduling decisions on NVIDIA GPUs. The underlying data transfer pattern and GPU resource contention are analyzed in detail. Our model is able to accurately predict the performance ceiling of concurrent kernel execution. We validate our model using several real-world applications that have multiple kernels that can run concurrently, and evaluate CKE performance on an NVIDIA Maxwell GPU. Our model is able to predict the performance of CKE applications accurately, providing estimates that differ by less than 12% from actual runtime performance. Using our estimates, we can quickly find the best CKE strategy for our applications to achieve improved application throughput. We believe we have developed a useful tool to aid application programmers in accelerating their applications using CKE.
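Moka itself is a performance model rather than a runtime; the mechanism it models — independent kernels and transfers issued to separate streams so the hardware may overlap them — can be sketched in CUDA as follows (the kernels and sizes are contrived stand-ins for real multi-kernel applications):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;           // light compute: memory-bound
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 256; ++k)  // heavy inner loop: compute-bound
            y[i] = y[i] * 1.000001f + 0.000001f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx, *hy, *dx, *dy;
    cudaMallocHost(&hx, bytes); cudaMallocHost(&hy, bytes);  // pinned, for async copies
    cudaMalloc(&dx, bytes);     cudaMalloc(&dy, bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 0.0f; hy[i] = 1.0f; }

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2);

    // Two independent copy+kernel pipelines in separate streams: the GPU
    // may overlap them (CKE), which is the behavior Moka aims to predict.
    cudaMemcpyAsync(dx, hx, bytes, cudaMemcpyHostToDevice, s1);
    kernelA<<<(n + 255) / 256, 256, 0, s1>>>(dx, n);
    cudaMemcpyAsync(dy, hy, bytes, cudaMemcpyHostToDevice, s2);
    kernelB<<<(n + 255) / 256, 256, 0, s2>>>(dy, n);

    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}
```

Whether the two pipelines actually overlap depends on resource contention and launch order — exactly the factors the model characterizes.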
Archive | 2017
Fanny Nina-Paravecino; Leiming Yu; Qianqian Fang; David R. Kaeli
The human brain is a complex biological organ that is extremely challenging to study. Non-invasive optical scanning has been effective in exploring brain functions and diagnosing brain diseases. However, given the complexity of the human brain anatomy, quantitative analysis of optical brain imaging data has been challenging due to the extensive computation needed to solve the generalized models. In this chapter, we discuss Monte Carlo eXtreme (MCX), a computationally efficient and numerically accurate Monte Carlo photon simulation package. Leveraging the benefits of GPU-based parallel computing, MCX allows researchers to use 3D anatomical scans from MRI or CT to perform accurate photon transport simulations. Compared to conventional Monte Carlo (MC) methods, MCX provides a dramatic speed improvement of two to three orders of magnitude, thanks largely to the massively parallel threads enabled by modern GPU architectures.