Publication


Featured research published by Zheming Jin.


International Workshop on OpenCL | 2018

Performance-oriented Optimizations for OpenCL Streaming Kernels on the FPGA

Zheming Jin; Hal Finkel

Because field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people with little hardware design knowledge to evaluate an application on FPGAs, there is a need to understand where OpenCL and FPGAs fit in streaming domains. To this end, we explore the implementation space and discuss techniques for optimizing the performance of streaming kernels using the Intel OpenCL SDK for FPGA. On the Nallatech 385A FPGA platform, which features an Arria 10 GX1150 FPGA, the experimental results show that FPGA resources, such as block RAMs and DSPs, can limit the performance of a kernel before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2.8 to 10. Combining the two techniques achieves the highest performance, an improvement by a factor of 3.3 to 16. To improve the performance of streaming kernels with compute unit duplication, the local work size needs to be tuned. Compared with an untuned duplicated kernel, the optimal value can increase performance by a factor of 3 to 70.


International Workshop on OpenCL | 2018

Nuclear Reactor Simulation on OpenCL FPGA: a Case Study of RSBench

Zheming Jin; Hal Finkel

Field-programmable gate arrays (FPGAs) are becoming a promising choice as a heterogeneous computing component for scientific computing as floating-point optimized architectures are added to current FPGAs. Emerging high-level synthesis (HLS) tools such as the Intel OpenCL SDK for FPGA offer a streamlined design flow that facilitates the use of FPGAs in scientific computing. Investigating the characteristics of supercomputing applications, such as nuclear reactor simulation, with the emerging HLS development flow is important for researchers to evaluate and adopt FPGA-based heterogeneous programming models in research facilities and laboratories. In this paper, we evaluate the OpenCL-based FPGA design of the nuclear reactor simulation application RSBench. We describe the OpenCL implementations and optimization methods on an Intel Arria 10-based FPGA platform. Compared with the naïve OpenCL kernel, the optimized kernel increases performance by a factor of 295 on the FPGA. Compared with an Intel Xeon 16-core CPU and an Nvidia K80 GPU, the performance per watt on the FPGA is 3.59X better than that of the CPU and 5.8X lower than that of the GPU.


Field Programmable Gate Arrays | 2018

Evaluation of OpenCL Performance-oriented Optimizations for Streaming Kernels on the FPGA: (Abstract Only)

Zheming Jin; Hal Finkel

Because field-programmable gate arrays (FPGAs) can implement streaming applications efficiently, and high-level synthesis (HLS) tools allow people without complex hardware design knowledge to evaluate an application on FPGAs, there is an opportunity and a need to understand where OpenCL and FPGAs fit in streaming domains. To this end, we evaluate the overhead of the OpenCL infrastructure on the Nallatech 385A FPGA board, which features an Arria 10 GX1150 FPGA. We then explore the implementation space and discuss performance optimization techniques for streaming kernels using the OpenCL-to-FPGA HLS tool. On the target platform, the infrastructure overhead requires 12% of the FPGA memory and logic resources. The latency of a single work-item kernel execution is 11 µs, and the maximum frequency of a kernel implementation is around 300 MHz. The experimental results for the streaming kernels show that FPGA resources, such as block RAMs and DSPs, can limit kernel performance before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2 to 10. Combining the two techniques achieves the best performance. To improve the performance of compute unit duplication, the local work size needs to be tuned; the optimal value can increase performance by a factor of 3 to 70 compared with the default value.


Field Programmable Gate Arrays | 2018

Optimizations of Sequence Alignment on FPGA: A Case Study of Extended Sequence Alignment (Abstract Only)

Zheming Jin; Kazutomo Yoshii

Detecting similarities between sequences is an important part of bioinformatics. In this poster, we explore the use of a high-level synthesis tool and a field-programmable gate array (FPGA) for optimizing a sequence alignment algorithm. We demonstrate optimization techniques that improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying the optimizations to the algorithm using the Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms, while the power consumption is approximately 11 W on the ADM-PCIE-8K5 FPGA platform.
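
As a quick reading aid for the figures quoted in this abstract, the Python sketch below computes the speedup implied by the two kernel execution times and the energy of one optimized kernel run. The function names are illustrative only, and treating the approximate 11 W figure as a constant power draw during the optimized run is an assumption, not something the abstract states.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized kernel runs."""
    return baseline_ms / optimized_ms


def energy_mj(time_ms: float, power_w: float) -> float:
    """Return energy in millijoules (ms * W = mJ), assuming constant power."""
    return time_ms * power_w


if __name__ == "__main__":
    # Figures quoted in the abstract above.
    print(f"speedup: {speedup(62.8, 0.45):.1f}X")
    print(f"energy per optimized run: {energy_mj(0.45, 11.0):.2f} mJ")
```

Under these assumptions, the reported times correspond to a speedup of roughly 140X and about 5 mJ per optimized kernel run.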


Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies - HEART 2018 | 2018

A Case Study of Integer Sum Reduction using Atomics

Zheming Jin; Hal Finkel

This paper presents implementations of integer sum reduction using atomic functions on FPGA, CPU, and GPU platforms. We explain the implementations and optimizations of the kernel using an OpenCL-based high-level synthesis flow for an FPGA. In addition, we describe the optimizations of the reduction using directives for a multi-core CPU and a GPU. The experimental results show that the reduction on an Nvidia K80 GPU is 3.4X and 6.7X faster than on an Intel Xeon 16-core CPU and an Arria 10 GX1150 FPGA, respectively. However, the FPGA consumes 4.4X and 2.3X less power than the CPU and GPU, respectively. The performance per watt on the FPGA is 2.2X higher than that on the CPU and 2.9X lower than that on the GPU.
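
The performance-per-watt figures in this abstract can be cross-checked from the quoted speed and power ratios. The Python sketch below does that arithmetic; the function names are hypothetical, and it assumes that performance per watt is simply relative speed divided by relative power.

```python
def perf_per_watt_gain(speed_ratio: float, power_ratio: float) -> float:
    """Performance per watt of device A relative to device B.

    speed_ratio: how many times faster A is than B.
    power_ratio: how many times more power A draws than B.
    """
    return speed_ratio / power_ratio


def fpga_vs_cpu() -> float:
    # The GPU is 3.4X faster than the CPU and 6.7X faster than the FPGA,
    # so the FPGA runs at 3.4 / 6.7 of the CPU's speed.
    # The FPGA draws 1 / 4.4 of the CPU's power.
    return perf_per_watt_gain(3.4 / 6.7, 1 / 4.4)


def gpu_vs_fpga() -> float:
    # The GPU is 6.7X faster than the FPGA and draws 2.3X more power.
    return perf_per_watt_gain(6.7, 2.3)


if __name__ == "__main__":
    print(f"FPGA vs CPU: {fpga_vs_cpu():.1f}X")
    print(f"GPU vs FPGA: {gpu_vs_fpga():.1f}X")
```

Rounded to one decimal place, both results agree with the 2.2X and 2.9X figures reported above.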


European Conference on Parallel Processing | 2017

Evaluation of a Floating-Point Intensive Kernel on FPGA

Zheming Jin; Hal Finkel; Kazutomo Yoshii; Franck Cappello

Heterogeneous platforms provide a promising solution for high-performance and energy-efficient computing applications. This paper presents our research on the use of a heterogeneous platform for a floating-point intensive kernel. We first introduce the floating-point intensive kernel, which comes from a geographical information system. Then we analyze the FPGA designs generated by the Intel FPGA SDK for OpenCL, and evaluate the kernel performance and the floating-point error rate of the FPGA designs. Finally, we compare the performance and energy efficiency of the kernel implementations on the Arria 10 FPGA, Intel’s Xeon Phi Knights Landing CPU, and NVIDIA’s Kepler GPU. Our evaluation shows that the energy efficiency of the single-precision kernel on the FPGA is 1.35X better than on the CPU and the GPU, while the energy efficiency of the double-precision kernel on the FPGA is 1.36X and 1.72X lower than on the CPU and GPU, respectively.


Archive | 2017

Evaluation of the Single-precision Floating-point Vector Add Kernel Using the Intel FPGA SDK for OpenCL

Zheming Jin; Kazutomo Yoshii; Hal Finkel; Franck Cappello

Open Computing Language (OpenCL) is a high-level language that enables software programmers to explore field-programmable gate arrays (FPGAs) for application acceleration. The Intel FPGA software development kit (SDK) for OpenCL allows a user to specify applications at a high level and explore the performance of low-level hardware acceleration. In this report, we present the FPGA performance and power consumption results of the single-precision floating-point vector add OpenCL kernel using the Intel FPGA SDK for OpenCL on the Nallatech 385A FPGA board, which features an Arria 10 FPGA. We evaluate the FPGA implementations using the compute unit duplication and kernel vectorization optimization techniques. On this board, the maximum compute kernel bandwidth we achieve is 25.8 GB/s, approximately 76% of the peak memory bandwidth. The power consumption of the FPGA device when running the kernels ranges from 29 W to 42 W.
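
To make the bandwidth figures concrete, here is a minimal Python sketch of how an effective bandwidth for a vector add kernel might be computed, and of the peak bandwidth implied by the reported percentage. It assumes that traffic is counted as two reads and one write of a 4-byte float per element; the report's own definition may differ, and the function names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # single-precision element size


def vector_add_bandwidth_gbs(n_elements: int, seconds: float) -> float:
    """Effective bandwidth in GB/s for c[i] = a[i] + b[i].

    Assumes two reads and one write per element are counted as traffic.
    """
    bytes_moved = 3 * n_elements * BYTES_PER_FLOAT
    return bytes_moved / seconds / 1e9


def implied_peak_gbs(achieved_gbs: float, fraction_of_peak: float) -> float:
    """Peak bandwidth implied by an achieved value and its share of peak."""
    return achieved_gbs / fraction_of_peak


if __name__ == "__main__":
    # Figures quoted in the abstract above: 25.8 GB/s at about 76% of peak.
    print(f"implied peak: {implied_peak_gbs(25.8, 0.76):.1f} GB/s")
```

With the quoted figures, the implied peak memory bandwidth is a little under 34 GB/s; because the 76% value is approximate, so is this estimate.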


Parallel, Distributed and Network-Based Processing | 2018

Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA

Zheming Jin; Iris Johnson; Hal Finkel


International Parallel and Distributed Processing Symposium | 2018

Evaluation of MD5Hash Kernel on OpenCL FPGA Platform

Zheming Jin; Hal Finkel


International Parallel and Distributed Processing Symposium | 2018

Optimizing Parallel Reduction on OpenCL FPGA Platform – A Case Study of Frequent Pattern Compression

Zheming Jin; Hal Finkel

Collaboration


Dive into Zheming Jin's collaboration.

Top Co-Authors

Hal Finkel

Argonne National Laboratory

Kazutomo Yoshii

Argonne National Laboratory

Franck Cappello

Argonne National Laboratory
