
Publication


Featured research published by Jingfei Jiang.


ACM Transactions on Reconfigurable Technology and Systems | 2017

Throughput-Optimized FPGA Accelerator for Deep Convolutional Neural Networks

Zhiqiang Liu; Yong Dou; Jingfei Jiang; Jinwei Xu; Shijie Li; Yongmei Zhou; Yingnan Xu

Deep convolutional neural networks (CNNs) have achieved great success in various computer vision applications. State-of-the-art CNN models for large-scale applications are computation-intensive and memory-expensive and are therefore mainly processed on high-performance processors such as server CPUs and GPUs. However, there is an increasing demand for high-accuracy or real-time object detection in large-scale clusters and embedded systems, which calls for energy-efficient accelerators because of green-computing requirements or limited battery capacity. Owing to their energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this article, we present an in-depth analysis of the computational complexity and memory footprint of each CNN layer type. We then propose a scalable parallel framework that exploits four levels of parallelism in hardware acceleration, and put forward a systematic design space exploration methodology to search for the solution that maximizes accelerator throughput under FPGA constraints such as on-chip memory, computational resources, external memory bandwidth, and clock frequency. Finally, we demonstrate the methodology by optimizing three representative CNNs (LeNet, AlexNet, and VGG-S) on a Xilinx VC709 board. The average performance of the three accelerators is 424.7, 445.6, and 473.4 GOP/s at a 100 MHz working frequency, which significantly outperforms the CPU and previous work.
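
The design space exploration described here can be illustrated with a minimal sketch. The resource and throughput models below are simplified assumptions invented for illustration (the parameter names `Tm` and `Tn` and the budget numbers are hypothetical, not taken from the paper): enumerate candidate parallelism factors and keep the configuration with the highest modeled throughput that fits within the FPGA's resources.

```python
# Minimal sketch of throughput-oriented design space exploration for a CNN
# accelerator. The resource/throughput models below are illustrative
# assumptions, not the paper's actual equations.

from itertools import product

# Hypothetical FPGA budget (roughly VC709-class numbers, for illustration).
DSP_BUDGET = 3600        # DSP slices available
BRAM_BUDGET_KB = 6000    # on-chip memory in KB
CLOCK_MHZ = 100

def resources(tm, tn):
    """Assumed cost model: Tm x Tn parallel MACs, plus buffers."""
    dsps = tm * tn                       # one DSP per parallel MAC
    brams_kb = 4 * (tm + tn) * 32        # input/output/weight buffers
    return dsps, brams_kb

def throughput_gops(tm, tn):
    """Assumed performance model: 2 ops (mul+add) per MAC per cycle."""
    return 2 * tm * tn * CLOCK_MHZ / 1e3

best = None
for tm, tn in product(range(1, 129), range(1, 129)):  # channel parallelism
    dsps, brams_kb = resources(tm, tn)
    if dsps <= DSP_BUDGET and brams_kb <= BRAM_BUDGET_KB:
        gops = throughput_gops(tm, tn)
        if best is None or gops > best[0]:
            best = (gops, tm, tn)

gops, tm, tn = best
print(f"best config: Tm={tm}, Tn={tn}, modeled throughput={gops:.1f} GOP/s")
```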


Field-Programmable Technology | 2016

Automatic code generation of convolutional neural networks in FPGA implementation

Zhiqiang Liu; Yong Dou; Jingfei Jiang; Jinwei Xu

Convolutional neural networks (CNNs) have achieved great success in various computer vision applications. However, state-of-the-art CNN models are computation-intensive and hence are mainly processed on high-performance processors such as server CPUs and GPUs. Owing to their high performance, energy efficiency, and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this paper, we propose parallel structures that exploit the inherent parallelism, together with efficient computation units for the operations in convolutional and fully-connected layers. Furthermore, we propose an automatic generator that produces Verilog HDL source code from a high-level hardware description. Execution time, DSP consumption, and performance are modeled analytically in terms of several critical design variables. We demonstrate the automatic methodology by implementing two representative CNNs (LeNet and AlexNet) and evaluate the execution-time models by comparing estimated and measured values. Our results show that the proposed automatic methodology yields hardware designs with good performance and greatly reduces development turnaround time.
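
A minimal sketch of the code-generation idea follows. The Verilog module template and the specification fields are invented for illustration; the paper's actual generator and its high-level description language are not reproduced here.

```python
# Minimal sketch of generating Verilog HDL from a high-level layer spec.
# The template and the spec fields are illustrative assumptions, not the
# paper's actual description language or generated architecture.

CONV_TEMPLATE = """\
module {name} #(
    parameter DATA_W = {data_w},
    parameter K      = {kernel},   // kernel size
    parameter N_IN   = {n_in},     // input feature maps
    parameter N_OUT  = {n_out}     // output feature maps
) (
    input  wire              clk,
    input  wire              rst,
    input  wire [DATA_W-1:0] pixel_in,
    output wire [DATA_W-1:0] pixel_out
);
    // ... convolution datapath would be instantiated here ...
endmodule
"""

def emit_conv_layer(spec: dict) -> str:
    """Render one convolutional layer module from a spec dictionary."""
    return CONV_TEMPLATE.format(
        name=spec["name"], data_w=spec["data_width"],
        kernel=spec["kernel"], n_in=spec["in_maps"], n_out=spec["out_maps"])

# Hypothetical high-level description of LeNet's first convolution layer.
lenet_c1 = {"name": "conv1", "data_width": 16, "kernel": 5,
            "in_maps": 1, "out_maps": 6}
print(emit_conv_layer(lenet_c1))
```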


Applied Reconfigurable Computing | 2017

Accuracy Evaluation of Long Short Term Memory Network Based Language Model with Fixed-Point Arithmetic

Ruochun Jin; Jingfei Jiang; Yong Dou

Language models based on Long Short-Term Memory (LSTM) networks are state-of-the-art techniques in natural language processing. Training LSTM networks is computationally intensive, which naturally motivates FPGA acceleration using fixed-point arithmetic. However, previous studies have focused only on accelerators with a few fixed bit-widths, without thorough accuracy evaluation. The main contribution of this paper is a comprehensive experimental evaluation of the bit-width effect on an LSTM-based language model and on the tanh function approximation. Theoretically, a 12-bit number with a 6-bit fractional part is the best choice, balancing accuracy against storage savings. To attain performance similar to the software implementation while fitting the bit-widths of FPGA primitives, we further propose a mixed bit-width solution combining 8-bit and 16-bit numbers. With a clear accuracy trade-off, our results provide a guide to the design choices on bit-widths when implementing LSTMs on FPGAs. Additionally, our experiments show that, notably, the optimal fixed-point configuration is largely independent of the scale of the LSTM network, which indicates that our results are applicable to larger models as well.
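
The recommended representation can be illustrated with a minimal fixed-point sketch, assuming round-to-nearest with saturation and a table-based tanh (the rounding rule, saturation behavior, and LUT construction below are illustrative assumptions, not the paper's exact scheme): 12-bit values with a 6-bit fractional part, as the abstract recommends.

```python
# Minimal sketch of the fixed-point representation the abstract discusses:
# 12-bit values with a 6-bit fractional part. The rounding, saturation,
# and LUT details are assumptions for illustration.

import math

FRAC_BITS = 6
TOTAL_BITS = 12
QMAX = (1 << (TOTAL_BITS - 1)) - 1   # largest positive code
QMIN = -(1 << (TOTAL_BITS - 1))      # most negative code

def to_fixed(x: float) -> int:
    """Round to the nearest representable code, saturating on overflow."""
    code = int(round(x * (1 << FRAC_BITS)))
    return max(QMIN, min(QMAX, code))

def to_float(code: int) -> float:
    return code / (1 << FRAC_BITS)

# Table-based tanh over the fixed-point input domain (one entry per code).
TANH_LUT = {c: to_fixed(math.tanh(to_float(c))) for c in range(QMIN, QMAX + 1)}

def fixed_tanh(code: int) -> int:
    return TANH_LUT[code]

x = 0.73
q = to_fixed(x)
print(f"x={x}, quantized={to_float(q):.4f}, "
      f"tanh={math.tanh(x):.4f}, fixed tanh={to_float(fixed_tanh(q)):.4f}")
```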


Asian Conference on Intelligent Information and Database Systems | 2017

A Super-Vector Deep Learning Coprocessor with High Performance-Power Ratio

Jingfei Jiang; Zhiqiang Liu; Jinwei Xu; Rongdong Hu

The maturity of deep learning theory and the development of computers have made deep learning algorithms powerful tools for mining the underlying features of big data. There is an increasing demand for high-accuracy, real-time object detection for intelligent communication and control tasks in embedded systems. More energy-efficient deep learning accelerators are required because of the limited battery and resources of embedded systems. We propose a super-vector coprocessor architecture called SVP-DL. SVP-DL can process the various matrix operations used in deep learning algorithms by computing on multidimensional vectors with dedicated vector and scalar instructions, enabling flexible combinations of matrix operations and data organizations. We verified SVP-DL on a self-developed field-programmable gate array (FPGA) platform, programming the typical deep belief network and the sparse coding network on the coprocessor. Experimental results show that SVP-DL on FPGA achieves 1.7 to 2.1 times the performance of a PC platform despite its low clock frequency, and about 9 times the performance-power efficiency of a PC.
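
A toy sketch of the vector-instruction idea follows. The mini-ISA used here (`VLOAD`, `VMUL`, `VSUM`) is invented for illustration and is not SVP-DL's actual instruction set, which the abstract does not detail: it merely shows how a matrix-vector product decomposes into vector and scalar instructions.

```python
# Toy illustration of expressing a matrix-vector product as vector
# instructions. The mini-ISA below (VLOAD/VMUL/VSUM) is invented for this
# sketch and is not SVP-DL's actual instruction set.

def run(program, memory, vregs=None, sregs=None):
    """Interpret a list of (opcode, operands) vector/scalar instructions."""
    vregs, sregs = vregs or {}, sregs or {}
    for op, *args in program:
        if op == "VLOAD":            # VLOAD vd, addr, n : load n elements
            vd, addr, n = args
            vregs[vd] = memory[addr:addr + n]
        elif op == "VMUL":           # VMUL vd, va, vb : elementwise multiply
            vd, va, vb = args
            vregs[vd] = [a * b for a, b in zip(vregs[va], vregs[vb])]
        elif op == "VSUM":           # VSUM sd, va : reduce vector to scalar
            sd, va = args
            sregs[sd] = sum(vregs[va])
    return sregs

# y[i] = dot(W[i, :], x) computed one row at a time.
W = [[1, 2], [3, 4]]
x = [10, 20]
memory = W[0] + W[1] + x            # flat memory image: rows of W, then x
y = []
for i in range(2):
    program = [("VLOAD", "v0", 2 * i, 2),   # row i of W
               ("VLOAD", "v1", 4, 2),       # x
               ("VMUL", "v2", "v0", "v1"),
               ("VSUM", "s0", "v2")]
    y.append(run(program, memory)["s0"])
print(y)  # [50, 110]
```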


Cognitive Systems Research | 2017

High performance robust audio event recognition system based on FPGA platform

Jinwei Xu; Zhiqiang Liu; Jingfei Jiang; Yong Dou

Audio event recognition is applied in many novel application areas. Compared with deep CNNs, the 1-max pooling CNN is a simple but efficient CNN architecture for robust audio event recognition. This study proposes a parallel architecture to accelerate robust audio event recognition. To implement it in hardware, we evaluate the precision of the 1-max pooling CNN model and propose an approximate algorithm to replace the complex calculation in spectral image feature (SIF) extraction. We then propose a scalable parallel structure for SIF extraction and the 1-max pooling CNN. In our implementation, the SIF extraction unit is eight-way parallel and the 1-max pooling CNN accelerator has 40 processing elements (PEs). The entire system is implemented on a Xilinx VC709 board. The average performance of our FPGA accelerator is 675.7 fps at a 100 MHz working frequency, about a 31.9× speedup over the CPU. We further implement a small-scale FPGA array with four Xilinx FPGAs for robust audio event recognition; to communicate between the four FPGAs and the host, we design a routing protocol based on a source-routing algorithm.
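
The 1-max pooling operation at the core of this architecture reduces each convolutional feature map to its single largest activation, so the classifier input size equals the number of filters. A minimal sketch (the toy shapes and the plain-Python convolution are illustrative assumptions):

```python
# Minimal sketch of 1-max pooling: each filter's feature map over the
# time-frequency input is reduced to its single maximum activation, so the
# classifier input size is just the number of filters. Shapes are assumed.

def conv_valid_1d(signal, kernel):
    """Plain 'valid' 1-D convolution (correlation) for illustration."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def one_max_pool(feature_map):
    """1-max pooling: keep only the strongest activation."""
    return max(feature_map)

# Toy spectral feature sequence and two small filters.
frames = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8]
filters = [[1.0, -1.0], [0.5, 0.5]]

pooled = [one_max_pool(conv_valid_1d(frames, f)) for f in filters]
print(pooled)  # one value per filter, fed to the fully-connected classifier
```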


Archive | 2016

Fixed-Point Evaluation of Extreme Learning Machine for Classification

Yingnan Xu; Jingfei Jiang; Juping Jiang; Zhiqiang Liu; Jinwei Xu

With the growth of data sets, the efficiency of the Extreme Learning Machine (ELM) model combined with customized hardware implementations such as Field-Programmable Gate Arrays (FPGAs) has become attractive for many real-time learning tasks. To reduce the resource occupation of the final trained model on an FPGA, it is more efficient to store fixed-point data rather than double-precision floating-point data in the on-chip RAMs. This paper conducts a fixed-point evaluation of ELM for classification. We converted the ELM algorithm into a fixed-point version by changing the operation types, approximating the complex functions, and blocking the large-scale matrices, in accordance with the architecture on which ELM would be implemented on an FPGA. The classification performance with single bit-widths and with mixed bit-widths was evaluated. Experimental results show that the fixed-point representation of ELM works for some applications, and that the performance can be improved further by adopting mixed bit-widths.
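
A minimal sketch of the idea follows (using NumPy; the bit-width, rounding rule, and quantization step are illustrative assumptions, not the paper's exact conversion): train a tiny ELM in floating point, then quantize the stored weights to fixed point as they would sit in on-chip RAM.

```python
# Minimal ELM sketch with post-training fixed-point weight storage.
# The 16-bit/8-fraction format and rounding choices are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def quantize(a, frac_bits=8, total_bits=16):
    """Round to fixed point with saturation; returns de-quantized floats."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(a * scale), lo, hi) / scale

# Toy data: 2-D points, two classes (one-hot targets).
X = rng.normal(size=(200, 2))
T = np.eye(2)[(X[:, 0] + X[:, 1] > 0).astype(int)]

# ELM training: random hidden layer, least-squares output weights.
W = rng.normal(size=(2, 32))                 # random, never trained
b = rng.normal(size=32)
H = np.tanh(X @ W + b)
beta = np.linalg.pinv(H) @ T                 # closed-form solution

# Deployment with quantized weights, as would be stored in on-chip RAM.
Wq, bq, betaq = quantize(W), quantize(b), quantize(beta)
pred = (np.tanh(X @ Wq + bq) @ betaq).argmax(1)
acc = (pred == T.argmax(1)).mean()
print(f"fixed-point accuracy: {acc:.3f}")
```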


Archive | 2015

Implementation of a Fine-Grained Parallel Full Pipeline Schnorr–Euchner Sphere Decoder Algorithm Accelerator on Field-Programmable Gate Array

Shijie Li; Lei Guo; Yong Dou; Jingfei Jiang

A new parallel, fully pipelined accelerator for the Schnorr–Euchner sphere decoding (SE–SD) algorithm, implemented on a field-programmable gate array (FPGA), is presented in this paper. We first transform the serial SE–SD algorithm into a parallel one. We then use multiple processing elements (PEs) to handle the workload, in particular the tree search in the SE–SD algorithm, in parallel; the search workload is divided evenly, with each PE searching one sub-tree. Each PE uses a multilevel pipeline to increase data throughput, and the whole system accepts a stream of input data over time. A distribution unit lets us select the number of PEs to match the hardware platform. We successfully placed four PEs in one accelerator and eight accelerators in a single FPGA (XC6VLX240T). The system benefits considerably from a switchable acceleration mode, offering both latency-first and throughput-first modes.
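
A much-simplified sketch of the sub-tree partitioning idea follows: the search tree of a tiny 2x2 real-valued decoder is split at the top level, one sub-tree per PE. This brute-force enumeration over an assumed 4-PAM constellation is for illustration only; it is not the Schnorr–Euchner enumeration order the paper accelerates.

```python
# Much-simplified sketch of the sub-tree partitioning idea: the search tree
# of a 2x2 real-valued sphere decoder is split at the top level, one
# sub-tree per PE. This brute-force enumeration is for illustration only;
# it is not the Schnorr-Euchner ordering the paper accelerates.

CONSTELLATION = [-3, -1, 1, 3]          # toy 4-PAM symbols

def subtree_search(top_symbol, R, y):
    """One 'PE': search all leaves under a fixed top-level symbol."""
    best = (float("inf"), None)
    for s0 in CONSTELLATION:            # remaining level of the tiny tree
        s = (s0, top_symbol)
        # residual metric for upper-triangular R (levels solved bottom-up)
        r1 = y[1] - R[1][1] * s[1]
        r0 = y[0] - R[0][0] * s[0] - R[0][1] * s[1]
        metric = r0 * r0 + r1 * r1
        best = min(best, (metric, s))
    return best

R = [[2.0, 0.5],
     [0.0, 1.5]]                        # upper-triangular channel factor
y = [1.2, -2.9]

# Each top-level symbol's sub-tree would go to a separate PE in hardware.
results = [subtree_search(top, R, y) for top in CONSTELLATION]
metric, symbols = min(results)
print(f"ML estimate: {symbols}, metric {metric:.3f}")
```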


Cybernetics and Information Technologies | 2014

Experimental Demonstration of the Fixed-Point Sparse Coding Performance

Jingfei Jiang; Rongdong Hu; Fei Zhang; Yong Dou

The Sparse Coding (SC) model has been proven to be among the best neural network models for unsupervised feature learning in many applications. Running a sparse coding algorithm is time-consuming because of its large scale and processing characteristics, which naturally leads to investigating FPGA acceleration. Fixed-point arithmetic can be used when implementing SC in FPGAs to reduce execution time, but the implications for accuracy are not clear. Previous studies have focused only on accelerators using a few fixed bit-widths, and on other neural network models. Our work gives a comprehensive evaluation of the bit-width effect on SC with respect to performance and area efficiency. Data format conversion and matrix blocking are the main factors considered, reflecting the conditions of a hardware implementation. We evaluate simple truncation, constraining the representation domain, and matrix blocking with different degrees of parallelism. The results show that the fixed-point bit-width does affect the performance of SC: the representation domain of the data must be limited carefully, and an adequate bit-width selected according to the computation parallelism. The results also show that fixed-point arithmetic can preserve the precision of the SC algorithm while achieving acceptable convergence speed.
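
The difference between simple truncation and a constrained representation domain can be seen in a small sketch (the bit-widths and the saturation rule below are assumptions for illustration): truncation lets overflow wrap like a hardware register, while the constrained version clamps to the representable range.

```python
# Small sketch contrasting two fixed-point conversion rules evaluated in
# the paper: simple truncation versus constraining the representation
# domain (saturation). Bit-widths here are illustrative assumptions.

FRAC_BITS = 8
TOTAL_BITS = 12
LO = -(1 << (TOTAL_BITS - 1))
HI = (1 << (TOTAL_BITS - 1)) - 1

def truncate(x):
    """Simple truncation: drop fraction bits, let overflow wrap."""
    code = int(x * (1 << FRAC_BITS))          # truncates toward zero
    code &= (1 << TOTAL_BITS) - 1             # wraps like a hardware register
    if code > HI:                             # reinterpret as signed
        code -= 1 << TOTAL_BITS
    return code / (1 << FRAC_BITS)

def saturate(x):
    """Domain-constrained conversion: clamp instead of wrapping."""
    code = max(LO, min(HI, int(round(x * (1 << FRAC_BITS)))))
    return code / (1 << FRAC_BITS)

for x in [0.371, -1.75, 9.4]:                 # 9.4 overflows Q3.8
    print(f"x={x:6.3f}  truncated={truncate(x):8.4f}  "
          f"saturated={saturate(x):8.4f}")
```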


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2013

A Flexible Memory Controller Supporting Deep Belief Networks with Fixed-Point Arithmetic

Jingfei Jiang; Rongdong Hu; Mikel Luján

Deep Belief Networks (DBNs) are state-of-the-art machine learning techniques and among the most important unsupervised learning algorithms. Training DBNs is computationally intensive, which naturally leads to investigating FPGA acceleration. Fixed-point arithmetic can have an important influence on the execution time and prediction accuracy of a DBN. Previous studies have focused only on customized DBN accelerators with a fixed data-width. Our experiments demonstrate that supporting various data-widths across different DBN configurations and application environments does matter for achieving acceptable performance; we conclude that a DBN accelerator should support various data-widths rather than the single fixed one used in previous work. The processing performance of DBN accelerators on FPGAs is almost always constrained not by the capacity of the processing units but by the capacity and speed of the on-chip RAM. We propose an efficient memory controller for DBN accelerators which shows that supporting various data-widths is not as difficult as it may sound: the hardware cost is small and does not affect the critical path. We have also designed a tool that helps users reconfigure the memory controller flexibly for arbitrary data-widths.
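
The core difficulty such a memory controller solves is packing values of an arbitrary data-width into the fixed word width of on-chip RAM. A minimal address-mapping sketch (the 36-bit word size and the helper names are assumptions, not the paper's design):

```python
# Minimal sketch of the address mapping a variable-data-width memory
# controller must perform: values of any bit-width are packed back to back
# into fixed-width RAM words. The 36-bit word size and helper names are
# illustrative assumptions.

WORD_BITS = 36   # e.g., one BRAM word

def locate(index: int, data_width: int):
    """Map element index -> (word address, bit offset, spills_into_next)."""
    bit_pos = index * data_width
    word, offset = divmod(bit_pos, WORD_BITS)
    spills = offset + data_width > WORD_BITS
    return word, offset, spills

def read_element(ram, index: int, data_width: int) -> int:
    """Extract one packed element, stitching across a word boundary."""
    word, offset, spills = locate(index, data_width)
    bits = ram[word] >> offset
    if spills:  # low part from this word, high part from the next
        low_bits = WORD_BITS - offset
        bits |= ram[word + 1] << low_bits
    return bits & ((1 << data_width) - 1)

# Pack five 13-bit values into 36-bit words, then read them back.
values = [0x1ABC, 0x0123, 0x1FFF, 0x0001, 0x0F0F]
blob = 0
for i, v in enumerate(values):
    blob |= v << (i * 13)
ram = [(blob >> (w * WORD_BITS)) & ((1 << WORD_BITS) - 1) for w in range(2)]
print([hex(read_element(ram, i, 13)) for i in range(5)])
```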


Applied Reconfigurable Computing | 2013

Empirical evaluation of fixed-point arithmetic for deep belief networks

Jingfei Jiang; Rongdong Hu; Mikel Luján; Yong Dou

Deep Belief Networks (DBNs) are state-of-the-art machine learning techniques and among the most important unsupervised learning algorithms. Training DBNs is computationally intensive, which naturally leads to investigating FPGA acceleration. Fixed-point arithmetic can be used when implementing DBNs in FPGAs to reduce execution time, but the implications for accuracy are not clear. Previous studies have focused only on accelerators using a few fixed bit-widths. A contribution of this paper is a comprehensive experimental evaluation of the bit-width effect on various configurations of DBNs. Our work builds on the original DBN, composed of Restricted Boltzmann Machines (RBMs), and on the idea of the Stacked Denoising Auto-Encoder (SDAE). We modified the floating-point versions of the original DBN and the denoising DBN (dDN) into fixed-point versions and compared their performance. Explicit performance changing points are found across bit-widths, and different configurations of DBNs have different changing points: the performance variation of three-layer DBNs is somewhat larger than that of one-layer DBNs because deeper DBNs are more sensitive. Sigmoid function approximation is required when implementing DBNs in an FPGA, and the impact of piecewise linear approximation (PLA) of the nonlinearity at two different precisions is evaluated quantitatively in our experiments. Modern FPGAs supply built-in primitives to support matrix operations, including multiplications, accumulations, and additions, which are the main operations of DBNs. We propose a mixed bit-width solution in which a narrower bit-width is used for neural units and a wider one for weights, thus fitting the bit-widths of FPGA primitives while attaining performance similar to the software implementation. Our results provide a guide to the design choices on bit-widths when implementing DBNs in FPGAs, documenting clearly the trade-off in accuracy.
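
Piecewise linear approximation of the sigmoid, mentioned above, replaces the exponential with a few line segments, which maps naturally onto FPGA logic. A minimal sketch (the breakpoints and segment count are assumptions; the two precisions the paper evaluates are not reproduced here):

```python
# Minimal sketch of a piecewise linear approximation (PLA) of the sigmoid.
# The breakpoints below are illustrative assumptions; the paper evaluates
# PLA at two precisions that are not reproduced here.

import math

# Segment breakpoints on [0, 8); sigmoid(-x) = 1 - sigmoid(x) by symmetry.
BREAKPOINTS = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

def _exact(x):
    return 1.0 / (1.0 + math.exp(-x))

# Precompute (slope, intercept) per segment from the exact endpoint values.
SEGMENTS = []
for a, b in zip(BREAKPOINTS, BREAKPOINTS[1:]):
    slope = (_exact(b) - _exact(a)) / (b - a)
    SEGMENTS.append((a, b, slope, _exact(a) - slope * a))

def pla_sigmoid(x: float) -> float:
    """Sigmoid via a table of line segments, as FPGA logic might compute it."""
    if x < 0:
        return 1.0 - pla_sigmoid(-x)
    for a, b, slope, intercept in SEGMENTS:
        if x < b:
            return slope * x + intercept
    return 1.0  # saturate beyond the last breakpoint

for x in [-5.0, -0.5, 0.0, 1.3, 7.0]:
    print(f"x={x:5.1f}  exact={_exact(x):.4f}  pla={pla_sigmoid(x):.4f}")
```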

Collaboration


Dive into Jingfei Jiang's collaborations.

Top Co-Authors

Yong Dou, National University of Defense Technology
Jinwei Xu, National University of Defense Technology
Zhiqiang Liu, National University of Defense Technology
Rongdong Hu, National University of Defense Technology
Shijie Li, National University of Defense Technology
Yingnan Xu, National University of Defense Technology
Mikel Luján, University of Manchester
Dongsheng Wang, National University of Defense Technology
Fei Zhang, National University of Defense Technology
Juping Jiang, National University of Defense Technology