Publication


Featured research published by Fengbin Tu.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2017

Deep Convolutional Neural Network Architecture With Reconfigurable Computation Patterns

Fengbin Tu; Shouyi Yin; Peng Ouyang; Shibin Tang; Leibo Liu; Shaojun Wei

Deep convolutional neural networks (DCNNs) have been successfully used in many computer vision tasks. Previous works on DCNN acceleration usually use a fixed computation pattern for diverse DCNN models, leading to an imbalance between power efficiency and performance. We solve this problem by designing a DCNN acceleration architecture called deep neural architecture (DNA), with reconfigurable computation patterns for different models. The computation pattern comprises a data reuse pattern and a convolution mapping method. Across layers of widely varying sizes, DNA reconfigures its data paths to support a hybrid data reuse pattern, which reduces total energy consumption by 5.9 to 8.4 times over conventional methods. For various convolution parameters, DNA reconfigures its computing resources to support a highly scalable convolution mapping method, which obtains 93% computing resource utilization on modern DCNNs. Finally, a layer-based scheduling framework is proposed to balance DNA's power efficiency and performance for different DCNNs. DNA is implemented in an area of 16 mm² in a 65-nm process. On the benchmarks, it achieves 194.4 GOPS at 200 MHz and consumes only 479 mW. The system-level power efficiency is 152.9 GOPS/W (considering DRAM access power), which outperforms state-of-the-art designs by one to two orders of magnitude.
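
As a rough illustration of the layer-based scheduling idea described above, the sketch below picks, for each layer, the data reuse pattern with the lowest modeled buffer traffic. The patterns, the traffic model, and the layer shapes are illustrative assumptions, not DNA's actual cost functions.

```python
# A minimal sketch of layer-based scheduling in the spirit of DNA: for each
# layer, pick the data reuse pattern with the lowest estimated memory traffic.
# The cost model below is a made-up approximation for illustration only.

def buffer_traffic(layer, pattern):
    """Toy model of memory traffic (arbitrary units) per reuse pattern."""
    h, w, cin, cout, k = (layer[key] for key in ("h", "w", "cin", "cout", "k"))
    ifmap = h * w * cin
    weights = k * k * cin * cout
    ofmap = h * w * cout
    if pattern == "input_reuse":    # keep inputs resident, stream weights
        return ifmap + weights * 4 + ofmap
    if pattern == "weight_reuse":   # keep weights resident, stream inputs
        return ifmap * 4 + weights + ofmap
    if pattern == "output_reuse":   # accumulate outputs on chip
        return ifmap * 2 + weights * 2 + ofmap
    raise ValueError(pattern)

def schedule(layers, patterns=("input_reuse", "weight_reuse", "output_reuse")):
    """Assign each layer the pattern that minimizes modeled traffic."""
    return [min(patterns, key=lambda p: buffer_traffic(layer, p))
            for layer in layers]

layers = [
    {"h": 56, "w": 56, "cin": 64,  "cout": 64,  "k": 3},  # early layer: big feature maps
    {"h": 7,  "w": 7,  "cin": 512, "cout": 512, "k": 3},  # late layer: big weights
]
print(schedule(layers))  # early layers favor input reuse, late layers weight reuse
```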


International Symposium on Circuits and Systems | 2015

Neural approximating architecture targeting multiple application domains

Fengbin Tu; Shouyi Yin; Peng Ouyang; Leibo Liu; Shaojun Wei

Approximate computing is emerging as a promising technique for high energy efficiency. Multi-layer perceptron (MLP) models can be used to approximate many modern applications with little quality loss. However, the variety of MLP topologies prevents a fixed hardware design from performing well in all cases. In this paper, a scheduling framework is proposed to guide the mapping of MLPs onto limited hardware resources with high performance. We then design a reconfigurable neural architecture (RNA) to support the proposed scheduling framework. RNA can be reconfigured to accelerate different MLP topologies and achieves higher performance than other MLP accelerators.
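
To make the scheduling problem concrete, here is a minimal sketch of mapping arbitrary MLP topologies onto a fixed PE array by tiling each layer's matrix-vector product. The 16×16 array size and the one-tile-per-pass cost model are assumptions for illustration; the paper's framework is more sophisticated.

```python
# A minimal sketch, under assumed hardware parameters, of mapping an arbitrary
# MLP topology onto a fixed PE array by tiling each layer's weight matrix.

import math

PE_ROWS, PE_COLS = 16, 16  # assumed PE array size

def cycles_for_layer(n_in, n_out):
    """Tile an n_out x n_in weight matrix over the PE array, one tile per pass."""
    return math.ceil(n_out / PE_ROWS) * math.ceil(n_in / PE_COLS)

def cycles_for_mlp(topology):
    """topology: layer widths, e.g. [784, 256, 64, 10] for a 3-layer MLP."""
    return sum(cycles_for_layer(n_in, n_out)
               for n_in, n_out in zip(topology, topology[1:]))

# Different topologies map to different pass counts on the same hardware.
print(cycles_for_mlp([784, 256, 64, 10]))
print(cycles_for_mlp([128, 128, 128]))
```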


Symposium on VLSI Circuits | 2017

A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications

Shouyi Yin; Peng Ouyang; Shibin Tang; Fengbin Tu; Xiudong Li; Leibo Liu; Shaojun Wei

An energy-efficient hybrid neural network (NN) processor is implemented in a 65-nm technology. It has two 16×16 reconfigurable heterogeneous processing element (PE) arrays. To accelerate a hybrid NN, the PE array supports on-demand partitioning and reconfiguration so that different NNs can be processed in parallel. To improve energy efficiency, each PE supports bit-width adaptive computing to match the varying bit-widths of different neural layers. Measurement results show that this processor achieves a peak performance of 409.6 GOPS at 200 MHz and an energy efficiency of up to 5.09 TOPS/W, outperforming the state of the art by up to 5.2× in energy efficiency.
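
The bit-width adaptive idea can be sketched in software: quantize each layer's operands to the width that layer needs, so narrower layers trade a little precision for energy. The per-layer widths and the uniform quantizer below are assumptions for illustration, not the processor's actual datapath.

```python
# A minimal sketch of bit-width adaptive computing: each layer's weights are
# quantized to a per-layer bit-width, so narrower layers cost less energy at
# the price of a small quantization error.

import numpy as np

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale).astype(np.int32), scale

layer_bits = {"conv1": 8, "conv2": 6, "fc": 4}  # hypothetical per-layer widths
rng = np.random.default_rng(0)
weights = {name: rng.standard_normal((4, 4)) for name in layer_bits}

for name, w in weights.items():
    q, scale = quantize(w, layer_bits[name])
    err = np.abs(w - q * scale).mean()
    print(f"{name}: {layer_bits[name]}-bit, mean abs error {err:.4f}")
```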


IEEE 6th Non-Volatile Memory Systems and Applications Symposium (NVMSA) | 2017

AEPE: An area and power efficient RRAM crossbar-based accelerator for deep CNNs

Shibin Tang; Shouyi Yin; Shixuan Zheng; Peng Ouyang; Fengbin Tu; Leiyue Yao; JinZhou Wu; Wenming Cheng; Leibo Liu; Shaojun Wei

Deep convolutional neural networks (CNNs) have shown great accuracy on object recognition and classification tasks. Because deep CNNs are computation-intensive, many customized RRAM crossbar-based accelerators have been proposed to meet their computing demands, but area cost and power consumption remain major challenges for such accelerators. In this work, we propose an area- and power-efficient RRAM crossbar-based accelerator for deep CNNs. It improves area efficiency by reducing the area of the on-chip buffer and the on-chip network. Power efficiency is improved by reducing the number of digital-to-analog converters (DACs) and by balancing the tradeoff between CNN accuracy and the power cost of analog-to-digital converters (ADCs). Experimental results show that the proposed accelerator improves power efficiency by 2.71× and area efficiency by 2.41× over the state-of-the-art RRAM crossbar-based accelerator, with an accuracy loss of less than 0.5%.
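
A toy model of the ADC tradeoff mentioned above: lower ADC resolution saves power (ADC power grows roughly exponentially with resolution) at the cost of quantization error in the analog dot products. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# A minimal sketch of the ADC resolution/power tradeoff in an RRAM crossbar:
# the crossbar computes analog dot products, and the ADC resolution sets how
# accurately the column currents are read out.

import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(0, 1, (64, 64))  # crossbar conductances (weights)
v = rng.uniform(0, 1, 64)        # input voltages (activations)
exact = G @ v                    # ideal analog column currents

def adc_read(currents, bits):
    """Quantize column currents as an ADC with the given resolution would."""
    levels = 2 ** bits
    step = currents.max() / (levels - 1)
    return np.round(currents / step) * step

for bits in (4, 6, 8):
    rel_err = np.abs(adc_read(exact, bits) - exact).mean() / exact.mean()
    print(f"{bits}-bit ADC: ~{2**bits} relative power units, "
          f"mean relative error {rel_err:.4%}")
```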


International Symposium on Computer Architecture | 2018

RANA: towards efficient neural acceleration with refresh-optimized embedded DRAM

Fengbin Tu; Weiwei Wu; Shouyi Yin; Leibo Liu; Shaojun Wei

The growing size of convolutional neural networks (CNNs) requires large amounts of on-chip storage. In many CNN accelerators, limited on-chip memory capacity causes massive off-chip memory access and leads to very high system energy consumption. Embedded DRAM (eDRAM), with higher density than SRAM, can be used to increase on-chip buffer capacity and reduce off-chip access. However, eDRAM requires periodic refresh to maintain data retention, which itself consumes considerable energy. Refresh is unnecessary if the data's lifetime in eDRAM is shorter than the eDRAM's retention time. Based on this principle, we propose a Retention-Aware Neural Acceleration (RANA) framework for CNN accelerators to save total system energy with refresh-optimized eDRAM. The RANA framework includes three levels of techniques: a retention-aware training method, a hybrid computation pattern, and a refresh-optimized eDRAM controller. At the training level, a CNN's error resilience is exploited during training to extend the tolerable eDRAM retention time. At the scheduling level, RANA assigns each CNN layer the computation pattern that consumes the least energy. At the architecture level, a refresh-optimized eDRAM controller eliminates unnecessary refresh operations. We implement an evaluation platform to verify RANA. Owing to the RANA framework, 99.7% of eDRAM refresh operations can be removed with negligible performance and accuracy loss. Compared with a conventional SRAM-based CNN accelerator, an eDRAM-based CNN accelerator strengthened by RANA saves 41.7% of off-chip memory access and 66.2% of system energy consumption at the same area cost.
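
The core observation behind RANA is easy to state in code: a buffer needs refresh only when its data lives in eDRAM longer than the retention time. The retention value and lifetimes below are made-up numbers for illustration.

```python
# A minimal sketch of RANA's key principle: skip eDRAM refresh for any buffer
# whose data lifetime is shorter than the retention time.

RETENTION_US = 45.0  # assumed eDRAM retention time in microseconds

# (buffer name, data lifetime in microseconds) for one hypothetical layer schedule
lifetimes = [("ifmap", 12.0), ("weights", 80.0), ("ofmap", 30.0)]

for buf, life in lifetimes:
    if life <= RETENTION_US:
        print(f"{buf}: lifetime {life}us <= retention, refresh can be skipped")
    else:
        print(f"{buf}: lifetime {life}us > retention, refresh required")
```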


Design Automation Conference | 2018

LCP: a layer clusters paralleling mapping method for accelerating inception and residual networks on FPGA

Xinhan Lin; Shouyi Yin; Fengbin Tu; Leibo Liu; Xiangyu Li; Shaojun Wei

Deep convolutional neural networks (DCNNs) have been widely used in various AI applications. Inception and Residual are two promising structures adopted in many important modern DCNN models, including AlphaGo Zero's model. These structures allow the depth and width of the network to be increased considerably to improve accuracy, without increasing the computational budget or the difficulty of convergence. Many DCNN accelerators have been built on FPGA platforms, which offer high performance, good power efficiency, and short development cycles. However, previous FPGA mapping methods cannot fully adapt to the differing data localities among layers and other characteristics of Inception and Residual structures, which leads to under-utilization of FPGA resources. We propose LCP, a Layer Clusters Paralleling mapping method, which classifies layers into clusters based on their differences in parameters and data locality, and then accelerates each cluster in a different partition of the FPGA. We evaluate our mapping method by implementing Inception/Residual modules from GoogLeNet [8] and ResNet-50 [4] on a Xilinx VC709 (Virtex 690T) FPGA. The results show that the proposed method fully utilizes resources and achieves up to 4.03× the performance of the baseline and 2.00× that of state-of-the-art methods.
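
A minimal sketch of the layer-clustering idea, under assumed layer shapes: group layers whose kernel sizes (a proxy for data locality) match, then budget each cluster its own FPGA partition in proportion to its share of the work. LCP's actual clustering criteria are richer than this.

```python
# A minimal sketch of layer clustering: group layers by kernel size and give
# each cluster a share of the FPGA's compute resources proportional to its work.
# The layer shapes are from a hypothetical Inception-like module.

from collections import defaultdict

layers = [  # (name, kernel size, output channels)
    ("b1_1x1", 1, 64), ("b2_1x1", 1, 96), ("b2_3x3", 3, 128),
    ("b3_1x1", 1, 16), ("b3_5x5", 5, 32), ("b4_1x1", 1, 32),
]

clusters = defaultdict(list)
for name, k, cout in layers:
    clusters[k].append((name, cout))  # kernel size as a crude locality proxy

total = sum(cout for _, _, cout in layers)
for k, members in sorted(clusters.items()):
    work = sum(cout for _, cout in members)
    names = [name for name, _ in members]
    print(f"{k}x{k} cluster -> {work / total:.0%} of DSP budget: {names}")
```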


International Symposium on Next Generation Electronics | 2016

A novel hardware accelerator guideline for ANN with high performance

Tianbao Chen; Shouyi Yin; Peng Ouyang; Fengbin Tu; Leibo Liu; Shaojun Wei

Artificial neural networks (ANNs) are widely used in machine learning and artificial intelligence. However, ANNs require long running times and high power consumption when running on a GPU or CPU, which may hinder their application in embedded systems. This paper proposes a hardware accelerator design guideline for ANNs of arbitrary scale and depth, taking full account of hardware scale, performance, and power consumption.


Design, Automation and Test in Europe | 2015

RNA: a reconfigurable architecture for hardware neural acceleration

Fengbin Tu; Shouyi Yin; Peng Ouyang; Leibo Liu; Shaojun Wei

As energy has become a major concern in digital system design, one promising solution is to combine the core processor with a multi-purpose accelerator targeting high-performance applications. Many modern applications can be approximated by multi-layer perceptron (MLP) models with little quality loss. However, many current MLP accelerators suffer from drawbacks such as an imbalance between performance and flexibility. In this paper, we propose a scheduling framework to guide the mapping of MLPs onto limited hardware resources with high performance. The framework addresses the main constraints of hardware neural acceleration. Furthermore, we implement a reconfigurable neural architecture (RNA) based on this framework, whose computing pattern can be reconfigured for different MLP topologies. RNA achieves performance comparable to application-specific accelerators and greater flexibility than other hardware MLPs.


IEEE Journal of Solid-State Circuits | 2018

A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications

Shouyi Yin; Peng Ouyang; Shibin Tang; Fengbin Tu; Xiudong Li; Shixuan Zheng; Tianyi Lu; Jiangyuan Gu; Leibo Liu; Shaojun Wei


International Symposium on Circuits and Systems | 2018

An Energy Efficient JPEG Encoder with Neural Network Based Approximation and Near-Threshold Computing

Zhihui Wang; Shouyi Yin; Fengbin Tu; Leibo Liu; Shaojun Wei
