Boxun Li
Tsinghua University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Boxun Li.
field programmable gate arrays | 2016
Jiantao Qiu; Jie Wang; Song Yao; Kaiyuan Guo; Boxun Li; Erjin Zhou; Jincheng Yu; Tianqi Tang; Ningyi Xu; Sen Song; Yu Wang; Huazhong Yang
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are com-putational-intensive and resource-consuming, and thus are hard to be integrated into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNN, but the limited bandwidth and on-chip memory size limit the performance of FPGA accelerator for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computational-centric and Fully-Connected layers are memory-centric. Then the dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNN are proposed to improve the bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure a high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on Xilinx Zynq ZC706 board achieves a frame rate at 4.45 fps with the top-5 accuracy of 86.66% using 16-bit quantization. The average performance of convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under 150MHz working frequency, which outperform previous approaches significantly.
international symposium on low power electronics and design | 2013
Boxun Li; Yi Shan; Miao Hu; Yu Wang; Yiran Chen; Huazhong Yang
The cessation of Moores Law has limited further improvements in power efficiency. In recent years, the physical realization of the memristor has demonstrated a promising solution to ultra-integrated hardware realization of neural networks, which can be leveraged for better performance and power efficiency gains. In this work, we introduce a power efficient framework for approximated computations by taking advantage of the memristor-based multilayer neural networks. A programmable memristor approximated computation unit (Memristor ACU) is introduced first to accelerate approximated computation and a memristor-based approximated computation framework with scalability is proposed on top of the Memristor ACU. We also introduce a parameter configuration algorithm of the Memristor ACU and a feedback state tuning circuit to program the Memristor ACU effectively. Our simulation results show that the maximum error of the Memristor ACU for 6 common complex functions is only 1.87% while the state tuning circuit can achieve 12-bit precision. The implementation of HMAX model atop our proposed memristor-based approximated computation framework demonstrates 22× power efficiency improvements than its pure digital implementation counterpart.
asia and south pacific design automation conference | 2014
Boxun Li; Yuzhi Wang; Yu Wang; Yiran Chen; Huazhong Yang
The artificial neural network (ANN) is among the most widely used methods in data processing applications. The memristor-based neural network further demonstrates a power efficient hardware realization of ANN. Training phase is the critical operation of memristor-based neural network. However, the traditional training method for memristor-based neural network is time consuming and energy inefficient. Users have to first work out the parameters of memristors through digital computing systems and then tune the memristor to the corresponding state. In this work, we introduce a mixed-signal training acceleration framework, which realizes the self-training of memristor-based neural network. We first modify the original stochastic gradient descent algorithm by approximating calculations and designing an alternative computing method. We then propose a mixed-signal acceleration architecture for the modified training algorithm by equipping the original memristor-based neural network architecture with the copy crossbar technique, weight update units, sign calculation units and other assistant units. The experiment on the MNIST database demonstrates that the proposed mixed-signal acceleration is 3 orders of magnitude faster and 4 orders of magnitude more energy efficient than the CPU implementation counterpart at the cost of a slight decrease of the recognition accuracy (<; 5%).
design automation conference | 2015
Xiaoxiao Liu; Mengjie Mao; Beiye Liu; Hai Li; Yiran Chen; Boxun Li; Yu Wang; Hao Jiang; Mark Barnell; Qing Wu; Jianhua Yang
Neuromorphic computing is recently gaining significant attention as a promising candidate to conquer the well-known von Neumann bottleneck. In this work, we propose RENO - a efficient reconfigurable neuromorphic computing accelerator. RENO leverages the extremely efficient mixed-signal computation capability of memristor-based crossbar (MBC) arrays to speedup the executions of artificial neural networks (ANNs). The hierarchically arranged MBC arrays can be configured to a variety of ANN topologies through a mixed-signal interconnection network (M-Net). Simulation results on seven ANN applications show that compared to the baseline general-purpose processor, RENO can achieve on average 178.4× (27.06×) performance speedup and 184.2× (25.23×) energy savings in high-efficient multilayer perception (high-accurate auto-associative memory) implementation. Moreover, in the comparison to a pure digital neural processing unit (D-NPU) and a design with MBC arrays co-operating through a digital interconnection network, RENO still achieves the fastest execution time and the lowest energy consumption with similar computation accuracy.
design automation conference | 2015
Boxun Li; Lixue Xia; Peng Gu; Yu Wang; Huazhong Yang
The invention of resistive-switching random access memory (RRAM) devices and RRAM crossbar-based computing system (RCS) demonstrate a promising solution for better performance and power efficiency. The interfaces between analog and digital units, especially AD/DAs, take up most of the area and power consumption of RCS and are always the bottleneck of mixed-signal computing systems. In this work, we propose a novel architecture, MEI, to minimize the overhead of AD/DA by MErging the Interface into the RRAM cross-bar. An optional ensemble method, the Serial Array Adaptive Boosting (SAAB), is also introduced to take advantage of the area and power saved by MEI and boost the accuracy and robustness of RCS. On top of these two methods, a design space exploration is proposed to achieve trade-offs among accuracy, area, and power consumption. Experimental results on 6 diverse benchmarks demonstrate that, compared with the traditional architecture with AD/DAs, MEI is able to save 54.63%~86.14% area and reduce 61.82%~86.80% power consumption under quality guarantees; and SAAB can further improve the accuracy by 5.76% on average and ensure the system performance under noisy conditions.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015
Boxun Li; Peng Gu; Yi Shan; Yu Wang; Yiran Chen; Huazhong Yang
Approximate computing is a promising design paradigm for better performance and power efficiency. In this paper, we propose a power efficient framework for analog approximate computing with the emerging metal-oxide resistive switching random-access memory (RRAM) devices. A programmable RRAM-based approximate computing unit (RRAM-ACU) is introduced first to accelerate approximated computation, and an approximate computing framework with scalability is then proposed on top of the RRAM-ACU. In order to program the RRAM-ACU efficiently, we also present a detailed configuration flow, which includes a customized approximator training scheme, an approximator-parameter-to-RRAM-state mapping algorithm, and an RRAM state tuning scheme. Finally, the proposed RRAM-based computing framework is modeled at system level. A predictive compact model is developed to estimate the configuration overhead of RRAM-ACU and help explore the application scenarios of RRAM-based analog approximate computing. The simulation results on a set of diverse benchmarks demonstrate that, compared with a x86-64 CPU at 2 GHz, the RRAM-ACU is able to achieve 4.06-196.41× speedup and power efficiency of 24.59-567.98 GFLOPS/W with quality loss of 8.72% on average. And the implementation of hierarchical model and X application demonstrates that the proposed RRAM-based approximate computing framework can achieve 12.8× power efficiency than its pure digital implementation counterparts (CPU, graphics processing unit, and field- programmable gate arrays).
international symposium on neural networks | 2014
Boxun Li; Erjin Zhou; Bo Huang; Jiayi Duan; Yu Wang; Ningyi Xu; Jiaxing Zhang; Huazhong Yang
Large scale artificial neural networks (ANNs) have been widely used in data processing applications. The recurrent neural network (RNN) is a special type of neural network equipped with additional recurrent connections. Such a unique architecture enables the recurrent neural network to remember the past processed information and makes it an expressive model for nonlinear sequence processing tasks. However, the large computation complexity makes it difficult to effectively train a recurrent neural network and therefore significantly limits the research on the recurrent neural network in the last 20 years. In recent years, the use of graphics processing units (GPUs) becomes a significant advance to speed up the training process of large scale neural networks by taking advantage of the massive parallelism capabilities of GPUs. In this paper, we propose an efficient GPU implementation of the large scale recurrent neural network and demonstrate the power of scaling up the recurrent neural network with GPUs. We first explore the potential parallelism of the recurrent neural network and propose a fine-grained two-stage pipeline implementation. Experiment results show that the proposed GPU implementation can achieve 2 ~ 11 x speed-up compared with the basic CPU implementation with the Intel Math Kernel Library. We then use the proposed GPU implementation to scale up the recurrent neural network and improve its performance. The experiment results of the Microsoft Research Sentence Completion Challenge demonstrate that the large scale recurrent network without class layer is able to beat the traditional class-based modest-size recurrent network and achieve an accuracy of 47%, the best result achieved by a single recurrent neural network on the same dataset.
design automation conference | 2016
Lixue Xia; Tianqi Tang; Wenqin Huangfu; Ming Cheng; Xiling Yin; Boxun Li; Yu Wang; Huazhong Yang
Convolutional Neural Network (CNN) is a powerful technique widely used in computer vision area, which also demands much more computations and memory resources than traditional solutions. The emerging metal-oxide resistive random-access memory (RRAM) and RRAM crossbar have shown great potential on neuromorphic applications with high energy efficiency. However, the interfaces between analog RRAM crossbars and digital peripheral functions, namely Analog-to-Digital Converters (AD-Cs) and Digital-to-Analog Converters (DACs), consume most of the area and energy of RRAM-based CNN design due to the large amount of intermediate data in CNN. In this paper, we propose an energy efficient structure for RRAM-based CNN. Based on the analysis of data distribution, a quantization method is proposed to transfer the intermediate data into 1 bit and eliminate DACs. An energy efficient structure using input data as selection signals is proposed to reduce the ADC cost for merging results of multiple crossbars. The experimental results show that the proposed method and structure can save 80% area and more than 95% energy while maintaining the same or comparable classification accuracy of CNN on MNIST.
great lakes symposium on vlsi | 2015
Yu Wang; Tianqi Tang; Lixue Xia; Boxun Li; Peng Gu; Huazhong Yang; Hai Li; Yuan Xie
Inspired by the human brains function and efficiency, neuromorphic computing offers a promising solution for a wide set of tasks, ranging from brain machine interfaces to real-time classification. The spiking neural network (SNN), which encodes and processes information with bionic spikes, is an emerging neuromorphic model with great potential to drastically promote the performance and efficiency of computing systems. However, an energy efficient hardware implementation and the difficulty of training the model significantly limit the application of the spiking neural network. In this work, we address these issues by building an SNN-based energy efficient system for real time classification with metal-oxide resistive switching random-access memory (RRAM) devices. We implement different training algorithms of SNN, including Spiking Time Dependent Plasticity (STDP) and Neural Sampling method. Our RRAM SNN systems for these two training algorithms show good power efficiency and recognition performance on realtime classification tasks, such as the MNIST digit recognition. Finally, we propose a possible direction to further improve the classification accuracy by boosting multiple SNNs.
asia and south pacific design automation conference | 2015
Peng Gu; Boxun Li; Tianqi Tang; Shimeng Yu; Yu Cao; Yu Wang; Huazhong Yang
The matrix-vector multiplication is the key operation for many computationally intensive algorithms. In recent years, the emerging metal oxide resistive switching random access memory (RRAM) device and RRAM crossbar array have demonstrated a promising hardware realization of the analog matrix-vector multiplication with ultra-high energy efficiency. In this paper, we analyze the impact of nonlinear voltage-current relationship of RRAM devices and the interconnect resistance as well as other crossbar array parameters on the circuit performance and present a design guide. On top of that, we propose a technological exploration flow for device parameter configuration to overcome the impact of nonideal factors and achieve a better trade-off among performance, energy and reliability for each specific application. The simulation results of a support vector machine (SVM) and MNIST pattern recognition dataset show that the RRAM crossbar array-based SVM is robust to the input signal fluctuation but sensitive to the tunneling gap deviation. A further resistance resolution test presents that a 4-bit RRAM device is able to realize a recognition accuracy of ∼ 90%, indicating the physical feasibility of RRAM crossbar array-based SVM. In addition, the proposed technological exploration flow is able to achieve 10.98% improvement of recognition accuracy on the MNIST dataset and 26.4% energy savings compared with previous work.