Tianqi Tang
Tsinghua University
Publications
Featured research published by Tianqi Tang.
field programmable gate arrays | 2016
Jiantao Qiu; Jie Wang; Song Yao; Kaiyuan Guo; Boxun Li; Erjin Zhou; Jincheng Yu; Tianqi Tang; Ningyi Xu; Sen Song; Yu Wang; Huazhong Yang
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computationally intensive and resource-consuming, and are thus hard to integrate into embedded systems such as smart phones, smart glasses, and robots. The FPGA is one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNN. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric and fully-connected layers are memory-centric. We then propose a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in a CNN to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented end-to-end on FPGA so far. The system on a Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.
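A minimal sketch (Python/NumPy, not the authors' code) of the per-layer dynamic-precision idea described above: for each layer, search over candidate positions of the radix point and keep the one that minimizes quantization error for a fixed total bit width. The candidate range and layer names are illustrative assumptions.

```python
import numpy as np

def quantize_fixed_point(x, total_bits, frac_bits):
    """Round x to a signed fixed-point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    qmax = 2 ** (total_bits - 1) - 1
    qmin = -2 ** (total_bits - 1)
    return np.clip(np.round(x * scale), qmin, qmax) / scale

def choose_frac_bits(layer_data, total_bits=8, candidates=range(-2, 12)):
    """Pick the fractional bit width that minimizes the L2 quantization error
    for one layer's weights or activations (the dynamic-precision idea)."""
    best_fb, best_err = None, np.inf
    for fb in candidates:
        err = np.linalg.norm(layer_data - quantize_fixed_point(layer_data, total_bits, fb))
        if err < best_err:
            best_fb, best_err = fb, err
    return best_fb

# Hypothetical usage: each layer gets its own radix point.
weights = {"conv1": np.random.randn(64, 3, 3, 3), "fc1": 0.05 * np.random.randn(4096, 512)}
frac_bits = {name: choose_frac_bits(w, total_bits=8) for name, w in weights.items()}
print(frac_bits)
```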
design automation conference | 2016
Lixue Xia; Tianqi Tang; Wenqin Huangfu; Ming Cheng; Xiling Yin; Boxun Li; Yu Wang; Huazhong Yang
The Convolutional Neural Network (CNN) is a powerful technique widely used in computer vision, but it demands far more computation and memory resources than traditional solutions. The emerging metal-oxide resistive random-access memory (RRAM) and RRAM crossbar have shown great potential for neuromorphic applications with high energy efficiency. However, the interfaces between analog RRAM crossbars and digital peripheral functions, namely Analog-to-Digital Converters (ADCs) and Digital-to-Analog Converters (DACs), consume most of the area and energy of RRAM-based CNN designs due to the large amount of intermediate data in a CNN. In this paper, we propose an energy-efficient structure for RRAM-based CNN. Based on an analysis of the data distribution, a quantization method is proposed to reduce the intermediate data to 1 bit and eliminate the DACs. An energy-efficient structure that uses the input data as selection signals is proposed to reduce the ADC cost when merging results from multiple crossbars. The experimental results show that the proposed method and structure can save 80% of the area and more than 95% of the energy while maintaining the same or comparable CNN classification accuracy on MNIST.
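A rough sketch, not the paper's implementation, of the 1-bit intermediate-data idea: once activations are binarized, crossbar inputs become digital 0/1 wordline signals and no DAC is needed. The threshold below (the batch mean) is an assumption for illustration; the paper derives the quantization from the data distribution.

```python
import numpy as np

def binarize_activations(x, threshold=None):
    """Quantize intermediate feature maps to 1 bit so they can drive an
    RRAM crossbar directly as 0/1 wordline signals (no DAC needed).
    The threshold here is illustrative only."""
    if threshold is None:
        threshold = x.mean()
    return (x > threshold).astype(np.uint8)

def crossbar_mac(binary_inputs, conductance_matrix):
    """Ideal crossbar multiply-accumulate: with 1-bit inputs, each output column
    current is simply the sum of conductances on the selected rows."""
    return binary_inputs @ conductance_matrix

x = np.random.rand(1, 128)    # intermediate data of one layer
G = np.random.rand(128, 64)   # programmed RRAM conductances
y = crossbar_mac(binarize_activations(x), G)
```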
great lakes symposium on vlsi | 2015
Yu Wang; Tianqi Tang; Lixue Xia; Boxun Li; Peng Gu; Huazhong Yang; Hai Li; Yuan Xie
Inspired by the human brain's function and efficiency, neuromorphic computing offers a promising solution for a wide set of tasks, ranging from brain-machine interfaces to real-time classification. The spiking neural network (SNN), which encodes and processes information with bionic spikes, is an emerging neuromorphic model with great potential to drastically improve the performance and efficiency of computing systems. However, the lack of an energy-efficient hardware implementation and the difficulty of training the model significantly limit the application of spiking neural networks. In this work, we address these issues by building an SNN-based, energy-efficient system for real-time classification with metal-oxide resistive switching random-access memory (RRAM) devices. We implement different SNN training algorithms, including Spike-Timing-Dependent Plasticity (STDP) and the Neural Sampling method. Our RRAM SNN systems for these two training algorithms show good power efficiency and recognition performance on real-time classification tasks such as MNIST digit recognition. Finally, we propose a possible direction to further improve the classification accuracy by boosting multiple SNNs.
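For context, a simplified pair-based STDP update (one of the two training approaches mentioned), written in Python for illustration only; the time constants, learning rates, and weight bounds are made-up values, not the paper's parameters.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes the
    postsynaptic spike, depress otherwise. Times in ms; constants illustrative."""
    dt = t_post - t_pre
    if dt >= 0:
        dw = a_plus * np.exp(-dt / tau)
    else:
        dw = -a_minus * np.exp(dt / tau)
    return np.clip(w + dw, 0.0, 1.0)

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=14.0)   # causal pair -> potentiation
w = stdp_update(w, t_pre=30.0, t_post=22.0)   # anti-causal pair -> depression
```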
asia and south pacific design automation conference | 2015
Peng Gu; Boxun Li; Tianqi Tang; Shimeng Yu; Yu Cao; Yu Wang; Huazhong Yang
The matrix-vector multiplication is the key operation for many computationally intensive algorithms. In recent years, the emerging metal-oxide resistive switching random access memory (RRAM) device and RRAM crossbar array have demonstrated a promising hardware realization of the analog matrix-vector multiplication with ultra-high energy efficiency. In this paper, we analyze the impact of the nonlinear voltage-current relationship of RRAM devices, the interconnect resistance, and other crossbar array parameters on circuit performance, and present a design guide. On top of that, we propose a technological exploration flow for device parameter configuration to overcome the impact of nonideal factors and achieve a better trade-off among performance, energy, and reliability for each specific application. The simulation results of a support vector machine (SVM) on the MNIST pattern recognition dataset show that the RRAM crossbar array-based SVM is robust to input signal fluctuation but sensitive to tunneling-gap deviation. A further resistance-resolution test shows that a 4-bit RRAM device is able to realize a recognition accuracy of ∼90%, indicating the physical feasibility of an RRAM crossbar array-based SVM. In addition, the proposed technological exploration flow achieves a 10.98% improvement in recognition accuracy on the MNIST dataset and 26.4% energy savings compared with previous work.
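The sketch below contrasts the idealized crossbar matrix-vector product with a toy nonlinear device I-V curve, to illustrate why the analysis in this paper is needed. The sinh-like model form and its constant are assumptions for illustration, not the paper's calibrated device parameters, and wire resistance (which the paper also models) is ignored here.

```python
import numpy as np

def crossbar_mvm_ideal(G, v):
    """Ideal crossbar: output currents are the linear product I = G^T v."""
    return G.T @ v

def device_current_nonlinear(g, v, alpha=2.0):
    """Toy nonlinear I-V for a single RRAM cell: reduces to g*v for small v.
    Illustrative only; not the paper's device model."""
    return g * np.sinh(alpha * v) / alpha

def crossbar_mvm_nonlinear(G, v):
    """Column currents when each cell follows the nonlinear I-V above."""
    return np.array([device_current_nonlinear(G[:, j], v).sum() for j in range(G.shape[1])])

G = np.random.rand(64, 16)       # normalized conductances
v = 0.2 * np.random.rand(64)     # input voltages
print(np.linalg.norm(crossbar_mvm_ideal(G, v) - crossbar_mvm_nonlinear(G, v)))
```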
design, automation, and test in europe | 2016
Lixue Xia; Boxun Li; Tianqi Tang; Peng Gu; Xiling Yin; Wenqin Huangfu; Pai Yu Chen; Shimeng Yu; Yu Cao; Yu Wang; Yuan Xie; Huazhong Yang
Memristor-based neuromorphic computing systems provide a promising solution to significantly boost the power efficiency of computing. Such systems have a wide range of design choices, such as the various memristor crossbar cell designs and the different parallelism degrees of the peripheral circuits. However, a memristor-based neuromorphic computing system simulator that can model the system and enable early-stage design space exploration is still missing. In this paper, we develop a memristor-based neuromorphic system simulation platform (MNSIM). MNSIM proposes a general hierarchical structure for memristor-based neuromorphic computing systems and provides a flexible interface for users to customize the design. MNSIM also provides a detailed reference design for large-scale applications. MNSIM embeds estimation models of area, power, and latency to simulate the performance of the system. To estimate the computing accuracy, MNSIM proposes a behavior-level model relating the computing error rate to the crossbar design parameters, considering the influence of interconnect lines and non-ideal device factors. The error rate between our accuracy model and SPICE simulation results is less than 1%. Experimental results show that MNSIM achieves a speed-up of more than 7000× compared with SPICE while obtaining reasonable accuracy. MNSIM can further estimate the trade-offs between computing accuracy, energy, latency, and area among different designs for optimization.
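A toy, behavior-level cost roll-up in the spirit of the simulator described above (not MNSIM's actual models): per-component area, power, and latency entries are summed over a design description. All component names and figures below are placeholders.

```python
# Toy hierarchical area/power/latency roll-up; all numbers are placeholders.
COMPONENT_COSTS = {                 # (area_um2, power_uW, latency_ns) per instance
    "crossbar_128x128": (2500.0, 300.0, 5.0),
    "adc_8bit":         (3000.0, 2000.0, 1.0),
    "dac_1bit":         (200.0, 50.0, 0.5),
    "adder_tree":       (800.0, 120.0, 0.8),
}

def estimate(design):
    """design: dict mapping component name -> instance count.
    Area and power add up; latency is naively summed once per component type,
    as if each type formed one pipeline stage."""
    area = sum(COMPONENT_COSTS[c][0] * n for c, n in design.items())
    power = sum(COMPONENT_COSTS[c][1] * n for c, n in design.items())
    latency = sum(COMPONENT_COSTS[c][2] for c in design)
    return {"area_um2": area, "power_uW": power, "latency_ns": latency}

print(estimate({"crossbar_128x128": 16, "adc_8bit": 4, "dac_1bit": 128, "adder_tree": 2}))
```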
design, automation, and test in europe | 2015
Tianqi Tang; Lixue Xia; Boxun Li; Rong Luo; Yiran Chen; Yu Wang; Huazhong Yang
The spiking neural network (SNN) provides a promising solution to drastically improve the performance and efficiency of computing systems. Previous work on SNNs has mainly focused on increasing the scalability and level of realism of neural simulations, while few efforts support practical cognitive applications with acceptable performance. At the same time, based on traditional CMOS technology, the efficiency of SNN systems is also unsatisfactory. In this work, we explore different SNN training algorithms for real-world applications, and demonstrate that the Neural Sampling method is much more effective than Spike-Timing-Dependent Plasticity (STDP) and the Remote Supervision Method (ReSuMe). We also propose an energy-efficient implementation of the SNN with emerging metal-oxide resistive random access memory (RRAM) devices, which includes an RRAM crossbar array that serves as the network synapses, an analog design of the spiking neuron, and an input encoding scheme. A parameter mapping algorithm is also introduced to configure the RRAM-based SNN. Simulation results illustrate that the system achieves 91.2% accuracy on the MNIST dataset with an ultra-low power consumption of 3.5 mW. Moreover, the RRAM-based SNN system demonstrates great robustness to 20% process variation with less than 1% accuracy decrease, and can tolerate 20% signal fluctuation with about 2% accuracy loss. These results indicate that the RRAM-based SNN can readily be realized physically.
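A common way to map signed, trained synaptic weights onto bounded RRAM conductances is to use a positive/negative crossbar pair; the sketch below illustrates that general idea, not the specific parameter mapping algorithm of the paper. The conductance window values are assumptions.

```python
import numpy as np

def map_weights_to_conductance(W, g_min=1e-6, g_max=1e-4):
    """Split signed weights into positive/negative parts and scale each into
    the device conductance window [g_min, g_max]. Illustrative mapping only."""
    scale = (g_max - g_min) / np.abs(W).max()
    G_pos = g_min + scale * np.clip(W, 0, None)
    G_neg = g_min + scale * np.clip(-W, 0, None)
    return G_pos, G_neg

def crossbar_output(G_pos, G_neg, v):
    """The effective weighted sum is the difference of the two crossbars' column currents."""
    return (G_pos - G_neg).T @ v

W = 0.1 * np.random.randn(256, 10)
G_pos, G_neg = map_weights_to_conductance(W)
y = crossbar_output(G_pos, G_neg, np.random.rand(256))
```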
international symposium on circuits and systems | 2016
Yu Wang; Lixue Xia; Tianqi Tang; Boxun Li; Song Yao; Ming Cheng; Huazhong Yang
Deep learning, and especially the Convolutional Neural Network (CNN), is among the most powerful and widely used techniques in computer vision. Applications range from image classification to object detection, segmentation, Optical Character Recognition (OCR), etc. At the same time, CNNs are both computationally intensive and memory intensive, making them difficult to deploy on low-power, lightweight embedded systems. In this work, we introduce an on-chip convolutional neural network implementation for low-power embedded systems. We point out that the high precision of weights limits low-power CNN implementations on both FPGA and RRAM platforms. A dynamic quantization method is introduced to reduce the precision while maintaining the same or comparable accuracy. Finally, the detailed designs of a low-power FPGA-based CNN and an RRAM-based CNN are provided and compared. The results show that the FPGA-based design achieves 2× the energy efficiency of a GPU implementation, and the RRAM-based design can further obtain more than 40× energy efficiency gains.
international symposium on circuits and systems | 2015
Deming Zhang; Lang Zeng; Yuanzhuo Qu; Youguang Zhang; Mengxing Wang; Weisheng Zhao; Tianqi Tang; Yu Wang
Recently, the magnetic tunnel junction with in-plane magnetization (i-MTJ) has been exploited to behave as a binary stochastic synapse. However, it suffers from a limited number of synaptic weight levels, resulting in inaccurate learning. In this work, a compound synapse that employs multiple perpendicular MTJs (p-MTJs) in series is proposed. It possesses an analog-like synaptic weight under weak programming conditions, which leads to a stochastic learning rule and low power consumption per synaptic event. System-level simulations on the MNIST database demonstrate that such compound spin synapses can realize stochastic neuromorphic computation with high accuracy and low energy consumption.
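A behavioral sketch of the compound-synapse idea: several binary stochastic junctions form one synapse, each switching with some probability under a weak programming pulse, so the aggregate weight takes analog-like values. The switching probabilities and junction count below are arbitrary placeholders, not the paper's device parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def program_compound_synapse(states, p_switch=0.1, direction=+1):
    """states: 0/1 array for the N binary MTJs forming one compound synapse.
    Under a weak pulse, each junction flips toward `direction` with probability
    p_switch, giving a stochastic, analog-like weight update (illustrative values)."""
    flips = rng.random(states.shape) < p_switch
    target = 1 if direction > 0 else 0
    return np.where(flips, target, states)

def synaptic_weight(states):
    """Effective weight is the fraction of junctions in the high state (N+1 levels)."""
    return states.mean()

states = np.zeros(8, dtype=int)            # 8 junctions -> 9 weight levels
for _ in range(5):
    states = program_compound_synapse(states, p_switch=0.2, direction=+1)
print(synaptic_weight(states))
```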
asia and south pacific design automation conference | 2017
Tianqi Tang; Lixue Xia; Boxun Li; Yu Wang; Huazhong Yang
Recent progress in the machine learning field has enabled low bit-level Convolutional Neural Networks (CNNs), even CNNs with binary weights and binary neurons, to achieve satisfying recognition accuracy on the ImageNet dataset. Binary CNNs (BCNNs) make it possible to introduce low bit-level RRAM devices and low bit-level ADC/DAC interfaces in RRAM-based Computing System (RCS) design, which leads to faster read-and-write operations and better energy efficiency than before. However, some design challenges still exist: (1) how to split the weight matrix when one crossbar is not large enough to hold all the parameters of one layer; (2) how to design the pipeline to accelerate the whole CNN forward process. In this paper, an RRAM crossbar-based accelerator is proposed for the BCNN forward process, and the design considerations specific to BCNNs, especially the matrix-splitting problem and the pipeline implementation, are discussed in detail. In our experiments, BCNNs on RRAM show much smaller accuracy loss than multi-bit CNNs for LeNet on MNIST when device variation is considered. For AlexNet on ImageNet, the RRAM-based BCNN accelerator saves 58.2% energy consumption and 56.8% area compared with a multi-bit CNN structure.
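A minimal illustration of the matrix-splitting problem: when a layer's binary weight matrix exceeds the crossbar size, it is tiled across multiple crossbars and the partial column sums are merged digitally. The straightforward row/column tiling below is a sketch under an assumed 128x128 crossbar, not the paper's specific splitting scheme.

```python
import numpy as np

CROSSBAR_ROWS, CROSSBAR_COLS = 128, 128   # assumed crossbar size

def split_matrix(W):
    """Tile a large binary weight matrix into crossbar-sized blocks."""
    tiles = []
    for r in range(0, W.shape[0], CROSSBAR_ROWS):
        for c in range(0, W.shape[1], CROSSBAR_COLS):
            tiles.append(((r, c), W[r:r + CROSSBAR_ROWS, c:c + CROSSBAR_COLS]))
    return tiles

def tiled_matvec(W, x):
    """Compute W.T @ x by accumulating each tile's partial result digitally."""
    y = np.zeros(W.shape[1])
    for (r, c), tile in split_matrix(W):
        y[c:c + tile.shape[1]] += tile.T @ x[r:r + tile.shape[0]]
    return y

W = np.random.randint(0, 2, size=(512, 384))   # binary weights of one layer
x = np.random.randint(0, 2, size=512)          # binary activations
assert np.allclose(tiled_matvec(W, x), W.T @ x)
```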
international conference on computer aided design | 2015
Yung-Hsiang Lu; Alan M. Kadin; Alexander C. Berg; Thomas M. Conte; Erik P. DeBenedictis; Rachit Garg; Ganesh Gingade; Bichlien Hoang; Yongzhen Huang; Boxun Li; Jingyu Liu; Wei Liu; Huizi Mao; Junran Peng; Tianqi Tang; Elie K. Track; Jingqiu Wang; Tao Wang; Yu Wang; Jun Yao
“Rebooting Computing” (RC) is an effort in the IEEE to rethink future computers. RC was started in 2012 by its co-chairs, Elie Track (IEEE Council on Superconductivity) and Tom Conte (Computer Society). RC takes a holistic approach, considering revolutionary as well as evolutionary solutions needed to advance computer technologies. Three summits were held in 2013 and 2014, discussing different technologies, from emerging devices to user interfaces, from security to energy efficiency, and from neuromorphic to reversible computing. The first part of this paper introduces RC to the design automation community and solicits revolutionary ideas from the community for the directions of future computer research. Energy efficiency is identified as one of the most important challenges in future computer technologies. The importance of energy efficiency spans from miniature embedded sensors to wearable computers, and from individual desktops to data centers. To gauge the state of the art, the RC Committee organized the first Low Power Image Recognition Challenge (LPIRC). Each image contains one or multiple objects drawn from 200 categories, and a contestant has to provide a working system that can recognize the objects and report their bounding boxes. The second part of this paper explains LPIRC and the solutions from the top two winners.