Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ao Ren is active.

Publication


Featured research published by Ao Ren.


Architectural Support for Programming Languages and Operating Systems | 2017

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

Ao Ren; Zhe Li; Caiwen Ding; Qinru Qiu; Yanzhi Wang; Ji Li; Xuehai Qian; Bo Yuan

With the recent advance of wearable devices and the Internet of Things (IoT), it becomes attractive to implement Deep Convolutional Neural Networks (DCNNs) in embedded and portable systems. Currently, executing software-based DCNNs requires high-performance servers, restricting widespread deployment on embedded and mobile IoT devices. To overcome this obstacle, considerable research effort has gone into developing highly parallel and specialized DCNN accelerators using GPGPUs, FPGAs, or ASICs. Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] by counting the number of ones in the stream, has high potential for implementing DCNNs with high scalability and an ultra-low hardware footprint. Since multiplications and additions can be calculated using AND gates and multiplexers in SC, significant reductions in power (energy) and hardware footprint can be achieved compared to conventional binary arithmetic implementations. These tremendous savings in power (energy) and hardware resources open an immense design space for enhancing the scalability and robustness of hardware DCNNs. This paper presents SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs, built using a bottom-up approach. We first present designs of the function blocks that perform the basic DCNN operations: inner product, pooling, and activation. We then propose four designs of feature extraction blocks, which extract features from input feature maps, by connecting different basic function blocks with joint optimization. Moreover, efficient weight storage methods are proposed to reduce area and power (energy) consumption. Putting it all together, with carefully selected feature extraction blocks, SC-DCNN is holistically optimized to minimize area and power (energy) consumption while maintaining high network accuracy. Experimental results demonstrate that LeNet-5 implemented in SC-DCNN consumes only 17 mm^2 of area and 1.53 W of power, and achieves a throughput of 781,250 images/s, an area efficiency of 45,946 images/s/mm^2, and an energy efficiency of 510,734 images/J.
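
As a rough illustration of the SC arithmetic described above, the following NumPy sketch simulates a unipolar bit-stream (values in [0, 1]), where an AND gate multiplies and a 2-to-1 multiplexer performs scaled addition; the bipolar format used for [-1, 1] replaces the AND with an XNOR. The stream length and input values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096  # bit-stream length (illustrative)

def encode(p, n=N):
    """Unipolar SC encoding: value p in [0, 1] -> stream with P(bit=1) = p."""
    return rng.random(n) < p

def decode(stream):
    """Recover the value by counting the ones in the stream."""
    return stream.mean()

a, b = 0.6, 0.5
sa, sb = encode(a), encode(b)

product = sa & sb                      # AND gate: P(1) = a * b
select = encode(0.5)                   # select line of a 2-to-1 MUX
scaled_sum = np.where(select, sa, sb)  # MUX: P(1) = (a + b) / 2

print(decode(product), a * b)           # ~0.30 vs 0.30
print(decode(scaled_sum), (a + b) / 2)  # ~0.55 vs 0.55
```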


International Conference on Computer Design | 2016

DSCNN: Hardware-oriented optimization for Stochastic Computing based Deep Convolutional Neural Networks

Zhe Li; Ao Ren; Ji Li; Qinru Qiu; Yanzhi Wang; Bo Yuan

Deep Convolutional Neural Networks (DCNNs), a branch of Deep Neural Networks that use deep graphs with multiple processing layers, enable convolutional models to finely abstract the high-level features behind an image. Large-scale DCNN applications mainly run on high-performance server clusters, GPUs, or FPGA clusters; their high power/energy consumption makes it difficult to extend these applications to mobile/wearable devices and Internet-of-Things (IoT) entities. Stochastic Computing (SC), used in specialized hardware-based systems, is a promising method to overcome this shortcoming. Many complex arithmetic operations can be implemented with very simple hardware logic in the SC framework, which alleviates the extensive computational complexity. Network-wide optimization and the revision of the network structure with respect to SC-based hardware design have not been discussed in previous work. In this paper, we investigate the Deep Stochastic Convolutional Neural Network (DSCNN), a DCNN implemented with stochastic computing. The essential calculation components using SC are designed and evaluated. We propose a joint optimization method that coordinates components to guarantee high calculation accuracy in each stage of the network. The structure of the original DSCNN is revised to accommodate the simplicity of SC hardware designs. Experimental results show that, compared with a software-inspired feature extraction block in DSCNN, an optimized hardware-oriented feature extraction block achieves up to 59.27% higher calculation precision, and the optimized DSCNN achieves a network test error rate of only 3.48%, compared to 27.83% for the baseline DSCNN using the software-inspired feature extraction block.
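
The calculation precision that such joint optimization targets depends heavily on bit-stream length. A minimal sketch of that dependence, assuming bipolar encoding with XNOR multiplication (lengths and sample counts chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def sc_mul_bipolar(x, y, n):
    """Bipolar SC: v in [-1,1] encoded as P(bit=1) = (v+1)/2; XNOR multiplies."""
    sx = rng.random(n) < (x + 1) / 2
    sy = rng.random(n) < (y + 1) / 2
    return 2 * (~(sx ^ sy)).mean() - 1  # decode back to [-1, 1]

for n in (128, 1024, 8192):
    pairs = rng.uniform(-1, 1, (200, 2))
    errs = [sc_mul_bipolar(x, y, n) - x * y for x, y in pairs]
    print(n, np.sqrt(np.mean(np.square(errs))))  # RMSE falls roughly as 1/sqrt(n)
```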


Asia and South Pacific Design Automation Conference | 2017

Towards acceleration of deep convolutional neural networks using stochastic computing

Ji Li; Ao Ren; Zhe Li; Caiwen Ding; Bo Yuan; Qinru Qiu; Yanzhi Wang

In recent years, the Deep Convolutional Neural Network (DCNN) has become the dominant approach for almost all recognition and detection tasks and has outperformed humans on certain tasks. Nevertheless, high power consumption and complex topologies have hindered the widespread deployment of DCNNs, particularly in wearable devices and embedded systems with limited area and power budgets. This paper presents a fully parallel and scalable hardware-based DCNN design using Stochastic Computing (SC), which leverages the energy-accuracy trade-off by optimizing SC components in different layers. We first conduct a detailed investigation of the Approximate Parallel Counter (APC) based neuron and the multiplexer-based neuron using SC, and analyze the impact of various design parameters, such as bit-stream length and input count, on the energy/power/area/accuracy of the neuron cell. Then, from an architecture perspective, we study how the inaccuracy of neurons in different layers influences the overall DCNN accuracy (i.e., the software accuracy of the entire DCNN). Accordingly, a structure optimization method is proposed for a general DCNN architecture, in which neurons in different layers are implemented with optimized SC components, so as to reduce the area, power, and energy of the DCNN while maintaining the overall network performance in terms of accuracy. Experimental results show that the proposed approach can find a satisfactory DCNN configuration, which achieves 55X, 151X, and 2X improvements in terms of area, power, and energy, respectively, while the error is increased by 2.86%, compared with a conventional binary ASIC implementation.
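
A rough software model of the counter-based neuron discussed here, under assumed bipolar encoding; a real APC is an approximate adder tree, modeled below as an exact per-cycle popcount, and the fan-in and stream length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, FANIN = 1024, 16  # stream length and neuron input count (illustrative)

def encode(v, n=N):
    return rng.random(n) < (v + 1) / 2  # bipolar encoding of v in [-1, 1]

x = rng.uniform(-1, 1, FANIN)  # neuron inputs
w = rng.uniform(-1, 1, FANIN)  # synapse weights

xs = np.stack([encode(v) for v in x])  # (FANIN, N) input streams
ws = np.stack([encode(v) for v in w])  # (FANIN, N) weight streams
prods = ~(xs ^ ws)                     # XNOR: per-synapse product streams

# Parallel counter: each clock cycle, count the ones across the FANIN
# product lines; accumulating those counts recovers the inner product.
inner_sc = 2 * prods.sum() / N - FANIN
print(inner_sc, x @ w)  # SC estimate vs exact inner product
```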


Design, Automation, and Test in Europe | 2017

Structural design optimization for deep convolutional neural networks using stochastic computing

Zhe Li; Ao Ren; Ji Li; Qinru Qiu; Bo Yuan; Jeffrey Draper; Yanzhi Wang

Deep Convolutional Neural Networks (DCNNs) have been demonstrated to be effective models for understanding image content. Because of their deep structure, the computation behind DCNNs relies heavily on hardware resources, and DCNNs have been implemented on various large-scale computing platforms. However, there is a trend toward embedding DCNNs into lightweight local systems, which requires low power/energy consumption and small hardware footprints. Stochastic Computing (SC) radically simplifies the hardware implementation of arithmetic units and has the potential to satisfy the small-footprint, low-power needs of DCNNs, although local connectivity and down-sampling operations make DCNNs more complex to implement with SC. In this paper, eight feature extraction designs for DCNNs using SC, in two groups, are explored and optimized in detail from the perspective of calculation precision: we combine two SC implementations of the inner-product calculation, two down-sampling schemes, and two structures of DCNN neurons. For each of the eight feature extraction designs, we evaluate the resulting DCNN in terms of network accuracy and hardware performance. Through this exploration and optimization, the accuracy of SC-based DCNNs is preserved compared with software implementations on CPU/GPU and binary-based ASIC synthesis, while area, power, and energy are reduced by up to 776X, 190X, and 32835X, respectively.
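
One down-sampling scheme commonly compared in such SC designs is mux-based average pooling: randomly selecting one of the input streams each cycle yields a stream whose value is the mean of the inputs. A minimal unipolar sketch (window values and stream length are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8192  # bit-stream length (illustrative)

def encode(p, n=N):
    return rng.random(n) < p  # unipolar value p in [0, 1]

window = [encode(p) for p in (0.2, 0.4, 0.6, 0.8)]  # a 2x2 pooling window
select = rng.integers(0, 4, N)      # 4-to-1 MUX select line, uniform each cycle
pooled = np.choose(select, window)  # pick one input stream bit per cycle
print(pooled.mean())                # ~0.5 = average of the four inputs
```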


IEEE International Conference on Rebooting Computing (ICRC) | 2016

Designing reconfigurable large-scale deep learning systems using stochastic computing

Ao Ren; Zhe Li; Yanzhi Wang; Qinru Qiu; Bo Yuan

Deep Learning, an important branch of machine learning and neural networks, is playing an increasingly important role in fields such as computer vision and natural language processing. However, large-scale deep learning systems mainly run on high-performance server clusters, which restricts their extension to personal and mobile devices. The solution proposed in this paper takes advantage of the attractive features of stochastic computing. Stochastic computing is a data representation and processing technique that uses a binary bit stream to represent a probability number (by counting the number of ones in the stream). In stochastic computing, key arithmetic operations such as multiplications and additions can be implemented with very simple components such as AND gates and multiplexers, respectively. This provides an immense design space for integrating a large number of neurons and enabling fully parallel and scalable hardware implementations of large-scale deep learning systems. In this paper, we present a reconfigurable large-scale deep learning system based on stochastic computing technologies, including the design of the neuron, the convolution function, the back-propagation function, and other basic operations. A network-on-chip technique is also employed to achieve the goal of implementing a large-scale hardware system. Our experiments validate the functionality of reconfigurable deep learning systems using stochastic computing, and demonstrate that when the bit streams are 8192 bits long, classification of MNIST digits by stochastic computing achieves an error rate as low as that of conventional binary arithmetic.
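
The need for streams as long as 8192 bits follows from basic statistics: a stream is a Bernoulli sample, so the standard error of the decoded value scales as sqrt(p(1-p)/N). A quick empirical check (the probability value and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.7  # encoded probability value (illustrative)

for n in (256, 1024, 8192):
    # Decode 2000 independent streams of length n and measure the spread.
    estimates = (rng.random((2000, n)) < p).mean(axis=1)
    print(n, estimates.std(), np.sqrt(p * (1 - p) / n))  # empirical vs theory
```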


International Symposium on Neural Networks | 2017

Hardware-driven nonlinear activation for stochastic computing based deep convolutional neural networks

Ji Li; Zihao Yuan; Zhe Li; Caiwen Ding; Ao Ren; Qinru Qiu; Jeffrey Draper; Yanzhi Wang

Recently, Deep Convolutional Neural Networks (DCNNs) have made unprecedented progress, achieving accuracy close to, or even better than, human-level perception in various tasks. There is a timely need to map the latest software DCNNs to application-specific hardware in order to achieve orders-of-magnitude improvements in performance, energy efficiency, and compactness. Stochastic Computing (SC), a low-cost alternative to the conventional binary computing paradigm, has the potential to enable massively parallel and highly scalable hardware implementations of DCNNs. One major challenge in SC-based DCNNs is designing accurate nonlinear activation functions, which have a significant impact on network-level accuracy but cannot be implemented accurately by existing SC computing blocks. In this paper, we design and optimize SC-based neurons, and we propose highly accurate activation designs for the three most frequently used activation functions in software DCNNs, i.e., hyperbolic tangent, logistic, and rectified linear units. Experimental results on LeNet-5 using the MNIST dataset demonstrate that, compared with a binary ASIC hardware DCNN, the DCNN with the proposed SC neurons can achieve up to 61X, 151X, and 2X improvements in terms of area, power, and energy, respectively, at the cost of a small precision degradation. In addition, the SC approach achieves up to 21X and 41X improvements in area, 41X and 72X in power, and 198200X and 96443X in energy, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. ReLU activation is suggested for future SC-based DCNNs given its superior performance under small bit-stream lengths.
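
One classic way to realize a stochastic hyperbolic tangent, in the spirit of the activation designs here, is the FSM-based Stanh element (Brown and Card): a K-state saturating counter driven by the bipolar input stream outputs 1 while in its upper half of states, approximating tanh(Kx/2). A sketch, with K and the stream length chosen for illustration rather than taken from the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 8192, 8  # stream length and FSM state count (illustrative)

def stanh(x):
    """FSM-based stochastic tanh: approximates tanh(K * x / 2)."""
    bits = rng.random(N) < (x + 1) / 2  # bipolar input stream
    state, out = K // 2, np.empty(N, dtype=bool)
    for i, b in enumerate(bits):
        out[i] = state >= K // 2  # output 1 in the upper half of states
        state = min(K - 1, state + 1) if b else max(0, state - 1)
    return 2 * out.mean() - 1

for x in (-0.5, 0.0, 0.5):
    print(x, stanh(x), np.tanh(K / 2 * x))  # SC estimate vs target
```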


Great Lakes Symposium on VLSI | 2017

Softmax Regression Design for Stochastic Computing Based Deep Convolutional Neural Networks

Zihao Yuan; Ji Li; Zhe Li; Caiwen Ding; Ao Ren; Bo Yuan; Qinru Qiu; Jeffrey Draper; Yanzhi Wang

Recently, Deep Convolutional Neural Networks (DCNNs) have made tremendous advances, achieving accuracy close to, or even better than, human-level perception in various tasks. Stochastic Computing (SC), an alternative to the conventional binary computing paradigm, has the potential to enable massively parallel and highly scalable hardware implementations of DCNNs. In this paper, we design and optimize an SC-based Softmax Regression (SR) function. Experimental results show that, compared with a binary SR, the proposed SC-SR with longer bit streams reaches the same level of accuracy with improvements of 295X, 62X, and 2617X in terms of power, area, and energy, respectively. Binary SR is suggested for future DCNNs with short bit-stream inputs, whereas SC-SR is recommended for longer bit streams.
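
For reference, the softmax regression function that both the binary SR and SC-SR blocks realize is, in software, just a normalized exponential; a minimal, numerically stable sketch (the logits are illustrative):

```python
import numpy as np

def softmax(z):
    """Softmax: subtract the max first for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative class scores
print(softmax(logits))              # probabilities summing to 1
```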


International Conference on Acoustics, Speech, and Signal Processing | 2017

Ultra-fast robust compressive sensing based on memristor crossbars

Sijia Liu; Ao Ren; Yanzhi Wang; Pramod K. Varshney

In this paper, we propose a new approach to robust compressive sensing (CS) using memristor crossbars constructed from recently invented memristor devices. The attractive features of a memristor crossbar, such as high density, low power, and great scalability, make it a promising candidate for performing large-scale matrix operations. To apply memristor crossbars to a robust CS problem, the alternating direction method of multipliers (ADMM) is employed to split the original problem into subproblems that involve solving systems of linear equations. A system of linear equations can then be solved using memristor crossbars with O(1) time complexity. We also study the impact of hardware variations on the memristor-crossbar-based CS solver from both theoretical and practical points of view. The resulting overall complexity is O(n), an O(n^2.5) speed-up compared with the state-of-the-art software approach. Numerical results are provided to illustrate the effectiveness of the proposed CS solver.
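
To make the ADMM splitting concrete, here is a sketch using the standard l1-regularized least-squares (LASSO) formulation of CS as a stand-in for the paper's robust variant; the np.linalg.solve call is the linear-system subproblem that the memristor crossbar would answer in O(1). Dimensions, sparsity, and penalty parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, lam, rho = 60, 128, 0.1, 1.0

A = rng.standard_normal((m, n)) / np.sqrt(m)  # sensing matrix
x_true = np.zeros(n)                          # 8-sparse ground truth
x_true[rng.choice(n, 8, replace=False)] = rng.standard_normal(8)
b = A @ x_true + 0.01 * rng.standard_normal(m)  # noisy measurements

def soft(v, t):  # soft-thresholding: prox operator of the l1 norm
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

M = A.T @ A + rho * np.eye(n)  # fixed system matrix of the x-subproblem
z = np.zeros(n)
u = np.zeros(n)
for _ in range(100):
    # Linear solve: the step a memristor crossbar performs in O(1) time.
    x = np.linalg.solve(M, A.T @ b + rho * (z - u))
    z = soft(x + u, lam / rho)  # l1 proximal step
    u += x - z                  # dual update
print(np.linalg.norm(z - x_true))  # small residual: sparse signal recovered
```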


System on Chip Conference | 2016

Design of high-speed low-power polar BP decoder using emerging technologies

Ao Ren; Bo Yuan; Yanzhi Wang

With their provably asymptotically capacity-achieving property, polar codes have emerged as one of the potential channel code candidates for next-generation wireless communication systems. To date, numerous works have reported VLSI implementations of polar decoders. However, the efficient hardware design of polar decoders for real-time, energy-constrained mobile devices remains a huge challenge. This paper, for the first time, investigates high-speed, low-power polar decoder design using FinFET and near-threshold computing (NTC) technologies. Specifically, the hardware performance of a polar decoder with the belief propagation (BP) decoding approach is studied and evaluated. Synthesis results show that the joint use of these two emerging technologies leads to significant simultaneous reductions in both power consumption and decoding delay.
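
The core arithmetic in each BP processing element is the box-plus combination of log-likelihood ratios; hardware decoders like the one studied here typically use its min-sum approximation. A sketch comparing the exact and approximate forms (input values are illustrative; this is not the paper's specific architecture):

```python
import numpy as np

def boxplus_exact(a, b):
    """Exact LLR combination used in belief propagation."""
    return 2 * np.arctanh(np.tanh(a / 2) * np.tanh(b / 2))

def boxplus_minsum(a, b):
    """Hardware-friendly min-sum approximation of box-plus."""
    return np.sign(a) * np.sign(b) * np.minimum(np.abs(a), np.abs(b))

a = np.array([1.5, -0.8, 2.2])
b = np.array([0.4, 1.1, -3.0])
print(boxplus_exact(a, b))   # reference values
print(boxplus_minsum(a, b))  # close, slightly overconfident approximation
```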


Great Lakes Symposium on VLSI | 2018

Structured Weight Matrices-Based Hardware Accelerators in Deep Neural Networks: FPGAs and ASICs

Caiwen Ding; Ao Ren; Geng Yuan; Xiaolong Ma; Jiayu Li; Ning Liu; Bo Yuan; Yanzhi Wang

Both industry and academia have extensively investigated hardware acceleration of deep neural networks (DNNs). To address the demands of increasing computational capability and memory requirements, in this work we propose the structured weight matrices (SWM) based compression technique for both Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations. On the algorithm side, the SWM-based framework adopts block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. The SWM-based technique reduces computational complexity from O(n^2) to O(n log n) and storage complexity from O(n^2) to O(n) for each layer, in both the training and inference phases. For FPGA implementations of deep convolutional neural networks (DCNNs), we achieve at least 152X and 72X improvements in performance and energy efficiency, respectively, using the SWM-based framework, compared with the baseline IBM TrueNorth processor under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 datasets. For FPGA implementations of long short-term memory (LSTM) networks, the proposed SWM-based LSTM achieves up to a 21X enhancement in performance and 33.5X gains in energy efficiency compared with the ESE accelerator. For ASIC implementations, the proposed SWM-based design exhibits impressive advantages in terms of power, throughput, and energy efficiency. These results indicate that the method is well suited to deploying DNNs on both FPGAs and mobile/IoT devices.
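
The O(n^2) to O(n log n) reduction comes from the fact that a circulant matrix-vector product is a circular convolution, which FFTs compute directly. A minimal sketch of one block's worth of the SWM computation (the block size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8                       # circulant block size (illustrative)
c = rng.standard_normal(n)  # first column defines the whole circulant block
x = rng.standard_normal(n)  # input activations for this block

# Dense O(n^2) reference: C[i, j] = c[(i - j) mod n].
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
y_dense = C @ x

# FFT-based O(n log n) product: circulant multiply == circular convolution.
y_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))
print(np.allclose(y_dense, y_fft))  # True: same result, cheaper computation
```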

Collaboration


Dive into Ao Ren's collaborations.

Top Co-Authors

Bo Yuan
City University of New York

Zhe Li
Syracuse University

Ji Li
University of Southern California

Jeffrey Draper
University of Southern California