Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Zidong Du is active.

Publication


Featured research published by Zidong Du.


Architectural Support for Programming Languages and Operating Systems | 2014

DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

Tianshi Chen; Zidong Du; Ninghui Sun; Jia Wang; Chueh-Hung Wu; Yunji Chen; Olivier Temam

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
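
The headline 452 GOP/s counts the two operations named above: synaptic weight multiplications and neuron output additions. A minimal Python sketch of that core computation in 16-bit fixed point follows; the fractional-bit format and layer sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fixed_point_layer(inputs, weights, frac_bits=8):
    """Sketch of the key NN operation DianNao accelerates: synaptic weight
    multiplications followed by neuron output additions, in 16-bit fixed
    point (frac_bits is an assumed format, not the paper's)."""
    scale = 1 << frac_bits
    # Quantize operands to int16, as the accelerator works on fixed-point data.
    x_q = np.clip(np.round(inputs * scale), -32768, 32767).astype(np.int16)
    w_q = np.clip(np.round(weights * scale), -32768, 32767).astype(np.int16)
    # Multiply-accumulate in a wider accumulator, then rescale back.
    acc = x_q.astype(np.int64) @ w_q.astype(np.int64).T
    return acc / (scale * scale)

# Example: one 4-neuron layer over an 8-element input.
x = np.random.rand(8)
w = np.random.rand(4, 8)
print(fixed_point_layer(x, w))
```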


International Symposium on Computer Architecture | 2015

ShiDianNao: shifting vision processing closer to the sensor

Zidong Du; Robert Fasthuber; Tianshi Chen; Paolo Ienne; Ling Fei Li; Tao Luo; Xiaobing Feng; Yunji Chen; Olivier Temam

In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are Convolutional Neural Networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses combined with a careful exploitation of the specific data access patterns within CNNs allows us to design an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and consuming only 320 mW, but still about 30x faster than high-end GPUs.
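
The weight-sharing property is what makes the all-SRAM mapping plausible: a convolutional layer's weight footprint depends only on kernel size and channel counts, not on image size. A quick Python sketch with made-up layer sizes (not the paper's benchmarks) illustrates the gap against a fully connected layer.

```python
def conv_weight_bytes(k, c_in, c_out, bytes_per_weight=2):
    """Weights of a k x k convolution are shared across all output
    positions, so the footprint is independent of the image size."""
    return k * k * c_in * c_out * bytes_per_weight

def fc_weight_bytes(n_in, n_out, bytes_per_weight=2):
    """A fully connected layer needs one weight per input/output pair."""
    return n_in * n_out * bytes_per_weight

# Illustrative comparison (sizes are assumptions, not from the paper):
conv = conv_weight_bytes(5, 8, 16)          # 6,400 bytes: fits in a small SRAM
fc   = fc_weight_bytes(32 * 32 * 8, 1024)   # ~16.8 MB: would force DRAM traffic
print(conv, fc)
```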


Asia and South Pacific Design Automation Conference | 2014

Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators

Zidong Du; Avinash Lingamneni; Yunji Chen; Krishna V. Palem; Olivier Temam; Chueh-Hung Wu

In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for reducing energy consumption in many applications that can tolerate a degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings (in energy consumed, critical-path delay, and silicon area), this approach has been limited to certain application domains. In this paper, we propose to expand the application scope, error tolerance, and energy savings of inexact computing systems through neural network architectures. Such neural networks are fast emerging as popular candidate accelerators for future heterogeneous multi-core platforms, and have flexible error tolerance limits owing to their ability to be trained. Our results based on simulated 65 nm technology designs demonstrate that the proposed inexact neural network accelerator could achieve 43.91%-62.49% savings in energy consumption (with corresponding delay and area savings of 18.79% and 31.44%, respectively) when compared to an existing baseline neural network implementation, at the cost of an accuracy loss, quantified as the Mean Square Error (MSE), which increases from 0.14 to 0.20 on average.
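
To make the accuracy-for-energy trade concrete, here is a minimal Python sketch that emulates one classic inexact-computing technique, truncating low-order result bits of a fixed-point multiplier, and measures the error it introduces. The bit widths are assumptions for illustration; the paper's inexact circuits differ in detail.

```python
import numpy as np

def truncated_mul(a, b, drop_bits=6, frac_bits=12):
    """Emulate an inexact fixed-point multiplier by discarding the
    drop_bits least significant result bits (a classic inexact-computing
    trick that shrinks the multiplier; the paper's circuits differ)."""
    scale = 1 << frac_bits
    prod = (np.round(a * scale).astype(np.int64) *
            np.round(b * scale).astype(np.int64))
    prod >>= drop_bits            # truncation: cheaper hardware, small error
    return (prod << drop_bits) / (scale * scale)

rng = np.random.default_rng(0)
x, w = rng.random(1000), rng.random(1000)
exact = x * w
approx = truncated_mul(x, w)
print("MSE introduced by truncation:", np.mean((exact - approx) ** 2))
```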


International Symposium on Computer Architecture | 2016

Cambricon: an instruction set architecture for neural networks

Shaoli Liu; Zidong Du; Jinhua Tao; Dong Han; Tao Luo; Yuan Xie; Yunji Chen; Tianshi Chen

Neural Networks (NNs) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65 nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.
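
As a flavor of what such a load-store NN ISA looks like, here is a hypothetical five-instruction sequence for a single fully connected layer, together with a tiny Python model of its semantics. The mnemonics echo the paper's vector/matrix instruction style, but the exact syntax and encoding shown here are illustrative assumptions, not the real Cambricon ISA.

```python
# Hypothetical assembly for one MLP layer on a Cambricon-style load-store ISA:
#
#   VLOAD   V0, [x_addr], 64      ; load 64 input neurons into a vector register
#   MLOAD   M0, [w_addr], 64x32   ; load a 64x32 synaptic weight matrix
#   MMV     V1, M0, V0            ; matrix-mult-vector: V1 = M0^T * V0
#   VGTM    V1, V1, 0             ; vector greater-than-merge: elementwise ReLU
#   VSTORE  V1, [y_addr], 32      ; store 32 output neurons
#
# A minimal Python model of those five instructions:
import numpy as np

def run_layer(x, W):
    V0 = x                      # VLOAD
    M0 = W                      # MLOAD
    V1 = M0.T @ V0              # MMV
    V1 = np.maximum(V1, 0.0)    # VGTM (ReLU as an elementwise max against 0)
    return V1                   # VSTORE

print(run_layer(np.random.rand(64), np.random.rand(64, 32)))
```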


International Symposium on Microarchitecture | 2016

Cambricon-x: an accelerator for sparse neural networks

Shijin Zhang; Zidong Du; Lei Zhang; Huiying Lan; Shaoli Liu; Ling Li; Qi Guo; Tianshi Chen; Yunji Chen

Neural networks (NNs) have been demonstrated to be useful in a broad range of applications such as image recognition, automatic translation, and advertisement recommendation. State-of-the-art NNs are known to be both computationally and memory intensive, due to their ever-deeper structure, i.e., multiple layers with massive neurons and connections (i.e., synapses). Sparse neural networks have emerged as an effective solution to reduce the amount of computation and memory required. Though existing NN accelerators are able to efficiently process dense and regular networks, they cannot benefit from the reduction of synaptic weights. In this paper, we propose a novel accelerator, Cambricon-X, to exploit the sparsity and irregularity of NN models for increased efficiency. The proposed accelerator features a PE-based architecture consisting of multiple Processing Elements (PEs). An Indexing Module (IM) efficiently selects and transfers needed neurons to connected PEs with reduced bandwidth requirements, while each PE stores irregular and compressed synapses for local computation in an asynchronous fashion. With 16 PEs, our accelerator is able to achieve up to 544 GOP/s in a small form factor (6.38 mm² and 954 mW at 65 nm). Experimental results over a number of representative sparse networks show that our accelerator achieves, on average, a 7.23x speedup and 6.43x energy saving over the state-of-the-art NN accelerator.
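
The core idea, gathering only the neurons that still have a synapse after pruning and multiply-accumulating against the compressed weight array, fits in a few lines. The Python sketch below is a functional illustration of that indexed gather; the data layout and names are simplifications of the hardware.

```python
import numpy as np

def sparse_neuron_output(neurons, syn_values, syn_indices):
    """Sketch of what Cambricon-X's Indexing Module enables: gather only
    the input neurons with a surviving (non-pruned) synapse, then
    multiply-accumulate against the compressed weight array in a PE."""
    gathered = neurons[syn_indices]        # IM: select the needed neurons
    return np.dot(gathered, syn_values)    # PE: local MAC over compressed data

neurons = np.random.rand(256)
# After pruning, this output neuron keeps only 5 of its 256 synapses:
syn_indices = np.array([3, 17, 42, 101, 200])
syn_values  = np.random.rand(5)
print(sparse_neuron_output(neurons, syn_values, syn_indices))
```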


International Symposium on Microarchitecture | 2015

Neuromorphic accelerators: a comparison between neuroscience and machine-learning approaches

Zidong Du; Daniel David Ben-Dayan Rubin; Yunji Chen; Liqiang He; Tianshi Chen; Lei Zhang; Chueh-Hung Wu; Olivier Temam

A vast array of devices, ranging from industrial robots to self-driving cars or smartphones, require increasingly sophisticated processing of real-world input data (image, voice, radio, …). Interestingly, hardware neural network accelerators are emerging again as attractive candidate architectures for such tasks. The neural network algorithms considered come from two, largely separate, domains: machine learning and neuroscience. These neural networks have very different characteristics, so it is unclear which approach should be favored for hardware implementation. Yet, few studies compare them from a hardware perspective. We implement both types of networks down to the layout, and we compare the relative merit of each approach in terms of energy, speed, area cost, accuracy, and functionality. Within the limits of our study (current SNN and machine-learning NN algorithms, our current best-effort hardware implementations, and the workloads used in this study), our analysis helps dispel the notion that hardware neural network accelerators inspired by neuroscience, such as SNN+STDP, are currently a competitive alternative to hardware neural network accelerators inspired by machine learning, such as MLP+BP: not only in terms of accuracy, but also in terms of hardware cost for realistic implementations, which is less expected. However, we also outline that SNN+STDP carries potential for reduced hardware cost compared to machine-learning networks at very large scales, if accuracy issues can be controlled (or for applications where they are less important). We also identify the key sources of inaccuracy of SNN+STDP, which are less related to the loss of information due to spike coding than to the nature of the STDP learning algorithm. Finally, we outline that for the category of applications which require permanent online learning and moderate accuracy, SNN+STDP hardware accelerators could be a very cost-efficient solution.
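
For readers unfamiliar with the neuroscience side of the comparison, here is a minimal Python sketch of the pair-based STDP rule the paper pits against backpropagation: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one, and weakened otherwise. The time constants and learning rates are illustrative assumptions.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Pair-based STDP: potentiate when the presynaptic spike precedes
    the postsynaptic one, depress otherwise. Constants are illustrative."""
    dt = t_post - t_pre
    if dt > 0:   # pre before post: strengthen the synapse
        w += a_plus * np.exp(-dt / tau)
    else:        # post before pre: weaken the synapse
        w -= a_minus * np.exp(dt / tau)
    return np.clip(w, 0.0, 1.0)

print(stdp_update(0.5, t_pre=10.0, t_post=15.0))  # potentiation
print(stdp_update(0.5, t_pre=15.0, t_post=10.0))  # depression
```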


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

Leveraging the Error Resilience of Neural Networks for Designing Highly Energy Efficient Accelerators

Zidong Du; Avinash Lingamneni; Yunji Chen; Krishna V. Palem; Olivier Temam; Chueh-Hung Wu

In recent years, inexact computing has been increasingly regarded as one of the most promising approaches for slashing energy consumption in many applications that can tolerate a certain degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings (in energy consumed, critical-path delay, and silicon area), this approach has so far been limited to application-specific integrated circuits (ASICs). These ASIC realizations have a narrow application scope and, as currently designed, are often rigid in their tolerance to inaccuracy, which often determines the extent of resource savings that can be achieved. In this paper, we propose to improve the application scope, error resilience, and energy savings of inexact computing by combining it with hardware neural networks. These neural networks are fast emerging as popular candidate accelerators for future heterogeneous multicore platforms and have flexible error resilience limits owing to their ability to be trained. Our results in 65 nm technology demonstrate that the proposed inexact neural network accelerator could achieve 1.78-2.67× savings in energy consumption (with corresponding delay and area savings of 1.23× and 1.46×, respectively) when compared to the existing baseline neural network implementation, at the cost of a small accuracy loss (mean squared error increases from 0.14 to 0.20 on average).
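
Note that these factors restate the percentage savings quoted in the earlier conference version of this work above: a reduction factor f corresponds to (1 - 1/f) savings. A short Python check makes the correspondence explicit; the small residual differences come from rounding in the published figures.

```python
def factor_to_savings(factor):
    """Convert an energy-reduction factor (e.g. 1.78x) into the
    percentage-savings form used in the conference version."""
    return (1.0 - 1.0 / factor) * 100.0

# 1.78x and 2.67x correspond, up to rounding, to the 43.91%-62.49%
# savings range quoted in the earlier paper:
print(f"{factor_to_savings(1.78):.2f}%")  # ~43.82%
print(f"{factor_to_savings(2.67):.2f}%")  # ~62.55%
```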


IEEE Micro | 2015

A High-Throughput Neural Network Accelerator

Tianshi Chen; Zidong Du; Ninghui Sun; Jia Wang; Chueh-Hung Wu; Yunji Chen; Olivier Temam

The authors designed an accelerator architecture for large-scale neural networks, with an emphasis on the impact of memory on accelerator design, performance, and energy. In this article, they present a concrete design at 65 nm that can perform 496 16-bit fixed-point operations in parallel every 1.02 ns, that is, 452 GOP/s, in a 3.02 mm², 485 mW footprint (excluding main memory accesses).


Design, Automation, and Test in Europe | 2015

Retraining-based timing error mitigation for hardware neural networks

Jiachao Deng; Yuntan Fang; Zidong Du; Ying Wang; Huawei Li; Olivier Temam; Paolo Ienne; David Novo; Xiaowei Li; Yunji Chen; Chueh-Hung Wu

Recently, neural network (NN) accelerators have been gaining popularity as part of future heterogeneous multi-core architectures due to their broad application scope and excellent energy efficiency. Additionally, since neural networks can be retrained, they are inherently resilient to errors and noise. Prior work has utilized this error tolerance to design approximate neural network circuits or to tolerate logical faults. However, besides high-level faults or noise, timing errors induced by delay faults, process variations, aging, etc. dominate the reliability of NN accelerators under nanoscale manufacturing processes. In this paper, we leverage the error resilience of neural networks to mitigate timing errors in NN accelerators. Specifically, when timing errors significantly affect the output results, we propose to retrain the accelerators to update their weights, thus circumventing critical timing errors. Experimental results show that timing errors in NN accelerators can be effectively mitigated for different applications.
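
The retraining idea can be prototyped in a few lines: inject timing-error-like corruptions into the forward pass and run gradient descent through the faulty circuit so the weights learn to compensate. The Python sketch below uses a crude random-corruption fault model and the fault-free gradient as a straight-through approximation; both are illustrative assumptions, not the paper's fault model.

```python
import numpy as np

rng = np.random.default_rng(1)
X, y = rng.random((200, 8)), rng.random(200)
w = rng.random(8) * 0.1

def faulty_forward(X, w, flip_prob=0.05, error_scale=0.5):
    """Crude stand-in for timing errors: occasionally corrupt the
    weighted sum, as a delayed path might latch a wrong value."""
    z = X @ w
    faults = rng.random(z.shape) < flip_prob
    return np.where(faults, z * error_scale, z)

# Retraining loop: gradient descent over the *faulty* circuit, using the
# fault-free gradient as an approximation, so the updated weights
# compensate for the injected timing errors.
for _ in range(500):
    z = faulty_forward(X, w)
    grad = X.T @ (z - y) / len(y)
    w -= 0.1 * grad
print("post-retraining MSE:", np.mean((faulty_forward(X, w) - y) ** 2))
```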


ACM Transactions on Computer Systems | 2015

A Small-Footprint Accelerator for Large-Scale Neural Networks

Tianshi Chen; Shijin Zhang; Shaoli Liu; Zidong Du; Tao Luo; Yuan Gao; Junjie Liu; Dongsheng Wang; Chueh-Hung Wu; Ninghui Sun; Yunji Chen; Olivier Temam

Machine-learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multicores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87× faster, and it can reduce the total energy by 21.08×. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the use of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

Collaboration


Dive into Zidong Du's collaborations.

Top Co-Authors

Yunji Chen (Chinese Academy of Sciences)
Tianshi Chen (Chinese Academy of Sciences)
Chueh-Hung Wu (National Taiwan University)
Shaoli Liu (Chinese Academy of Sciences)
Lei Zhang (Chinese Academy of Sciences)
Tao Luo (Chinese Academy of Sciences)
Huiying Lan (Chinese Academy of Sciences)
Ling Li (Chinese Academy of Sciences)
Ninghui Sun (Chinese Academy of Sciences)