Publication


Featured research published by Leibin Ni.


IEEE Transactions on Nanotechnology | 2015

An Energy-Efficient Nonvolatile In-Memory Computing Architecture for Extreme Learning Machine by Domain-Wall Nanowire Devices

Yuhao Wang; Hao Yu; Leibin Ni; Guang-Bin Huang; Mei Yan; Chuliang Weng; Wei Yang; Junfeng Zhao

Data-oriented applications place increasing demands on memory capacity and bandwidth, which calls for rethinking the architecture of current computing platforms. The logic-in-memory architecture is a highly promising logic-memory integration paradigm for high-throughput data-driven applications. On the memory technology side, the recently introduced nonvolatile domain-wall nanowire (or racetrack) device shows potential not only as a future power-efficient memory but also as a computing element, owing to its unique spintronic physics. This paper explores a novel distributed in-memory computing architecture in which most logic functions are executed within the memory, which significantly alleviates bandwidth congestion and improves energy efficiency. The proposed architecture is built purely from domain-wall nanowires, i.e., both memory and logic are implemented by domain-wall nanowire devices. As a case study, a neural network-based image resolution enhancement algorithm, called DW-NN, is examined within the proposed architecture. We show that all operations involved in machine learning on a neural network can be mapped to a logic-in-memory architecture built from nonvolatile domain-wall nanowires. Domain-wall nanowire-based logic is customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that the domain-wall memory reduces leakage power by 92% and dynamic power by 16% compared with main memory implemented in DRAM, and that domain-wall logic reduces dynamic power by 31% and leakage power by 65% at similar performance compared with CMOS transistor-based logic. System throughput in DW-NN is improved by 11.6x and energy efficiency by 56x compared with a conventional image processing system.


Asia and South Pacific Design Automation Conference | 2016

An energy-efficient matrix multiplication accelerator by distributed in-memory computing on binary RRAM crossbar

Leibin Ni; Yuhao Wang; Hao Yu; Wei Yang; Chuliang Weng; Junfeng Zhao

Emerging resistive random-access memory (RRAM) provides not only non-volatile memory storage but also intrinsic logic for matrix-vector multiplication, which makes it ideal for low-power, high-throughput data analytics accelerators that operate in memory. However, existing RRAM-based computing devices mostly assume multi-level analog computing, whose results are sensitive to process non-uniformity and which incurs additional AD-conversion and I/O overhead. This paper explores a data analytics accelerator on a binary RRAM crossbar. Accordingly, a distributed in-memory computing architecture is proposed, together with the design of the corresponding components and control protocol. Both the memory array and the logic accelerator can be implemented by RRAM crossbars purely in binary, where logic-memory pairs are distributed under a control-bus protocol. Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area compared with the same design implemented as a CMOS-based ASIC.
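As a rough illustration of how a multi-bit matrix-vector product can be reduced to purely binary operations, the sketch below splits unsigned integers into bit planes and combines binary (AND-and-count) products with power-of-two weights. This is a generic decomposition written for this page, not the mapping used in the paper; the function names and the 4-bit width are assumptions.

```python
def binary_matvec(bit_rows, bit_vec):
    """Multiply a 0/1 matrix by a 0/1 vector using only AND and counting."""
    return [sum(m & x for m, x in zip(row, bit_vec)) for row in bit_rows]


def bit_plane(values, k):
    """Extract bit k (0 = least significant) of each non-negative integer."""
    return [(v >> k) & 1 for v in values]


def matvec_via_bit_planes(matrix, vec, bits=4):
    """Compute matrix @ vec for unsigned `bits`-bit integers from binary products."""
    result = [0] * len(matrix)
    for i in range(bits):              # bit position within the matrix entries
        rows_i = [bit_plane(row, i) for row in matrix]
        for j in range(bits):          # bit position within the vector entries
            partial = binary_matvec(rows_i, bit_plane(vec, j))
            for r, p in enumerate(partial):
                result[r] += p << (i + j)   # weight the binary product by 2^(i+j)
    return result


matrix = [[3, 5, 7], [1, 0, 15]]
vec = [2, 4, 6]
print(matvec_via_bit_planes(matrix, vec))  # [68, 92]
```

Each call to `binary_matvec` stands in for one pass over a binary crossbar; the shift-and-add loop around it is ordinary digital logic.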


IEEE Transactions on Information Forensics and Security | 2016

DW-AES: A Domain-Wall Nanowire-Based AES for High Throughput and Energy-Efficient Data Encryption in Non-Volatile Memory

Yuhao Wang; Leibin Ni; Chip-Hong Chang; Hao Yu

Big-data storage poses significant challenges to the anonymization of sensitive information against data sniffing. Not only is the encryption bandwidth limited by the I/O traffic, but the transfer of data between the processor and the memory also exposes the input-output mapping of intermediate computations on I/O channels that are susceptible to semi-invasive and non-invasive attacks. Limited by simplistic cell-level logic, existing logic-in-memory computing architectures are incapable of performing the complete encryption process within the memory at reasonable throughput and energy efficiency. In this paper, a block-level in-memory architecture for the Advanced Encryption Standard (AES) is proposed. The proposed technique, called DW-AES, maps all AES operations directly to the domain-wall nanowires. The entire encryption process can be completed within a homogeneous, high-density, and standby-power-free non-volatile spintronic memory array without exposing the intermediate results to the external I/O interface. Domain-wall nanowire-based pipelining and multi-issue pipelining methods are also proposed to increase the throughput of the baseline DW-AES with an insignificant area overhead and a negligible difference in leakage power and energy consumption. The experimental results show that DW-AES can reduce the leakage power and area by orders of magnitude compared with existing CMOS ASIC accelerators. It has an energy efficiency of 22 pJ/b, which is 5× and 3× better than the CMOS ASIC and memristive CMOL-based implementations, respectively. Under the same area budget, the proposed DW-AES achieves 4.6× higher throughput than the latest CMOS ASIC AES with similar power consumption. The throughput improvement increases to 11× for pipelined DW-AES at the expense of doubling the power consumption.


ACM Journal on Emerging Technologies in Computing Systems | 2017

Distributed In-Memory Computing on Binary RRAM Crossbar

Leibin Ni; Hantao Huang; Zichuan Liu; Rajiv V. Joshi; Hao Yu

Emerging resistive random-access memory (RRAM) provides not only non-volatile memory storage but also intrinsic logic for matrix-vector multiplication, which makes it ideal for low-power, high-throughput data analytics accelerators that operate in memory. However, existing RRAM-based computing devices mostly assume multi-level analog computing, whose results are sensitive to process non-uniformity and which incurs additional AD-conversion and I/O overhead. This paper explores a data analytics accelerator on a binary RRAM crossbar. Accordingly, a distributed in-memory computing architecture is proposed, together with the design of the corresponding components and control protocol. Both the memory array and the logic accelerator can be implemented by RRAM crossbars purely in binary, where logic-memory pairs are distributed under a control-bus protocol. Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area compared with the same design implemented as a CMOS-based ASIC.


International Symposium on Low Power Electronics and Design | 2015

Optimizing Boolean embedding matrix for compressive sensing in RRAM crossbar

Yuhao Wang; Xin Li; Hao Yu; Leibin Ni; Wei Yang; Chuliang Weng; Junfeng Zhao

The emerging resistive random-access memory (RRAM) crossbar provides an intrinsic fabric for matrix-vector multiplication, which can be leveraged as power-efficient linear embedding hardware for data analytics such as compressive sensing. Because the matrix elements are represented by the resistance of RRAM cells, the limited RRAM programming resolution imposes constraints on the embedding matrix. A random Boolean embedding can be efficiently mapped to the RRAM crossbar but suffers from poor performance. Learning-based embedding matrices can deliver optimized performance but are continuous-valued, which prevents them from being mapped to the RRAM crossbar structure directly. In this paper, we propose an algorithm that finds an optimal Boolean embedding matrix for a given learned real-valued embedding matrix, so that it can be effectively mapped to the RRAM crossbar structure while high performance is preserved. The numerical experiments demonstrate that the proposed optimized Boolean embedding reduces the embedding distortion by 2.7x and the image recovery error by 2.5x compared with the random Boolean embedding, both mapped onto an RRAM crossbar. In addition, the optimized Boolean embedding on an RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area, with a slight accuracy penalty, compared with the optimized real-valued embedding on a CMOS ASIC platform.


IEEE Journal on Exploratory Solid-State Computational Devices and Circuits | 2017

An Energy-Efficient Digital ReRAM-Crossbar-Based CNN With Bitwise Parallelism

Leibin Ni; Zichuan Liu; Hao Yu; Rajiv V. Joshi

There is great interest in developing hardware accelerators for convolutional neural networks (CNNs) with better energy efficiency and throughput than GPUs. Existing solutions have relatively limited parallelism as well as large power consumption (including leakage power). In this paper, we present a resistive random-access memory (ReRAM)-accelerated CNN that achieves significantly higher throughput and energy efficiency when the CNN is trained with binary constraints on both weights and activations and is then mapped onto a digital ReRAM crossbar. We propose an optimized accelerator architecture tailored for bitwise convolution that features massive parallelism with high energy efficiency. Numerical experiment results show that the binary CNN accelerator on a digital ReRAM crossbar achieves a peak throughput of 792 GOPS at a power consumption of 4.5 mW, which is 1.61 times faster and 296 times more energy-efficient than a high-end GPU.
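To make "bitwise convolution" concrete, the following minimal sketch shows the standard XNOR-and-popcount identity for dot products of +1/-1 vectors, applied as a 1-D sliding window. It illustrates the general binary-CNN arithmetic only and is not the accelerator design from the paper; all function names are hypothetical.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int: bit i is 1 when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_word ^ b_word) & mask    # bit is 1 where the two signs match
    matches = bin(agree).count("1")      # popcount
    return 2 * matches - n               # matches minus mismatches


def binary_conv1d(signal, kernel):
    """Sliding-window (cross-correlation) of +1/-1 sequences using binary_dot."""
    n = len(kernel)
    k_word = pack_bits(kernel)
    return [binary_dot(pack_bits(signal[i:i + n]), k_word, n)
            for i in range(len(signal) - n + 1)]


signal = [1, -1, -1, 1, 1, -1]
kernel = [1, -1, 1]
print(binary_conv1d(signal, kernel))  # [1, 1, -1, -1]
```

Because every multiply collapses to one XNOR and every accumulation to a bit count, many such windows can be evaluated in parallel, which is the property the abstract relies on.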


International Symposium on Circuits and Systems | 2016

On-line machine learning accelerator on digital RRAM-crossbar

Leibin Ni; Hantao Huang; Hao Yu

On-line machine learning has become a necessity for future data analytics. This work presents an ℓ2-norm-based hardware solver for on-line machine learning that can significantly reduce training time compared with the traditional gradient-based solution using backward propagation. We show that the intensive matrix-vector multiplication in the ℓ2-norm solution can be mapped onto a distributed in-memory accelerator using the recent resistive switching random-access memory (RRAM) device. A digitized matrix-vector multiplication accelerator is developed based on the distributed RRAM crossbar. Such a distributed RRAM-crossbar architecture can run the reformulated ℓ2-norm solver as a scalable and energy-efficient solution for real-time training and testing in image recognition. Experimental results show that a significant speedup can be achieved for matrix-vector multiplication in the ℓ2-norm solver, such that both the overall training time and the testing time are reduced. In addition, a large energy saving can be achieved compared with the traditional CMOS-based out-of-memory computing architecture.
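For readers unfamiliar with ℓ2-norm (least-squares) training, the sketch below fits output weights in closed form by solving the regularized normal equations with a Cholesky factorization. It is a generic textbook formulation under stated assumptions, not the reformulated solver from the paper; the names `ridge_fit`, `solve_spd`, and the parameter `lam` are hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L @ L^T == a for a symmetric positive-definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def solve_spd(a, b):
    """Solve a @ x = b using the Cholesky factor of a."""
    l = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: L y = b
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: L^T x = y
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def ridge_fit(h, t, lam=1e-3):
    """Minimize ||h w - t||^2 + lam ||w||^2 via (h^T h + lam I) w = h^T t."""
    n = len(h[0])
    gram = [[sum(row[i] * row[j] for row in h) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(row[i] * ti for row, ti in zip(h, t)) for i in range(n)]
    return solve_spd(gram, rhs)


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # feature (hidden-layer) matrix
t = [1.0, 2.0, 3.0]                        # targets
w = ridge_fit(h, t, lam=0.0)               # consistent system, so w is close to [1, 2]
print(w)
```

Almost all of the work in `ridge_fit` and `solve_spd` is multiply-accumulate over rows and columns, which is the kind of operation the abstract proposes to offload to the crossbar.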


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016

A Zonotoped Macromodeling for Eye-Diagram Verification of High-Speed I/O Links With Jitter and Parameter Variations

Leibin Ni; P D Sai Manoj; Yang Song; Chenjie Gu; Hao Yu

It is challenging to efficiently evaluate the performance bounds of high-precision analog circuits with input and parameter variations at the nanoscale. Using zonotopes to model the uncertainty of the input data pattern (or jitter) and of multiple parameters, this paper develops a reachability-based verification to compute the worst-case eye diagram. The proposed zonotope-based reachability analysis can consider both spatial and temporal variations in a single simulation. Moreover, a nonlinear zonotoped macromodel is developed to reduce the computational complexity. Performance bounds for I/O links under parameter variations are evaluated. In addition, eye diagrams are generated by the proposed zonotoped macromodel for performance evaluation considering both temporal and spatial variations. As shown by experiments, the zonotoped macromodel achieves up to 450× speedup over Monte Carlo simulation of the original model, within a small error under the specified macromodel order, for eye-diagram verification of high-speed I/O links.
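The basic zonotope operations behind this kind of reachability analysis can be sketched as follows. A zonotope is taken here as the set {c + sum_i xi_i g_i, with each xi_i in [-1, 1]}, stored as a center `c` and a list of generators `g_i`. The example propagates one linear step x' = A x + u with an uncertain input and reads off worst-case bounds; the matrices and values are made up for illustration and the nonlinear macromodel from the paper is not represented.

```python
def linear_map(matrix, center, generators):
    """Image of a zonotope under x -> matrix @ x: map the center and each generator."""
    def apply(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return apply(center), [apply(g) for g in generators]


def minkowski_sum(z1, z2):
    """Sum of two zonotopes: add the centers and concatenate the generator lists."""
    (c1, g1), (c2, g2) = z1, z2
    return [a + b for a, b in zip(c1, c2)], g1 + g2


def interval_hull(center, generators):
    """Tightest axis-aligned box: center +/- sum of absolute generator entries."""
    radius = [sum(abs(g[k]) for g in generators) for k in range(len(center))]
    return [(c - r, c + r) for c, r in zip(center, radius)]


# Hypothetical state set, system matrix, and input uncertainty.
state = ([1.0, 2.0], [[0.5, 0.0], [0.0, 0.25]])
a_matrix = [[0.5, 0.25], [0.0, 0.5]]
input_set = ([0.0, 0.0], [[0.125, 0.125]])

next_state = minkowski_sum(linear_map(a_matrix, *state), input_set)
print(interval_hull(*next_state))  # [(0.5625, 1.4375), (0.75, 1.25)]
```

One pass over the set replaces many sampled trajectories, which is the intuition behind the speedup over Monte Carlo simulation reported above.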


Archive | 2017

Distributed In-Memory Computing on Binary Memristor-Crossbar for Machine Learning

Hao Yu; Leibin Ni; Hantao Huang

The recently emerging memristor provides not only non-volatile memory storage but also intrinsic computing for matrix-vector multiplication, which makes it ideal for low-power, high-throughput data analytics accelerators that operate in memory. However, existing memristor-crossbar-based computing mostly assumes multi-level analog computing, whose results are sensitive to process non-uniformity and which incurs additional overhead from AD-conversion and I/O. In this chapter, we explore a matrix-vector multiplication accelerator on a binary memristor crossbar with adaptive 1-bit-comparator-based parallel conversion. Moreover, a distributed in-memory computing architecture is developed with a corresponding control protocol. Both the memory array and the logic accelerator are implemented on the binary memristor crossbar, where logic-memory pairs are distributed under a control-bus protocol. Experimental results show that, compared with the analog memristor crossbar, the proposed binary memristor crossbar achieves significant area savings with better calculation accuracy. Moreover, a significant speedup can be achieved for matrix-vector multiplication in neural-network-based machine learning, such that both the overall training time and the testing time are reduced. In addition, a large energy saving can be achieved compared with the traditional CMOS-based out-of-memory computing architecture.


Design, Automation, and Test in Europe | 2015

An energy-efficient non-volatile in-memory accelerator for sparse-representation based face recognition

Yuhao Wang; Hantao Huang; Leibin Ni; Hao Yu; Mei Yan; Chuliang Weng; Wei Yang; Junfeng Zhao

Data analytics such as face recognition involves large volumes of image data and hence poses a grand challenge for mobile platform design under strict power requirements. Emerging non-volatile STT-MRAM has minimal leakage power and speed comparable to SRAM, and hence is considered a promising candidate for data-oriented mobile computing. However, STT-MRAM has significantly higher write energy than SRAM. Based on STT-MRAM, this paper introduces an energy-efficient non-volatile in-memory accelerator for a sparse-representation-based face recognition algorithm. We find that by projecting high-dimensional image data to a much lower dimension, current scaling for the STT-MRAM write operation can be applied aggressively, which leads to significant power reduction while maintaining quality of service for face recognition. Specifically, compared with an SRAM baseline, leakage power and dynamic power are reduced by 91.4% and 79%, respectively, with only a slight compromise in recognition rate.

Collaboration


Dive into Leibin Ni's collaborations.

Top Co-Authors

Hao Yu

Nanyang Technological University

Hantao Huang

Nanyang Technological University

Yuhao Wang

Nanyang Technological University

Zichuan Liu

Nanyang Technological University

Mei Yan

Nanyang Technological University
