Junfeng Zhao | Researchain


Publication

Featured research published by Junfeng Zhao.

IEEE Transactions on Nanotechnology | 2015

An Energy-Efficient Nonvolatile In-Memory Computing Architecture for Extreme Learning Machine by Domain-Wall Nanowire Devices

Yuhao Wang; Hao Yu; Leibin Ni; Guang-Bin Huang; Mei Yan; Chuliang Weng; Wei Yang; Junfeng Zhao

Data-oriented applications have increased the demands on memory capacity and bandwidth, which raises the need to rethink the architecture of current computing platforms. The logic-in-memory architecture is highly promising as a future logic-memory integration paradigm for high-throughput data-driven applications. On the memory technology side, the recently introduced nonvolatile domain-wall nanowire (or racetrack) device shows potential not only as a future power-efficient memory but also as a computing element, owing to its unique spintronic physics. This paper explores a novel distributed in-memory computing architecture in which most logic functions are executed within the memory, which significantly alleviates bandwidth congestion and improves energy efficiency. The proposed architecture is built purely from domain-wall nanowires, i.e., both memory and logic are implemented by domain-wall nanowire devices. As a case study, a neural network-based image resolution enhancement algorithm, called DW-NN, is examined within the proposed architecture. We show that all operations involved in machine learning on a neural network can be mapped to a logic-in-memory architecture built from nonvolatile domain-wall nanowires. Domain-wall nanowire-based logic is customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that the domain-wall memory reduces leakage power by 92% and dynamic power by 16% compared with main memory implemented in DRAM, and that domain-wall logic reduces dynamic power by 31% and leakage power by 65% at similar performance compared with CMOS transistor-based logic. System throughput in DW-NN is improved by 11.6x and energy efficiency by 56x compared with a conventional image processing system.

Asia and South Pacific Design Automation Conference | 2016

An energy-efficient matrix multiplication accelerator by distributed in-memory computing on binary RRAM crossbar

Leibin Ni; Yuhao Wang; Hao Yu; Wei Yang; Chuliang Weng; Junfeng Zhao

Emerging resistive random-access memory (RRAM) provides not only non-volatile storage but also intrinsic logic for matrix-vector multiplication, which makes it well suited to low-power, high-throughput data analytics accelerators performed in memory. However, existing RRAM-based computing devices mainly assume multi-level analog computing, whose results are sensitive to process non-uniformity and which incurs additional AD-conversion and I/O overhead. This paper explores a data analytics accelerator on a binary RRAM crossbar. Accordingly, a distributed in-memory computing architecture is proposed, together with the design of the corresponding components and control protocol. Both the memory array and the logic accelerator can be implemented by RRAM crossbars purely in binary, and logic-memory pairs can be distributed using a control-bus protocol. Based on numerical results for fingerprint matching mapped onto the proposed RRAM crossbar, the proposed architecture shows 2.86x faster speed, 154x better energy efficiency, and 100x smaller area compared with the same design implemented as a CMOS-based ASIC.
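As a rough software analogy for the binary matrix-vector multiplication described in the abstract above, the sketch below treats each crossbar cell as a stored bit and each row input as a bit, then sums the active cells per column. It is only an illustration under stated assumptions: the function names `binary_crossbar_mvm` and `threshold`, the row/column orientation, and the thresholding step are hypothetical and are not taken from the paper.

```python
def binary_crossbar_mvm(cells, inputs):
    """Per-column sum of (input bit AND stored bit).

    cells:  list of rows, each a list of 0/1 stored states (hypothetical layout)
    inputs: list of 0/1 values, one per row
    """
    if len(cells) != len(inputs):
        raise ValueError("one input bit is required per crossbar row")
    n_cols = len(cells[0]) if cells else 0
    sums = [0] * n_cols
    for row, x in zip(cells, inputs):
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of columns")
        if x:  # a row that is not driven contributes nothing
            for j, c in enumerate(row):
                sums[j] += c
    return sums


def threshold(sums, ref):
    """Turn each column sum back into a single bit by comparing with ref."""
    return [1 if s >= ref else 0 for s in sums]


if __name__ == "__main__":
    cells = [[1, 0, 1],
             [1, 1, 0],
             [0, 1, 1]]
    inputs = [1, 1, 0]
    sums = binary_crossbar_mvm(cells, inputs)
    print(sums)                # [2, 1, 1]
    print(threshold(sums, 2))  # [1, 0, 0]
```

Keeping every stored value and every output to a single bit is what avoids multi-level analog read-out in this analogy; how the actual hardware accumulates and senses column currents is not modeled here.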

Asia and South Pacific Design Automation Conference | 2014

Energy efficient in-memory machine learning for data intensive image-processing by non-volatile domain-wall memory

Hao Yu; Yuhao Wang; Shuai Chen; Wei Fei; Chuliang Weng; Junfeng Zhao; Zhulin Wei

Image processing in conventional logic-memory I/O-integrated systems incurs significant communication congestion at the memory I/Os because of the excessive volume of big image data at exa-scale. This paper explores an in-memory machine learning architecture on a neural network, called DW-NN, that utilizes the newly introduced domain-wall nanowire. We show that all operations involved in machine learning on a neural network can be mapped to a logic-in-memory architecture built from non-volatile domain-wall nanowires. Domain-wall nanowire-based logic is customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that system throughput in DW-NN is improved by 11.6x and energy efficiency by 92x compared with a conventional image processing system.

International Symposium on Low Power Electronics and Design | 2015

An energy efficient and low cross-talk CMOS sub-THz I/O with surface-wave modulator and interconnect

Yuan Liang; Hao Yu; Junfeng Zhao; Wei Yang; Yuangang Wang

Free-space EM-wave based GHz interconnect suffers from significant loss and crosstalk, so it cannot be deployed as low-power, dense I/O for future network-on-chip (NoC) integration of many-core processors and memory. This paper proposes an energy-efficient and low-crosstalk sub-THz (0.1-1 THz) I/O that uses a surface-wave based modulator and interconnects in CMOS. By introducing a sub-wavelength periodic corrugation structure onto the transmission line, a surface wave is established that propagates the signal strongly localized on the surface of the top-layer metal wire, which results in low coupling into the lossy substrate and neighboring metal wires. As such, significant power saving and crosstalk reduction can be observed together with high communication bandwidth. In addition, a high on/off-ratio surface-wave modulator is proposed to support on-chip THz communication. As designed in 65 nm CMOS, the results show that the proposed surface-wave I/O interface achieves a 25 Gbps data rate and 0.016 pJ/bit/mm energy efficiency at a 140 GHz carrier frequency over 20 mm surface-wave channels. The channels can be placed with 2.4 μm spacing and a -20 dB crosstalk ratio. The surface-wave modulator also achieves a significant reduction of radiation loss with a 23 dB extinction ratio.

International Symposium on Low Power Electronics and Design | 2015

Optimizing Boolean embedding matrix for compressive sensing in RRAM crossbar

Yuhao Wang; Xin Li; Hao Yu; Leibin Ni; Wei Yang; Chuliang Weng; Junfeng Zhao

The emerging resistive random-access memory (RRAM) crossbar provides an intrinsic fabric for matrix-vector multiplication, which can be leveraged as power-efficient linear embedding hardware for data analytics such as compressive sensing. Because the matrix elements are represented by the resistance of RRAM cells, the limited RRAM programming resolution constrains the embedding matrix. A random Boolean embedding can be efficiently mapped to the RRAM crossbar but suffers from poor performance. Learning-based embedding matrices can deliver optimized performance but are continuous-valued, which prevents them from being mapped to the RRAM crossbar structure directly. In this paper, we propose an algorithm that finds an optimal Boolean embedding matrix for a given learned real-valued embedding matrix, so that it can be effectively mapped to the RRAM crossbar structure while high performance is preserved. The numerical experiments demonstrate that the proposed optimized Boolean embedding reduces the embedding distortion by 2.7x and the image recovery error by 2.5x compared with the random Boolean embedding, both mapped onto an RRAM crossbar. In addition, the optimized Boolean embedding on an RRAM crossbar exhibits 10x faster speed, 17x better energy efficiency, and three orders of magnitude smaller area, with a slight accuracy penalty, compared with the optimized real-valued embedding on a CMOS ASIC platform.
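To make the mapping problem in the abstract above concrete, the sketch below shows the naive alternative: thresholding a real-valued embedding matrix into Boolean entries and applying it as a linear embedding. This is explicitly not the optimization algorithm the paper proposes, which is not described in the abstract; the names `binarize` and `embed`, the threshold of 0.0, and the sample matrix are all hypothetical and chosen only for illustration.

```python
def binarize(matrix, threshold=0.0):
    """Naive Boolean mapping: 1 where an entry exceeds threshold, else 0.

    A stand-in baseline only; it makes no attempt to preserve the
    performance of the learned matrix.
    """
    return [[1 if v > threshold else 0 for v in row] for row in matrix]


def embed(matrix, x):
    """Linear embedding y = A x for a matrix given as a list of rows."""
    if any(len(row) != len(x) for row in matrix):
        raise ValueError("every matrix row must match the input length")
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


if __name__ == "__main__":
    learned = [[0.3, -0.2, 0.7],   # hypothetical learned real-valued matrix
               [-0.1, 0.8, -0.5]]
    boolean = binarize(learned)
    x = [2.0, 3.0, 1.0]
    print(boolean)             # [[1, 0, 1], [0, 1, 0]]
    print(embed(learned, x))   # real-valued embedding of x
    print(embed(boolean, x))   # Boolean embedding of the same x
```

Comparing the two printed embeddings shows how much information plain thresholding can discard, which is the gap an optimized Boolean matrix is meant to close.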

Asia and South Pacific Design Automation Conference | 2015

Heterogeneous architecture design with emerging 3D and non-volatile memory technologies

Qiaosha Zou; Matthew Poremba; Rui He; Wei Yang; Junfeng Zhao; Yuan Xie

Energy has become the primary concern in today's multi-core architecture designs. Moore's law predicts that an exponentially increasing number of cores can be packed into a single chip every two years; however, increasing power density is an obstacle to continued performance gains. Recent studies show that heterogeneous multi-core design is a promising and competitive solution for optimizing performance per watt. In this paper, different types of heterogeneous architecture are discussed. For each type, current challenges and the latest solutions are briefly introduced. Preliminary analyses are performed to illustrate the scalability of the heterogeneous system and its potential benefits for future application requirements. Moreover, we demonstrate the advantages of leveraging three-dimensional (3D) integration in heterogeneous architectures. With 3D die stacking, disparate technologies, such as CMOS logic and emerging non-volatile memory, can be integrated on the same chip, enabling a new paradigm of architecture design.

IEEE MTT-S International Microwave Workshop Series on Advanced Materials and Processes for RF and THz Applications | 2015

CMOS sub-THz on-chip communication with SRR modulator and SPP interconnect

Yuan Liang; Hao Yu; Chang Yang; Nan Li; Xiuping Li; Xiong Liu; Junfeng Zhao; Wei Yang; Yuangang Wang

Two novel metamaterial devices, a Split Ring Resonator (SRR) modulator and a Surface Plasmon Polariton (SPP) interconnect (comprising an SPP T-line and coupler), are proposed for CMOS on-chip integration operating at 140 GHz. By introducing a sub-wavelength periodic corrugation structure onto the T-line, an SPP is established that propagates signals as a strongly localized surface wave, which results in low crosstalk between two SPP T-lines placed back to back. Moreover, by stacking two SRR unit cells with opposite placement, the SRR-based modulator behaves as a magnetic metamaterial and achieves a significant reduction of radiation loss with a 23 dB extinction ratio at sub-THz. As explored in 65 nm CMOS, the proposed surface-wave interconnects and SRR modulator show great potential for future sub-THz wireline communication in CMOS.

Design, Automation, and Test in Europe | 2015

An energy-efficient non-volatile in-memory accelerator for sparse-representation based face recognition

Yuhao Wang; Hantao Huang; Leibin Ni; Hao Yu; Mei Yan; Chuliang Weng; Wei Yang; Junfeng Zhao

Data analytics applications such as face recognition involve large volumes of image data and hence pose a grand challenge for mobile platform design under strict power requirements. Emerging non-volatile STT-MRAM has minimal leakage power and speed comparable to SRAM, and hence is considered a promising candidate for data-oriented mobile computing. However, STT-MRAM has significantly higher write energy than SRAM. Based on STT-MRAM, this paper introduces an energy-efficient non-volatile in-memory accelerator for a sparse-representation based face recognition algorithm. We find that by projecting high-dimensional image data to a much lower dimension, current scaling for the STT-MRAM write operation can be applied aggressively, which leads to significant power reduction while maintaining quality of service for face recognition. Specifically, compared with an SRAM baseline, leakage power and dynamic power are reduced by 91.4% and 79%, respectively, with only a slight compromise in recognition rate.

International Symposium on Low Power Electronics and Design | 2014

Intelligent frame refresh for energy-aware display subsystems in mobile devices

Yongbing Huang; Mingyu Chen; Lixin Zhang; Shihai Xiao; Junfeng Zhao; Zhulin Wei

Frame refreshes, which are used to retain frame images from frame buffers for display subsystems in mobile devices, waste energy and memory bandwidth. In this paper, we propose an intelligent frame refresh mechanism that reduces redundant frame refreshes and useless data accesses to frame buffers. It bridges the semantic gap between frame buffers and frame refreshes and exploits knowledge of the frame buffers to guide frame refreshes. Based on this mechanism, we introduce two detailed schemes that optimize refreshes by utilizing different information. The flipping-aware frame refresh scheme uses frame buffer switching operations to detect frame image updates and trigger useful refreshes. The row-level frame refresh scheme refreshes only the modified rows instead of the whole frame, guided by the pixel status information of the frame buffers. Our evaluation results show that the proposed mechanism can reduce memory requests by nearly 50% and memory power consumption by up to 30% compared with a conventional fixed frame refresh mechanism.
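The two schemes in the abstract above can be sketched in software as follows. This is a minimal illustration under assumptions, not the paper's mechanism: the functions `modified_rows` and `plan_refresh` are hypothetical, frames are modeled as lists of rows, and a direct row comparison stands in for the pixel status information that the actual row-level scheme relies on.

```python
def modified_rows(prev_frame, new_frame):
    """Return indices of rows whose contents differ between two frames.

    Direct comparison is an assumption made for this sketch; it stands in
    for the pixel status information mentioned in the abstract.
    """
    if len(prev_frame) != len(new_frame):
        raise ValueError("frames must have the same number of rows")
    return [i for i, (old, new) in enumerate(zip(prev_frame, new_frame))
            if old != new]


def plan_refresh(buffer_flipped, prev_frame, new_frame):
    """Decide which rows to refresh.

    Flipping-aware step: with no buffer flip, the image is assumed
    unchanged and nothing is refreshed.
    Row-level step: after a flip, only the modified rows are refreshed.
    """
    if not buffer_flipped:
        return []
    return modified_rows(prev_frame, new_frame)


if __name__ == "__main__":
    prev = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
    new = [[0, 0, 0], [1, 9, 1], [2, 2, 2]]
    print(plan_refresh(False, prev, new))  # []
    print(plan_refresh(True, prev, new))   # [1]
```

In this toy case a fixed-rate refresh would read all three rows on every cycle, while the sketch reads one row after a flip and none otherwise.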

International Conference on Computer Communications and Networks | 2014

Dandelion: A locally-high-performance and globally-high-scalability hierarchical data center network

Binzhang Fu; Sheng Xu; Wentao Bao; Guolong Jiang; Mingyu Chen; Lixin Zhang; Yidong Tao; Rui He; Junfeng Zhao

Increasing customer demand is driving modern data centers to embrace freely-expandable network architectures. Unfortunately, state-of-the-art freely-expandable networks suffer from either a large granularity of expansion or a prohibitive implementation cost. Furthermore, recent research has shown that data center traffic tends to be highly clustered. Based on these observations, this paper proposes a freely-expandable network architecture named Dandelion. Dandelion is a two-level hierarchical network, where the first level aims at “high performance” and the second level aims at “high scalability”. The resulting network has two distinct advantages. First, it can expand arbitrarily with a reasonable granularity. Second, the router architecture is efficient as well as highly scalable, since 1) the routing table is significantly compressed and 2) a fixed number of virtual channels per physical channel is required regardless of the network size. Finally, the traffic characteristics of four typical cloud applications are analyzed, and the generated traffic patterns are used to evaluate the proposed network architecture. Simulation results show that Dandelion is a promising network architecture for future data centers.
