Mengjie Mao | Researchain


Publication

Featured research published by Mengjie Mao.

design automation conference | 2015

RENO: a high-efficient reconfigurable neuromorphic computing accelerator design

Xiaoxiao Liu; Mengjie Mao; Beiye Liu; Hai Li; Yiran Chen; Boxun Li; Yu Wang; Hao Jiang; Mark Barnell; Qing Wu; Jianhua Yang

Neuromorphic computing has recently gained significant attention as a promising candidate to overcome the well-known von Neumann bottleneck. In this work, we propose RENO, an efficient reconfigurable neuromorphic computing accelerator. RENO leverages the extremely efficient mixed-signal computation capability of memristor-based crossbar (MBC) arrays to speed up the execution of artificial neural networks (ANNs). The hierarchically arranged MBC arrays can be configured to a variety of ANN topologies through a mixed-signal interconnection network (M-Net). Simulation results on seven ANN applications show that, compared to the baseline general-purpose processor, RENO achieves on average 178.4× (27.06×) performance speedup and 184.2× (25.23×) energy savings in the high-efficiency multilayer perceptron (high-accuracy auto-associative memory) implementation. Moreover, in comparison with a pure digital neural processing unit (D-NPU) and a design with MBC arrays cooperating through a digital interconnection network, RENO still achieves the fastest execution time and the lowest energy consumption with similar computation accuracy.
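As a rough illustration of the mixed-signal computation that an MBC array provides, the sketch below models an idealized crossbar in which each column current is the sum of row voltages weighted by cell conductances (Ohm's law plus Kirchhoff's current law). It is not taken from the RENO design: the function name `crossbar_currents` and the example values are hypothetical, and device variation, wire resistance, and signal conversion are ignored.

```python
def crossbar_currents(voltages, conductances):
    """Idealized crossbar: column current I_j = sum_i V_i * G[i][j]."""
    rows = len(conductances)
    if rows != len(voltages):
        raise ValueError("one input voltage is needed per crossbar row")
    cols = len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


# Hypothetical 3x2 conductance matrix (siemens) and row voltages (volts).
G = [
    [1e-3, 2e-3],
    [0.0, 1e-3],
    [2e-3, 0.0],
]
V = [0.5, 1.0, 0.25]

# One analog step yields a full vector-matrix product.
currents = crossbar_currents(V, G)
```

The whole vector-matrix product is produced in a single analog step in this model, which is the property that such accelerators exploit for ANN layers.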

design automation conference | 2014

Exploration of GPGPU Register File Architecture Using Domain-wall-shift-write based Racetrack Memory

Mengjie Mao; Wujie Wen; Yaojun Zhang; Yiran Chen; Hai Helen Li

SRAM-based register file (RF) is one of the major factors limiting the scaling of GPGPUs. In this work, we propose to use the emerging nonvolatile domain-wall-shift-write based racetrack memory (DWSW-RM) to implement a power-efficient GPGPU RF with substantially reduced power consumption. A holistic technology set is developed to minimize the high access cost of DWSW-RM caused by its sequential access mechanism. Experimental results show that our proposed techniques improve GPGPU performance by 4.6% compared to the baseline with SRAM-based RF. The RF energy efficiency is also significantly improved, by 2.45×.

international conference on computer aided design | 2013

Unleashing the potential of MLC STT-RAM caches

Xiuyuan Bi; Mengjie Mao; Danghui Wang; Hai Li

In this paper, we study the use of multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) in the cache design of embedded systems and microprocessors. Compared to the single-level cell (SLC) design, an MLC STT-RAM cache is expected to offer higher density and faster system performance. However, cell design constraints, such as the switching current requirement and the asymmetry in write operations, severely limit the density benefit of conventional MLC STT-RAM. The two-step read/write accesses and inflexible data mapping strategy in the existing MLC STT-RAM cache architecture may even degrade system performance. To unleash the real potential of MLC STT-RAM cache, we propose a cross-layer solution. First, we introduce the reverse magnetic tunnel junction (MTJ) into the MLC cell design, which offers a more balanced device and design tradeoff and enables 2× the storage density of SLC. At the architectural level, we propose a cell split mapping method that divides cache lines into fast and slow regions, together with data migration policies that allocate frequently used data to the fast regions. Furthermore, an application-aware speed enhancement mode is used to adaptively trade off cache capacity and speed, satisfying the different requirements of various applications. Simulation results show that the proposed techniques improve system performance by 10.3% and reduce cache energy consumption by 26.0% compared with conventional MLC STT-RAM.

international conference on computer aided design | 2013

CD-ECC: content-dependent error correction codes for combating asymmetric nonvolatile memory operation errors

Wujie Wen; Mengjie Mao; Xiaochun Zhu; Seung H. Kang; Danghui Wang; Yiran Chen

The write operation asymmetry of many memory technologies causes different write failure rates at 0 → 1 and 1 → 0 bit-flippings. Conventional error correction codes (ECCs) spend the same effort on both bit-flipping directions, leading to very unbalanced write reliability enhancement over different bit-flipping distributions of codewords (i.e., the number of 0 → 1 or 1 → 0 bit-flippings). In this work, we develop an analytic asymmetric write channel (AWC) model to analyze the asymmetric write errors in spin-transfer torque random access memory (STT-RAM) designs. A new ECC design concept, namely content-dependent ECC (CD-ECC), is proposed to achieve balanced error correction in both bit-flipping directions. Two CD-ECC schemes, typical-corner-ECC (TCE) and worst-corner-ECC (WCE), are designed for codewords with different bit-flipping distributions. Our simulation results show that, compared to common ECC schemes used in embedded applications such as Hamming code, CD-ECCs can improve STT-RAM write reliability by 10 - 30× with low hardware overhead and very marginal impact on system performance.
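The "bit-flipping distribution" of a codeword mentioned above can be made concrete with a small sketch that counts how many bits switch in each direction when a stored word is overwritten. This only illustrates the quantity being discussed; it is not the CD-ECC algorithm, and the function name `count_flips` and the sample words are hypothetical.

```python
def count_flips(old, new, width):
    """Return (zero_to_one, one_to_zero) flips when `old` is overwritten by `new`."""
    mask = (1 << width) - 1
    old &= mask
    new &= mask
    zero_to_one = bin(~old & new & mask).count("1")  # bits that were 0 and become 1
    one_to_zero = bin(old & ~new & mask).count("1")  # bits that were 1 and become 0
    return zero_to_one, one_to_zero


# Hypothetical 4-bit words: one flip in each direction.
balanced = count_flips(0b1100, 0b1010, 4)
# All flips in the 0 -> 1 direction.
one_sided = count_flips(0b0000, 0b1011, 4)
```

A write whose flips are concentrated in the less reliable direction is the case where a direction-agnostic code would be stretched the most, which is the imbalance the abstract describes.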

design automation conference | 2014

State-Restrict MLC STT-RAM Designs for High-Reliable High-Performance Memory System

Wujie Wen; Yaojun Zhang; Mengjie Mao; Yiran Chen

Multi-level Cell Spin-Transfer Torque Random Access Memory (MLC STT-RAM) is a promising nonvolatile memory technology for high-capacity and high-performance applications. However, reliability concerns and the complicated access mechanism greatly hinder the application of MLC STT-RAM. In this work, we develop a holistic solution set, namely state-restrict MLC STT-RAM (SR-MLC STT-RAM), to improve the data integrity and performance of MLC STT-RAM with minimal information density degradation. Three techniques, state restriction (StatRes), error pattern removal (ErrPR), and ternary coding (TerCode), are proposed at the circuit level to reduce the read and write errors of MLC STT-RAM cells. A state pre-recovery (PreREC) technique is also developed at the architecture level to improve the access performance of SR-MLC STT-RAM by eliminating unnecessary two-step write operations. Our simulations show that, compared to conventional MLC STT-RAM, SR-MLC STT-RAM can enhance the write and read reliability of memory cells by 10 - 10000×, allowing the application of simple error correction code schemes. Compared to single-level-cell (SLC) STT-RAM, an SR-MLC STT-RAM based cache design can boost system performance by 6.2% on average by leveraging the increased cache capacity at the same area and the improved write latency.

great lakes symposium on vlsi | 2013

Coordinating prefetching and STT-RAM based last-level cache management for multicore systems

Mengjie Mao; Hai Helen Li; Yiran Chen

Data prefetching is a common mechanism to mitigate the bottleneck of off-chip memory bandwidth in modern computing systems. Unfortunately, prefetching has side effects: an additional burden on off-chip communication and more cache write operations. With the proposal of spin-transfer torque random access memory (STT-RAM) based last-level caches (LLCs) for their high density and low power consumption, the increased write pressure on the cache from prefetching, coupled with write accesses that are characteristically longer than in traditional SRAM caches, exacerbates the performance cost of prefetching schemes. In this work, we propose two orthogonal techniques to reduce the negative performance impact induced by aggressive prefetching on multicore systems employing an STT-RAM based LLC. First, basic priority assignment prioritizes the different types of LLC access requests by their criticality and responds to them based on priority. Second, priority boosting differentiates requests by application and prioritizes the relatively few requests from applications with non-intensive accesses to the LLC, which usually suffer the most severe performance degradation in multicore systems. Combining these two prioritization policies can alleviate the negative effect induced by aggressive prefetching. Our results show that these techniques can achieve an 8.3 average application speedup compared to a baseline, prefetch-only design without prioritization.
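A minimal sketch of how the two prioritization policies could combine in a request queue is given below. The abstract does not state the actual criticality ranking or how boosting interacts with it, so everything here is an assumption: the ranking in `BASE_PRIORITY`, the rule that boosted applications are served before all others, and the names `LlcRequestQueue` and `light_apps` are all hypothetical.

```python
import heapq
from itertools import count

# Assumed criticality ranking (lower value is served first); not from the paper.
BASE_PRIORITY = {"demand_read": 0, "write_back": 1, "prefetch": 2}


class LlcRequestQueue:
    """Toy LLC request queue: boosted applications first, then request type, then arrival."""

    def __init__(self, light_apps):
        # Applications assumed to access the LLC non-intensively get boosted.
        self.light_apps = set(light_apps)
        self._heap = []
        self._arrival = count()

    def push(self, app, kind):
        boost = 0 if app in self.light_apps else 1
        key = (boost, BASE_PRIORITY[kind], next(self._arrival))
        heapq.heappush(self._heap, (key, app, kind))

    def pop(self):
        _, app, kind = heapq.heappop(self._heap)
        return app, kind


queue = LlcRequestQueue(light_apps={"B"})
queue.push("A", "prefetch")
queue.push("A", "demand_read")
queue.push("B", "prefetch")
queue.push("A", "write_back")
served = [queue.pop() for _ in range(4)]
```

In this toy ordering the single request from the light application `B` is served ahead of the backlog from `A`, and requests from `A` are then served by assumed criticality rather than arrival order.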

asia and south pacific design automation conference | 2015

An efficient STT-RAM-based register file in GPU architectures

Xiaoxiao Liu; Mengjie Mao; Xiuyuan Bi; Hai Li; Yiran Chen

Modern GPGPUs employ a large register file (RF) to efficiently process heavily parallel threads in single instruction multiple thread (SIMT) fashion. The up-scaling of RF capacity, however, is greatly constrained by the large cell area and high leakage power consumption of the SRAM implementation. In this work, we propose a novel GPU RF design based on the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology. Compared to SRAM, MLC STT-RAM (or MLC-STT) has a much smaller cell area and almost zero standby power due to its non-volatility. Moreover, by leveraging the asymmetric performance of the soft and the hard bits of an MLC-STT cell, we propose a remapping strategy that flexibly trades off the access time and the capacity of the RF based on run-time access patterns. A novel rescheduling scheme is also developed to minimize the time issued warps wait to access register banks. Experimental results over ISPASS2009 and CUDA benchmarks show that, on average, our proposed MLC-STT RF can achieve 3.28% performance improvement, 9.48% energy reduction, and 38.9% energy efficiency enhancement compared to the conventional SRAM-based design.

ieee high performance extreme computing conference | 2014

A heterogeneous computing system with memristor-based neuromorphic accelerators

Xiaoxiao Liu; Mengjie Mao; Hai Li; Yiran Chen; Hao Jiang; Jianhua Yang; Qing Wu; Mark Barnell

As technology scales, on-chip heterogeneous architecture emerges as a promising solution to combat the power wall of microprocessors. In this work, we propose a heterogeneous computing system with memristor-based neuromorphic computing accelerators (NCAs). In the proposed system, the NCA is designed to speed up artificial neural network (ANN) executions in many high-performance applications by leveraging the extremely efficient mixed-signal computation capability of nanoscale memristor-based crossbar (MBC) arrays. The hierarchical MBC arrays of the NCA can be flexibly configured to different ANN topologies with the help of an analog Network-on-Chip (A-NoC). A general approach that translates the target code within a program to the corresponding NCA instructions is also developed to facilitate the use of the NCA. Our simulation results show that, compared to the baseline general-purpose processor, the proposed system can achieve on average 18.2× performance speedup and 20.1× energy reduction over nine representative applications. The computation accuracy degradation is constrained within an acceptable range (e.g., 11%) when the limited data precision, realistic device variations, and analog signal fluctuations are considered.

IEEE Transactions on Circuits and Systems | 2016

Harmonica: A Framework of Heterogeneous Computing Systems With Memristor-Based Neuromorphic Computing Accelerators

Xiaoxiao Liu; Mengjie Mao; Beiye Liu; Boxun Li; Yu Wang; Hao Jiang; Mark Barnell; Qing Wu; Jianhua Yang; Hai Li; Yiran Chen

Following technology scaling, on-chip heterogeneous architecture emerges as a promising solution to combat the power wall of microprocessors. This work presents Harmonica, a framework of heterogeneous computing systems enhanced by memristor-based neuromorphic computing accelerators (NCAs). In Harmonica, a conventional pipeline is augmented with an NCA that is designed to speed up artificial neural network (ANN) relevant executions by leveraging the extremely efficient mixed-signal computation capability of nanoscale memristor-based crossbar (MBC) arrays. With the help of a mixed-signal interconnection network (M-Net), the hierarchically arranged MBC arrays can accelerate the computation of a variety of ANNs. Moreover, an inline calibration scheme is proposed to keep the computation accuracy degradation incurred by memristor resistance shifting within an acceptable range during NCA executions. Compared to a general-purpose processor, Harmonica can achieve on average 27.06× performance speedup and 25.23× energy savings when the NCA is configured with the auto-associative memory (AAM) implementation. If the NCA is configured with the multilayer perceptron (MLP) implementation, the performance speedup and energy savings can be boosted to 178.41× and 184.24×, respectively, with slightly degraded computation accuracy. Moreover, the performance and power efficiency of Harmonica are superior to designs with either digital neural processing units (D-NPUs) or MBC arrays cooperating through a digital interconnection network. Compared to the general-purpose processor baseline, the classification rate degradation of Harmonica in MLP or AAM is less than 8% or 4%, respectively.

design, automation, and test in europe | 2016

A holistic tri-region MLC STT-RAM design with combined performance, energy, and reliability optimizations

Wujie Wen; Mengjie Mao; Hai Li; Yiran Chen; Yukui Pei; Ning Ge

Multi-level cell spin-transfer torque random access memory (MLC STT-RAM) demonstrates great potential in on-chip cache design for its high storage density and non-volatility, but also suffers from degraded access time, reliability, and energy efficiency. Existing MLC STT-RAM cache designs primarily focus on performance and energy optimizations and often ignore the crucial demand for reliability. In this work, we propose a tri-region MLC STT-RAM cache design (TMSC) to simultaneously meet the requirements of performance, energy, and reliability. The tri-region MLC STT-RAM cache is optimally partitioned into fast, mixed, and slow ways according to their different access performance, energy, and reliability. A new error correction code (ECC) scheme, namely non-uniform strength ECC (NUS-ECC), is also developed to tolerate the different bit failure rates in these ways. Compared to the latest performance-driven MLC STT-RAM cache design with a pessimistic ECC scheme, our TMSC technique can improve system performance and energy by 9.3% and 9.4% on average, respectively, for various applications. The additional area cost associated with NUS-ECC is limited to 3.2% compared to the pessimistic ECC scheme.
