Xiuyuan Bi
University of Pittsburgh
Publication
Featured research published by Xiuyuan Bi.
international symposium on microarchitecture | 2011
Zhenyu Sun; Xiuyuan Bi; Hai Helen Li; Weng-Fai Wong; Zhong-Liang Ong; Xiaochun Zhu; Wenqing Wu
Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed because it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 to 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-consuming write operations. In this paper, we propose both L1 and lower level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with various data retention times and write performances, made possible by different magnetic tunneling junction (MTJ) designs. For the fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of refresh energy compared to the simple refresh scheme proposed in previous works. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme achieves a 9.2% performance improvement and saves up to 30% of the total energy when compared to one that uses traditional SRAM. For lower level caches with relatively large cache capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30% to 70% compared to previous works, while improving IPC performance for both 2-level and 3-level cache hierarchies.
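The idea of a counter-controlled refresh can be illustrated with a minimal sketch. It is not the design from the paper: the class names, the tick-based time model, and the one-tick safety margin are all assumptions made for illustration. Each valid block carries a counter of ticks since its last write or refresh, and only blocks about to exceed the retention time are refreshed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    valid: bool = False
    age: int = 0  # ticks since the last write or refresh


class RefreshedCache:
    """Hypothetical model of per-block, counter-driven refresh."""

    def __init__(self, num_blocks, retention_ticks):
        self.blocks = [Block() for _ in range(num_blocks)]
        self.limit = retention_ticks - 1  # assumed margin: refresh one tick early
        self.refreshes = 0

    def write(self, index):
        # A normal write restores the cell state, so the counter restarts.
        block = self.blocks[index]
        block.valid = True
        block.age = 0

    def tick(self):
        # Advance time; refresh only valid blocks that are close to expiry.
        for block in self.blocks:
            if not block.valid:
                continue
            block.age += 1
            if block.age >= self.limit:
                block.age = 0
                self.refreshes += 1


cache = RefreshedCache(num_blocks=4, retention_ticks=5)
cache.write(0)
for _ in range(8):
    cache.tick()
print(cache.refreshes)
```

In this toy model, invalid blocks and recently written blocks never trigger a refresh, which is the intuition behind saving energy relative to refreshing every block on a fixed schedule.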
international conference on computer aided design | 2012
Xiuyuan Bi; Zhenyu Sun; Hai Li; Wenqing Wu
Using the spin-transfer torque random access memory (STT-RAM) technology as lower level on-chip caches has been proposed to minimize leakage power consumption and enhance cache capacity at scaled technologies. However, programming STT-RAM is a stochastic process due to random thermal fluctuations. Conventional worst-case (corner) design with a fixed write pulse period cannot completely eliminate write failures; it can only keep them at a low level, at a high cost in hardware complexity and system performance. In this work, we systematically study the impacts of the stochastic switching of STT-RAM on circuit and cache performance. Two probabilistic design techniques, write-verify-rewrite with adaptive period (WRAP) and verify-one-while-writing (VOW), are then proposed for performance improvement and write failure reduction. Our simulation results show that, compared to the conventional design using Hamming Code to correct write failures, WRAP is write error free while reducing the cache write latency and energy consumption by 40% and 26%, respectively. When an extremely low write failure rate (i.e., 10^-22) is allowed, VOW can further boost the reductions in write latency and energy to 52% and 29%, respectively. Furthermore, a hybrid STT-RAM based cache hierarchy taking advantage of probabilistic design techniques is proposed. The novel hierarchy can reduce the write failure rate of the STT-RAM cache to 10^-30, while improving the speed by 6.8% and saving 15% of energy consumption compared to a conventional design with Hamming Code.
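A generic write-verify-rewrite loop over a stochastic cell can be sketched as follows. This only illustrates the basic loop, not the adaptive-period part of WRAP or the VOW technique, whose details are not given here. The function name, the per-pulse switching probability `p_switch`, and the attempt budget are assumptions for illustration.

```python
import random


def write_verify_rewrite(target_bit, p_switch, rng, max_attempts=8):
    """Return (stored_bit, attempts) after a write-verify-rewrite loop.

    Assumed model: each write pulse switches the cell to target_bit with
    probability p_switch, independently of earlier pulses.
    """
    stored_bit = 1 - target_bit  # start from the opposite value, so a write is needed
    for attempt in range(1, max_attempts + 1):
        if rng.random() < p_switch:   # stochastic switching event
            stored_bit = target_bit
        if stored_bit == target_bit:  # verify: read back and compare
            return stored_bit, attempt
    return stored_bit, max_attempts   # still wrong after the attempt budget


rng = random.Random(0)  # fixed seed so the run is repeatable
results = [write_verify_rewrite(1, 0.9, rng) for _ in range(10_000)]
failures = sum(1 for bit, _ in results if bit != 1)
mean_attempts = sum(n for _, n in results) / len(results)
print(failures, mean_attempts)
```

The point of the sketch is that most writes finish after a single short pulse, and only the rare failed writes pay for extra pulses, instead of every write paying for a worst-case pulse width.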
international conference on computer aided design | 2013
Xiuyuan Bi; Mengjie Mao; Danghui Wang; Hai Li
In this paper, we study the use of multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) in cache design of embedded systems and microprocessors. Compared to the single-level cell (SLC) design, an MLC STT-RAM cache is expected to offer higher density and faster system performance. However, the cell design constraints, such as the switching current requirement and asymmetry in write operations, severely limit the density benefit of the conventional MLC STT-RAM. The two-step read/write accesses and inflexible data mapping strategy in the existing MLC STT-RAM cache architecture may even result in system performance degradation. To unleash the real potential of MLC STT-RAM cache, we propose a cross-layer solution. First, we introduce the reverse magnetic tunneling junction (MTJ) into MLC cell design, which offers a more balanced device and design tradeoff and enables 2x the storage density of SLC. At the architectural level, we propose a cell split mapping method to divide cache lines into fast and slow regions, and data migration policies to allocate frequently used data to fast regions. Furthermore, an application-aware speed enhancement mode is utilized to adaptively trade off cache capacity and speed, satisfying the different requirements of various applications. Simulation results show that the proposed techniques can improve system performance by 10.3% and reduce cache energy consumption by 26.0% compared with conventional MLC STT-RAM.
IEEE Transactions on Very Large Scale Integration Systems | 2014
Zhenyu Sun; Xiuyuan Bi; Hai Li; Weng-Fai Wong; Xiaochun Zhu
Spin-transfer torque random access memory (STT-RAM) is the most promising candidate to be universal memory due to its good scalability, zero standby power, and radiation hardness. Its cell area, only 1/9 to 1/3 that of SRAM, allows for a much larger cache with the same die footprint. Such a reduction in cell size can significantly shrink the cache array size, leading to significant improvement in overall system performance and power consumption, especially in this multicore era where locality is crucial. However, deploying STT-RAM technology in L1 caches is challenging because write operations on STT-RAM are slow and power-consuming. In this paper, we propose a range of cache hierarchy designs implemented entirely using STT-RAM that deliver optimal power saving and performance. In particular, our designs use STT-RAM cells with various data retention times and write performances, made possible by novel magnetic tunneling junction designs. For L1 caches where speed is of utmost importance, we propose a scheme that uses fast STT-RAM cells with reduced data retention time coupled with a dynamic refresh scheme. In the dynamic refresh scheme, another emerging technology, the memristor, is used as the counter to monitor the data retention of the low-retention STT-RAM, achieving a higher array area efficiency than an SRAM-based counter. For lower level caches with relatively larger cache capacities, we propose a design that has partitions of different retention characteristics, and a data migration scheme that moves data between these partitions. The experiments show that, on average, our proposed multiretention level STT-RAM cache reduces total energy by 30% to 74.2% compared to previous single retention level STT-RAM caches, while improving instructions per cycle performance for both two-level and three-level cache hierarchies.
international symposium on low power electronics and design | 2012
Zhenyu Sun; Xiuyuan Bi; Hai Li
The spin-transfer torque random access memory (STT-RAM) has gained increasing attention for its high density, fast read access, zero standby power, and good scalability. The recently proposed retention-relax design further improves STT-RAM write access performance and makes it even more promising as an on-chip memory technology. Nevertheless, process variations can affect the writability of STT-RAM cells. The situation for the retention-relax design is even more severe. In this paper, we comprehensively study the impact of process variations, including those from both CMOS and magnetic technologies, on key STT-RAM design parameters. Furthermore, we propose a process-variation-aware nonuniform cache access (PVA-NUCA) technique for large STT-RAM cache design. Besides the varying interconnect latencies determined by memory locations, PVA-NUCA compensates for write time variations of STT-RAM cells resulting from process variations. Two algorithms, namely conservative promotion and aggressive prediction, have been introduced and evaluated. A conflict-reduction mechanism is utilized to reduce the data access miss rate caused by conflicts among access-intensive data blocks. Compared to the traditional STT-RAM dynamic nonuniform cache access (DNUCA), our proposed dynamic PVA-NUCA improves IPC performance by 25.29% and reduces STT-RAM cache energy consumption by 26.4%, with less than 1% area overhead.
asia and south pacific design automation conference | 2015
Xiaoxiao Liu; Mengjie Mao; Xiuyuan Bi; Hai Li; Yiran Chen
Modern GPGPUs employ a large register file (RF) to efficiently process heavily parallel threads in single instruction multiple thread (SIMT) fashion. The up-scaling of RF capacity, however, is greatly constrained by the large cell area and high leakage power consumption of the SRAM implementation. In this work, we propose a novel GPU RF design based on the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology. Compared to SRAM, MLC STT-RAM (or MLC-STT) has a much smaller cell area and almost zero standby power due to its non-volatility. Moreover, by leveraging the asymmetric performance of the soft and the hard bits of an MLC-STT cell, we propose a remapping strategy to perform a flexible tradeoff between the access time and the capacity of the RF based on run-time access patterns. A novel rescheduling scheme is also developed to minimize the waiting time of the issued warps to access register banks. Experimental results over ISPASS2009 and CUDA benchmarks show that, on average, our proposed MLC-STT RF can achieve 3.28% performance improvement, 9.48% energy reduction, and 38.9% energy efficiency enhancement compared to the conventional SRAM-based design.
design, automation, and test in europe | 2013
Xiuyuan Bi; Mohamed Anis Weldon; Hai Li
The spin-transfer torque random access memory (STT-RAM) has been widely investigated as a promising candidate to replace the static random access memory (SRAM) as on-chip cache memory. However, the existing STT-RAM cell designs can be used for only single-port accesses, which limits the memory access bandwidth and constrains system performance. In this work, we propose design solutions to provide dual-port accesses for STT-RAM. The area increase from introducing an additional port is reduced by leveraging the shared source-line structure. Detailed analysis of the performance and reliability degradation caused by dual-port accesses, and the corresponding design optimization, are performed. We propose two types of dual-port STT-RAM cell structures, having 2 read/write ports (2RW) or 1 read port and 1 write port (1R/1W), respectively. Comparison shows that a 2RW STT-RAM cell occupies only 42% of the area of a dual-port SRAM. The 1R/1W design further reduces cell area by 7.7% under the same performance target.
IEEE Transactions on Magnetics | 2012
Xiuyuan Bi; Hai Li; Xiaobin Wang
In spin-transfer torque random access memory (STT-RAM), temperature fluctuations can significantly affect the characteristics of both electrical and magnetic devices. In this paper, we analyze their temperature dependence and investigate the impacts of temperature fluctuations on read and write operations. For the regular STT-RAM design operating between 300 and 375 K, the sense margin in read operation can degrade by about 25% and the write speed drops by around 20%. Furthermore, heating the MTJ to its Curie temperature can dramatically reduce the switching time to the subnanosecond range. The design feasibility and implications are also discussed in the paper.
international symposium on low power electronics and design | 2014
Zhenyu Sun; Xiuyuan Bi; Hai Li
The recent successful integration of magnetic racetrack memory forecasts a new computing era with unprecedentedly high-density on-chip storage. However, racetrack memory accesses require frequent magnetic domain shifting, introducing overheads in access latency and energy consumption. In this paper, we evaluate and compare several different physical layout strategies and array organizations. From this evaluation, a workload-oriented racetrack LLC architecture is proposed that combines different array types, each of which is tailored to a specific data access pattern. Further, a resizable cache access strategy is applied to reduce shifting overheads at runtime. Our simulation results show that, compared with the leading racetrack-based cache, the proposed racetrack LLC can improve system performance by 13.2% and reduce LLC energy consumption by 30.4%.
IEEE Transactions on Computers | 2016
Zhenyu Sun; Xiuyuan Bi; Wenqing Wu; Sungjoo Yoo; Hai Helen Li
As the descendant of spin-transfer torque random access memory (STT-RAM), racetrack memory technology saves data in magnetic domains along nanoscopic wires. Such a unique structure can achieve unprecedentedly high storage density while inheriting the promising features of STT-RAM, such as fast access speed, non-volatility, zero standby power, hardness to soft errors, and compatibility with CMOS technology. Moreover, the recent success in planar racetrack nanowires promises fabrication feasibility and continued scalability. In this paper, we investigate the design and optimization of racetrack memory as last-level cache by embracing design considerations across multiple abstraction layers, including the cell design, the array structure, the architecture organization, and the data management. The cross-layer optimization makes the racetrack memory based last-level cache achieve a 6.4× reduction in area, a 25 percent enhancement in system performance, and a 62 percent saving in energy consumption, compared to an STT-RAM cache design. Its benefit over SRAM technology is even more significant.