Bo Zhao
University of Pittsburgh
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Bo Zhao.
international symposium on computer architecture | 2009
Ping Zhou; Bo Zhao; Jun Yang; Youtao Zhang
Using nonvolatile memories in memory hierarchy has been investigated to reduce its energy consumption because nonvolatile memories consume zero leakage power in memory cells. One of the difficulties is, however, that the endurance of most nonvolatile memory technologies is much shorter than the conventional SRAM and DRAM technology. This has limited its usage to only the low levels of a memory hierarchy, e.g., disks, that is far from the CPU. In this paper, we study the use of a new type of nonvolatile memories -- the Phase Change Memory (PCM) as the main memory for a 3D stacked chip. The main challenges we face are the limited PCM endurance, longer access latencies, and higher dynamic power compared to the conventional DRAM technology. We propose techniques to extend the endurance of the PCM to an average of 13 (for MLC PCM cell) to 22 (for SLC PCM) years. We also study the design choices of implementing PCM to achieve the best tradeoff between energy and performance. Our design reduced the total energy of an already low-power DRAM main memory of the same capacity by 65%, and energy-delay2 product by 60%. These results indicate that it is feasible to use PCM technology in place of DRAM in the main memory for better energy efficiency.
international conference on computer aided design | 2009
Ping Zhou; Bo Zhao; Jun Yang; Youtao Zhang
The emerging Spin Torque Transfer memory (STT-RAM) is a promising candidate for future on-chip caches due to STT-RAMs high density, low leakage, long endurance and high access speed. However, one of the major challenges of STT-RAM is its high write current, which is disadvantageous when used as an on-chip cache since the dynamic power generated is too high. In this paper, we propose Early Write Termination (EWT), a novel technique to significantly reduce write energy with no performance penalty. EWT can be implemented with low complexity and low energy overhead. Our evaluation shows that up to 80% of write energy reduction can be achieved through EWT, resulting 33% less total energy consumption, and 34% reduction in ED2. These results indicate that EWT is an effective and practical scheme to improve the energy efficiency of a STT-RAM cache.
international symposium on microarchitecture | 2010
Benjamin C. Lee; Ping Zhou; Jun Yang; Youtao Zhang; Bo Zhao; Engin Ipek; Onur Mutlu; Doug Burger
Phase-change may enable continued scaling of main memories, but PCM has higher access latencies, incurs higher power costs, and wears out more quickly than DRAM. This article discusses how to mitigate these limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable dream alternative for scalable main memories.
high performance computer architecture | 2012
Lei Jiang; Bo Zhao; Youtao Zhang; Jun Yang; Bruce R. Childers
Phase change memory (PCM) recently has emerged as a promising technology to meet the fast growing demand for large capacity memory in modern computer systems. In particular, multi-level cell (MLC) PCM that stores multiple bits in a single cell, offers high density with low per-byte fabrication cost. However, despite many advantages, such as good scalability and low leakage, PCM suffers from exceptionally slow write operations, which makes it challenging to be integrated in the memory hierarchy. In this paper, we propose architectural innovations to improve the access time of MLC PCM. Due to cell process variation, composition fluctuation and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which suffers from large write access latency. To address this issue, we propose write truncation (WT) to reduce the number of write iterations with the assistance of an extra error correction code (ECC). We also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in SLC form, FS improves read latency as well. Our experimental results show that WT and FS improve the effective write/read latency by 57%/28% respectively, and achieve 26% performance improvement over the state of the art.
high-performance computer architecture | 2009
Yi Xu; Yu Du; Bo Zhao; Xiuyi Zhou; Youtao Zhang; Jun Yang
Interconnection plays an important role in performance and power of CMP designs using deep sub-micron technology. The network-on-chip (NoCs) has been proposed as a scalable and high-bandwidth fabric for interconnect design. The advent of the 3D technology has provided further opportunity to reduce on-chip communication delay. However, the design of the 3D NoC topologies has important distinctions from 2D NoCs or off-chip interconnection networks. First, current 3D stacking technology allows only vertical inter-layer links. Hence, there cannot be direct connections between arbitrary nodes in different layers — the vertical connection topology are essentially fixed. Second, the 3D NoC is highly constrained by the complexity and power of routers and links. Hence, low-radix routers are preferred over high-radix routers for lower power and better heat dissipation. This implies long network latency due to high hop counts in network paths. In this paper, we design a low-diameter 3D network using low-radix routers. Our topology leverages long wires to connect remote intra-layer nodes. We take advantage of the start-of-the-art one-hop vertical communication design and utilize lateral long wires to shorten network paths. Effectively, we implement a small-to-medium sized clique network in different layers of a 3D chip. The resulting topology generates a diameter of 3-hop only network, using routers of the same radix as 3D mesh routers. The proposed network shows up to 29% of network latency reduction, up to 10% throughput improvement, and up to 24% energy reduction, when compared to a 3D mesh network.
design automation conference | 2012
Lei Jiang; Bo Zhao; Youtao Zhang; Jun Yang
MLC STT-MRAM (Multi-level Cell Spin-Transfer Torque Magnetic RAM), an emerging non-volatile memory technology, has become a promising candidate to construct L2 caches for high-end embedded processors. However, the long write latency limits the effectiveness of MLC STT-MRAM based L2 caches. In this paper, we address this limitation with two novel designs: Line Pairing (LP) and Line Swapping (LS). LP forms fast cachelines by re-organizing MLC soft bits which are faster to write. LS dynamically stores frequently written data into these fast cachelines. Our experimental results show that LP and LS improve system performance by 15% and reduce energy consumption by 21%.
high-performance computer architecture | 2010
Yi Xu; Bo Zhao; Youtao Zhang; Jun Yang
Technology scaling has led to the integration of many cores into a single chip. As a result, on-chip interconnection networks start to play a more and more important role in determining the performance and power of the entire chip. Packet-switched network-on-chip (NoC) has provided a scalable solution to the communications for tiled multi-core processors. However the virtual-channel (VC) buffers in the NoC consume significant dynamic and leakage power of the system. To improve the energy efficiency of the router design, it is advantageous to use small buffer sizes while still maintaining throughput of the network. This paper proposes two new virtual channel allocation (VA) mechanisms, termed Fixed VC Assignment with Dynamic VC Allocation (FVADA) and Adjustable VC Assignment with Dynamic VC Allocation (AVADA). The idea is that VCs are assigned based on the designated output port of a packet to reduce the Head-of-Line (HoL) blocking. Also, the number of VCs allocated for each output port can be adjusted dynamically. Unlike previous buffer-pool based designs, we only use a small number of VCs to keep the arbitration latency low. Simulation results show that FVADA and AVADA can improve the network throughput by 41% on average, compared to a baseline design with the same buffer size. AVADA can still outperform the baseline even when our buffer size is halved. Moreover, we are able to achieve comparable or better throughput than a previous dynamic VC allocator while reducing its critical path delay by 60%. Our results prove that the proposed VA mechanisms are suitable for low-power, high-throughput, and high-frequency on-chip network designs.
parallel computing | 2015
Yi Xu; Bo Zhao; Youtao Zhang; Jun Yang
Technology scaling has led to the integration of many cores into a single chip. As a result, on-chip interconnection networks start to play a more and more important role in determining the performance and power of the entire chip. Packet-switched network-on-chip (NoC) has provided a scalable solution to the communications for tiled multi-core processors. However the virtual-channel (VC) buffers in the NoC consume significant dynamic and leakage power of the system. To improve the energy efficiency of the router design, it is advantageous to use small buffer sizes while still maintaining throughput of the network. This paper proposes two new virtual channel allocation (VA) mechanisms, termed Fixed VC Assignment with Dynamic VC Allocation (FVADA) and Adjustable VC Assignment with Dynamic VC Allocation (AVADA). The idea is that VCs are assigned based on the designated output port of a packet to reduce the Head-of-Line (HoL) blocking. Also, the number of VCs allocated for each output port can be adjusted dynamically. Unlike previous buffer-pool based designs, we only use a small number of VCs to keep the arbitration latency low. Simulation results show that FVADA and AVADA can improve the network throughput by 41% on average, compared to a baseline design with the same buffer size. AVADA can still outperform the baseline even when our buffer size is halved. Moreover, we are able to achieve comparable or better throughput than a previous dynamic VC allocator while reducing its critical path delay by 60%. Our results prove that the proposed VA mechanisms are suitable for low-power, high-throughput, and high-frequency on-chip network designs.
international symposium on microarchitecture | 2009
Bo Zhao; Yu Du; Youtao Zhang; Jun Yang
Process variations in integrated circuits have significant impact on their performance, leakage and stability. This is particularly evident in large, regular and dense structures such as DRAMs. DRAMs are built using minimized transistors with presumably uniform speed in an organized array structure. Process variation can introduce latency disparity among different memory arrays. With the proliferation of 3D stacking technology, DRAMs become a favorable choice for stacking on top of a multicore processor as a last level cache for large capacity, high bandwidth, and low power. Hence, variations in bank speed creates a unique problem of non-uniform cache accesses in 3D space. In this paper, we investigate cache management techniques for tolerating process variation in a 3D DRAM stacked onto a multicore processor. We modeled the process variation in a 4-layer DRAM memory to characterize the latency variations among different banks. As a result, the notion of fast and slow banks from the cores standpoint is no longer associated with their physical distances with the banks. They are determined by the different bank latencies due to process variation. We develop cache migration schemes that utilizes fast banks while limiting the cost due to migration. Our experiments show that there is a great performance benefit in exploiting fast memory banks through migration. On average, a variation-aware management can improve the performance of a workload over the baseline (where one of the slowest bank speed is assumed for all banks) by 17.8%. We are also only 0.45% away in performance from an ideal memory where no process variation is present.
IEEE Transactions on Computers | 2014
Ping Zhou; Bo Zhao; Jun Yang; Youtao Zhang
Phase Change Memory (PCM) has emerged as a promising candidate for future memories. PCM has high cell density, zero cell leakage, and high stability in deep sub-micron technologies. Although PCM has limited endurance, recent endeavors have shown that its lifetime can be improved by orders of magnitude. However, a major hurdle for PCM is the long write latency and high write power. For this reason, PCM cannot deliver satisfactory memory bandwidth for high-end computing environment such as multi-processing and server systems. In this paper, we develop a non-blocking PCM bank design such that subsequent reads or writes can be carried in parallel with an on-going write. This is effective in removing long blocking time due to serial operations. Moreover, we propose novel memory request scheduling algorithms to exploit intra-bank parallelism brought by our non-blocking hardware. Our non-blocking hardware with scheduling enhancement improves PCM memory throughput by 51% on average. Finally, we propose a fine-grained power budgeting scheme to achieve more throughput improvement under power budgets. Experiments show that our scheduler enhanced with power budgeting scheme can achieve a throughput improvement of 118% on average.