Youtao Zhang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Youtao Zhang is active.

Explore More

Publication

Featured researches published by Youtao Zhang.

international symposium on computer architecture | 2009

A durable and energy efficient main memory using phase change memory technology

Ping Zhou; Bo Zhao; Jun Yang; Youtao Zhang

Using nonvolatile memories in memory hierarchy has been investigated to reduce its energy consumption because nonvolatile memories consume zero leakage power in memory cells. One of the difficulties is, however, that the endurance of most nonvolatile memory technologies is much shorter than the conventional SRAM and DRAM technology. This has limited its usage to only the low levels of a memory hierarchy, e.g., disks, that is far from the CPU. In this paper, we study the use of a new type of nonvolatile memories -- the Phase Change Memory (PCM) as the main memory for a 3D stacked chip. The main challenges we face are the limited PCM endurance, longer access latencies, and higher dynamic power compared to the conventional DRAM technology. We propose techniques to extend the endurance of the PCM to an average of 13 (for MLC PCM cell) to 22 (for SLC PCM) years. We also study the design choices of implementing PCM to achieve the best tradeoff between energy and performance. Our design reduced the total energy of an already low-power DRAM main memory of the same capacity by 65%, and energy-delay2 product by 60%. These results indicate that it is feasible to use PCM technology in place of DRAM in the main memory for better energy efficiency.

international conference on computer aided design | 2009

Energy reduction for STT-RAM using early write termination

Ping Zhou; Bo Zhao; Jun Yang; Youtao Zhang

The emerging Spin Torque Transfer memory (STT-RAM) is a promising candidate for future on-chip caches due to STT-RAMs high density, low leakage, long endurance and high access speed. However, one of the major challenges of STT-RAM is its high write current, which is disadvantageous when used as an on-chip cache since the dynamic power generated is too high. In this paper, we propose Early Write Termination (EWT), a novel technique to significantly reduce write energy with no performance penalty. EWT can be implemented with low complexity and low energy overhead. Our evaluation shows that up to 80% of write energy reduction can be achieved through EWT, resulting 33% less total energy consumption, and 34% reduction in ED2. These results indicate that EWT is an effective and practical scheme to improve the energy efficiency of a STT-RAM cache.

international symposium on microarchitecture | 2010

Phase-Change Technology and the Future of Main Memory

Benjamin C. Lee; Ping Zhou; Jun Yang; Youtao Zhang; Bo Zhao; Engin Ipek; Onur Mutlu; Doug Burger

Phase-change may enable continued scaling of main memories, but PCM has higher access latencies, incurs higher power costs, and wears out more quickly than DRAM. This article discusses how to mitigate these limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable dream alternative for scalable main memories.

international conference on software engineering | 2003

Precise dynamic slicing algorithms

Xiangyu Zhang; Rajiv Gupta; Youtao Zhang

Dynamic slicing algorithms can greatly reduce the debugging effort by focusing the attention of the user on a relevant subset of program statements. In this paper we present the design and evaluation of three precise dynamic slicing algorithms called the full preprocessing (FP), no preprocessing (NP) and limited preprocessing (LP) algorithms. The algorithms differ in the relative timing of constructing the dynamic data dependence graph and its traversal for computing requested dynamic slices. Our experiments show that the LP algorithm is a fast and practical precise slicing algorithm. In fact we show that while precise slices can be orders of magnitude smaller than imprecise dynamic slices, for small number of slicing requests, the LP algorithm is faster than an imprecise dynamic slicing algorithm proposed by Agrawal and Horgan.

architectural support for programming languages and operating systems | 2000

Frequent value locality and value-centric data cache design

Youtao Zhang; Jun Yang; Rajiv Gupta

By studying the behavior of programs in the SPECint95 suite we observed that six out of eight programs exhibit a new kind of value locality, the frequent value locality, according to which a few values appear very frequently in memory locations and are therefore involved in a large fraction of memory accesses. In these six programs ten distinct values occupy over 50% of all memory locations and on an average account for nearly 50% of all memory accesses during program execution. This observation holds for smaller blocks of consecutive memory locations and the set of frequent values remains quite stable over the execution of the program.In the six benchmarks with frequent value locality, on an average 50% of all cache misses occur during the reading or writing of the ten most frequently accessed values. We propose a new data cache structure, the frequent value cache (FVC), which employs a value-centric approach to caching data locations for exploiting the frequent value locality phenomenon. FVC is a small direct-mapped cache which is dedicated to holding only frequently occurring values. The value-centric nature of FVC enables us to store data in a compressed form where the compression is achieved by encoding the frequent values using a few bits. Moreover this simple compression scheme preserves the random access to data values in a cache line.Our experiments demonstrate that by augmenting a direct mapped cache (DMC) with a direct mapped FVC of size no more than 3 Kbytes we can obtain reductions in miss rates ranging from 1% to 68%. In fact we observed that higher reductions in miss rates can he achieved by augmenting a DMC with a small FVC as opposed to doubling the size of DMC for the 124.m88ksim and 134.perl benchmarks.

international conference on hardware/software codesign and system synthesis | 2011

Emerging non-volatile memories: opportunities and challenges

Chun Jason Xue; Guangyu Sun; Youtao Zhang; Jianhua Yang; Yiran Chen; Hai Li

In recent years, non-volatile memory (NVM) technologies have emerged as candidates for future universal memory. N-VMs generally have advantages such as low leakage power, high density, and fast read spead. At the same time, NVM-s also have disadvantages. For example, NVMs often have asymetric read and write speed and energy cost, which poses new challenges when applying NVMs. This paper contains a collection of four contributions, presenting basic introduction on three emerging NVM technologies, their unique characteristics, potential challenges, and new opportunities that they may bring forward in memory systems.

high performance computer architecture | 2012

Improving write operations in MLC phase change memory

Lei Jiang; Bo Zhao; Youtao Zhang; Jun Yang; Bruce R. Childers

Phase change memory (PCM) recently has emerged as a promising technology to meet the fast growing demand for large capacity memory in modern computer systems. In particular, multi-level cell (MLC) PCM that stores multiple bits in a single cell, offers high density with low per-byte fabrication cost. However, despite many advantages, such as good scalability and low leakage, PCM suffers from exceptionally slow write operations, which makes it challenging to be integrated in the memory hierarchy. In this paper, we propose architectural innovations to improve the access time of MLC PCM. Due to cell process variation, composition fluctuation and the relatively small differences among resistance levels, MLC PCM typically employs an iterative write scheme to achieve precise control, which suffers from large write access latency. To address this issue, we propose write truncation (WT) to reduce the number of write iterations with the assistance of an extra error correction code (ECC). We also propose form switch (FS) to reduce the storage overhead of the ECC. By storing highly compressible lines in SLC form, FS improves read latency as well. Our experimental results show that WT and FS improve the effective write/read latency by 57%/28% respectively, and achieve 26% performance improvement over the state of the art.

international symposium on performance analysis of systems and software | 2008

Dynamic Thermal Management through Task Scheduling

Jun Yang; Xiuyi Zhou; Marek Chrobak; Youtao Zhang; Lingling Jin

The evolution of microprocessors has been hindered by their increasing power consumption and the heat generation speed on-die. High temperature impairs the processors reliability and reduces its lifetime. While hardware level dynamic thermal management (DTM) techniques, such as voltage and frequency scaling, can effectively lower the chip temperature when it surpasses the thermal threshold, they inevitably come at the cost of performance degradation. We propose an OS level technique that performs thermal- aware job scheduling to reduce the number of thermal trespasses. Our scheduler reduces the amount of hardware DTMs and achieves higher performance while keeping the temperature low. Our methods leverage the natural discrepancies in thermal behavior among different workloads, and schedule them to keep the chip temperature below a given budget. We develop a heuristic algorithm based on the observation that there is a difference in the resulting temperature when a hot and a cool job are executed in a different order. To evaluate our scheduling algorithms, we developed a lightweight runtime temperature monitor to enable informed scheduling decisions. We have implemented our scheduling algorithm and the entire temperature monitoring framework in the Linux kernel. Our proposed scheduler can remove 10.5-73.6% of the hardware DTMs in various combinations of workloads in a medium thermal environment. As a result, the CPU throughput was improved by up to 7.6% (4.1% on average) even under a severe thermal environment.

IEEE Transactions on Parallel and Distributed Systems | 2010

Thermal-Aware Task Scheduling for 3D Multicore Processors

Xiuyi Zhou; Jun Yang; Yi Xu; Youtao Zhang; Jianhua Zhao

A rising horizon in chip fabrication is the 3D integration technology. It stacks two or more dies vertically with a dense, high-speed interface to increase the device density and reduce the delay of interconnects significantly across the dies. However, a major challenge in 3D technology is the increased power density, which gives rise to the concern of heat dissipation within the processor. High temperatures trigger voltage and frequency throttlings in hardware, which degrade the chip performance. Moreover, high temperatures impair the processors reliability and reduce its lifetime. To alleviate this problem, we propose in this paper an OS-level scheduling algorithm that performs thermal-aware task scheduling on a 3D chip. Our algorithm leverages the inherent thermal variations within and across different tasks, and schedules them to keep the chip temperature low. We observed that vertically adjacent dies have strong thermal correlations and the scheduler should consider them jointly. Compared with other intuitive algorithms such as a Random and a Round-Robin algorithm, our proposed algorithm brings lower peak temperature and average temperature on-chip. Moreover, it can remove, on average, 46 percent of thermal emergency time and result in 5.11 percent (4.78 percent) performance improvement over the base case on thermally homogeneous (heterogeneous) floorplans.

high-performance computer architecture | 2009

A low-radix and low-diameter 3D interconnection network design

Yi Xu; Yu Du; Bo Zhao; Xiuyi Zhou; Youtao Zhang; Jun Yang

Interconnection plays an important role in performance and power of CMP designs using deep sub-micron technology. The network-on-chip (NoCs) has been proposed as a scalable and high-bandwidth fabric for interconnect design. The advent of the 3D technology has provided further opportunity to reduce on-chip communication delay. However, the design of the 3D NoC topologies has important distinctions from 2D NoCs or off-chip interconnection networks. First, current 3D stacking technology allows only vertical inter-layer links. Hence, there cannot be direct connections between arbitrary nodes in different layers — the vertical connection topology are essentially fixed. Second, the 3D NoC is highly constrained by the complexity and power of routers and links. Hence, low-radix routers are preferred over high-radix routers for lower power and better heat dissipation. This implies long network latency due to high hop counts in network paths. In this paper, we design a low-diameter 3D network using low-radix routers. Our topology leverages long wires to connect remote intra-layer nodes. We take advantage of the start-of-the-art one-hop vertical communication design and utilize lateral long wires to shorten network paths. Effectively, we implement a small-to-medium sized clique network in different layers of a 3D chip. The resulting topology generates a diameter of 3-hop only network, using routers of the same radix as 3D mesh routers. The proposed network shows up to 29% of network latency reduction, up to 10% throughput improvement, and up to 24% energy reduction, when compared to a 3D mesh network.

Explore More