
Publication


Featured research published by Jia Zhan.


Design Automation Conference | 2014

NoC-Sprinting: Interconnect for Fine-Grained Sprinting in the Dark Silicon Era

Jia Zhan; Yuan Xie; Guangyu Sun

The rise of the utilization wall limits the number of transistors that can be powered on in a single chip and results in a large region of dark silicon. While this phenomenon has led to disruptive innovation in computation, little work has been done on Network-on-Chip (NoC) design. The NoC not only directly influences overall multi-core performance, but also consumes a significant portion of the total chip power. In this paper, we first reveal the challenges and opportunities of designing power-efficient NoCs in the dark silicon era. We then propose NoC-Sprinting: based on workload characteristics, it explores fine-grained sprinting that allows a chip to flexibly activate dark cores for instantaneous throughput improvement. In addition, it investigates topological/routing support and thermal-aware floorplanning for the sprinting process. Moreover, it builds an efficient network power-management scheme that can mitigate the dark silicon problems. Experiments on performance, power, and thermal analysis show that NoC-Sprinting can provide tremendous speedup, increase sprinting duration, and meanwhile reduce chip power significantly.


Design Automation Conference | 2013

Designing energy-efficient NoC for real-time embedded systems through slack optimization

Jia Zhan; Nikolay Stoimenov; Jin Ouyang; Lothar Thiele; Vijaykrishnan Narayanan; Yuan Xie

Hard real-time embedded systems impose a strict latency requirement on interconnection subsystems. In the case of network-on-chip (NoC), this means each packet of a traffic stream has to be delivered within a given time interval. In addition, as NoC complexity grows, the network consumes a significant portion of the total chip power, boosting the power footprint of such chips. In this work, we propose a methodology to minimize the energy consumption of the NoC without violating the prespecified latency deadlines of real-time applications. First, we develop a formal approach based on network calculus to obtain the worst-case delay bound of all packets, from which we derive a safe estimate of the number of cycles that a packet can be further delayed in the network without violating its deadline: the worst-case slack. With this information, we then develop an optimization algorithm that trades the slacks for lower NoC energy. Our algorithm recognizes the distribution of slacks across different traffic streams, and assigns different voltages and frequencies to different routers to achieve NoC energy efficiency while meeting the deadlines of all packets.
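The slack-trading step described in this abstract can be sketched as a simple greedy loop: repeatedly step a router down to a slower voltage/frequency level whenever every traffic stream crossing it still has enough worst-case slack to absorb the extra per-hop delay. This is a minimal illustration, not the paper's algorithm; the router names, V/F levels, and slack values below are hypothetical.

```python
def assign_vf(routers, streams, vf_levels):
    """Greedily lower router V/F levels while every stream keeps a
    nonnegative worst-case slack.

    routers:   dict router_name -> current level index (0 = fastest)
    streams:   list of (route, slack_cycles), route = list of router names
    vf_levels: list of (voltage, per_hop_delay_cycles), fastest first
    """
    slack = {i: s for i, (_, s) in enumerate(streams)}
    changed = True
    while changed:
        changed = False
        for r in routers:
            nxt = routers[r] + 1
            if nxt >= len(vf_levels):
                continue
            # Extra per-hop delay if router r steps down one level.
            extra = vf_levels[nxt][1] - vf_levels[routers[r]][1]
            affected = [i for i, (route, _) in enumerate(streams) if r in route]
            # Step down only if every crossing stream can absorb the delay.
            if all(slack[i] >= extra for i in affected):
                routers[r] = nxt
                for i in affected:
                    slack[i] -= extra
                changed = True
    return routers

levels = [(1.0, 1.0), (0.9, 2.0), (0.8, 4.0)]  # hypothetical (V, cycles/hop)
streams = [(["R0", "R1"], 3), (["R1"], 1)]     # (route, worst-case slack)
print(assign_vf({"R0": 0, "R1": 0}, streams, levels))  # -> {'R0': 1, 'R1': 1}
```

Here both routers step down one level; a second step would exhaust the tighter stream's slack, so the loop stops.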


Design Automation Conference | 2015

DimNoC: a dim silicon approach towards power-efficient on-chip network

Jia Zhan; Jin Ouyang; Fen Ge; Jishen Zhao; Yuan Xie

The diminishing momentum of Dennard scaling leads to the ever-increasing power density of integrated circuits, and a decreasing portion of transistors on a chip can be switched on simultaneously, a problem recently identified and known as dark silicon. There has been innovative work addressing the "dark silicon" problem in the fields of power-efficient core and cache design. However, dark silicon challenges in Network-on-Chip (NoC) design are largely unexplored. To address this issue, we propose DimNoC, a "dim silicon" approach, which leverages drowsy SRAM and STT-RAM technologies to replace pure SRAM-based NoC buffers. Specifically, we propose two novel hybrid buffer architectures: 1) a Hierarchical Buffer (HB) architecture, which divides the input buffers into a hierarchy of levels with different memory technologies operating at various power states; and 2) a Banked Buffer (BB) architecture, which organizes drowsy SRAM and STT-RAM into separate banks in order to hide the long write latency of STT-RAM. Our experiments show that the proposed DimNoC can achieve 30.9% network energy saving, 20.3% energy-delay product (EDP) reduction, and 7.6% router area decrease compared with a baseline SRAM-based NoC design.
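The banked-buffer (BB) idea, overlapping slow STT-RAM writes with fast SRAM writes, can be illustrated with a toy timeline model. The latencies below are placeholder assumptions, not the paper's device parameters.

```python
SRAM_WRITE = 1  # cycles per flit write (placeholder)
STT_WRITE = 3   # cycles per flit write (placeholder)

def last_write_done(n_flits, interleave=True):
    """Cycle at which the last of n back-to-back flit writes completes.
    Flits arrive one per cycle; interleaving alternates SRAM/STT-RAM banks."""
    busy = {"sram": 0, "stt": 0}  # cycle at which each bank becomes free
    finish = 0
    for i in range(n_flits):
        bank = ("sram" if i % 2 == 0 else "stt") if interleave else "stt"
        lat = SRAM_WRITE if bank == "sram" else STT_WRITE
        start = max(i, busy[bank])  # flit i arrives at cycle i
        busy[bank] = start + lat
        finish = max(finish, busy[bank])
    return finish

# Interleaving overlaps each long STT-RAM write with a fast SRAM write.
print(last_write_done(4, interleave=True))   # -> 7
print(last_write_done(4, interleave=False))  # -> 12
```

Even in this crude model, alternating banks hides much of the STT-RAM write latency behind SRAM accesses.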


Design Automation Conference | 2015

Core vs. uncore: the heart of darkness

Hsiang-Yun Cheng; Jia Zhan; Jishen Zhao; Yuan Xie; Jack Sampson; Mary Jane Irwin

Even though Moore's Law continues to provide increasing transistor counts, the rise of the utilization wall limits the number of transistors that can be powered on and results in a large region of dark silicon. Prior studies have proposed energy-efficient core designs to address the "dark silicon" problem. Nevertheless, research on addressing dark silicon challenges in uncore components, such as the shared cache and on-chip interconnect, which contribute significant on-chip power consumption, is largely unexplored. In this paper, we first illustrate that the power consumption of uncore components cannot be ignored if the chip's power constraint is to be met. We then introduce techniques to design energy-efficient uncore components, including the shared cache and on-chip interconnect. The design challenges and opportunities of exploiting 3D techniques and non-volatile memory (NVM) in dark-silicon-aware architecture are also discussed.


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2014

Optimizing the NoC Slack Through Voltage and Frequency Scaling in Hard Real-Time Embedded Systems

Jia Zhan; Nikolay Stoimenov; Jin Ouyang; Lothar Thiele; Vijaykrishnan Narayanan; Yuan Xie

Hard real-time embedded systems impose a strict latency requirement on interconnection subsystems. In the case of network-on-chip (NoC), this means each packet of a traffic stream has to be delivered within a given time interval. In addition, as NoC complexity grows, the network consumes a significant portion of the total chip power, boosting the power footprint of such chips. In this paper, we propose a methodology to minimize the energy consumption of the NoC without violating the prespecified latency deadlines of real-time applications. First, we develop a formal approach based on network calculus to obtain the worst-case delay bound of all packets, from which we derive a safe estimate of the number of cycles that a packet can be further delayed in the network without violating its deadline: the worst-case slack. With this information, we then develop an optimization algorithm that trades the slacks for lower NoC energy. Our algorithm recognizes the distribution of slacks across different traffic streams, and assigns different voltages and frequencies to different routers to achieve NoC energy efficiency while meeting the deadlines of all packets. Furthermore, we design a feedback-control strategy to enable dynamic frequency and voltage scaling of the network routers in conjunction with the energy optimization algorithm. It can flexibly improve the energy efficiency of the overall network in response to sporadic traffic patterns at runtime.


Asia and South Pacific Design Automation Conference | 2014

NoΔ: Leveraging delta compression for end-to-end memory access in NoC based multicores

Jia Zhan; Matthew Poremba; Yi Xu; Yuan Xie

As the number of on-chip processing elements increases, the interconnection backbone bears bursty traffic from memory and cache accesses. In this paper, we propose a compression technique called NoΔ, which leverages delta compression to compress network traffic. Specifically, it conducts data encoding prior to packet injection and decoding before ejection in the network interface. The key idea of NoΔ is to store a data packet in the Network-on-Chip as a common base value plus an array of relative differences (Δ). It can improve the overall network performance and achieve energy savings because of the decreased network load. Moreover, this scheme does not require modifications of the cache storage design and can be seamlessly integrated with any optimization techniques for the on-chip interconnect. Our experiments reveal that the proposed NoΔ incurs negligible hardware overhead and outperforms state-of-the-art zero-content compression and frequent-value compression.
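The base-plus-delta encoding at the heart of NoΔ can be sketched in a few lines. This is a simplified software model (fixed 32-bit words, one base per packet, first word as base); the hardware encoder details are in the paper.

```python
def delta_encode(words):
    """Encode a packet's words as a common base plus relative differences."""
    base = words[0]
    return base, [w - base for w in words]

def delta_decode(base, deltas):
    """Reconstruct the original words at the network interface."""
    return [base + d for d in deltas]

def compressed_bits(base, deltas, word_bits=32):
    """Packet size if each delta is stored with just enough bits plus a sign bit."""
    width = max((d.bit_length() + 1 for d in deltas), default=1)
    return word_bits + len(deltas) * width

packet = [0x1000, 0x1004, 0x1008, 0x100C]  # pointer-like values compress well
base, deltas = delta_encode(packet)        # deltas: [0, 4, 8, 12]
assert delta_decode(base, deltas) == packet
print(compressed_bits(base, deltas))       # -> 52 (vs. 128 bits uncompressed)
```

Compression wins whenever the values in a packet cluster around a common base, which is typical of addresses and small integers.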


International Symposium on Microarchitecture | 2016

A unified memory network architecture for in-memory computing in commodity servers

Jia Zhan; Itir Akgun; Jishen Zhao; Al Davis; Paolo Faraboschi; Yuangang Wang; Yuan Xie

In-memory computing is emerging as a promising paradigm in commodity servers to accelerate data-intensive processing by striving to keep the entire dataset in DRAM. To address the tremendous pressure on the main memory system, discrete memory modules can be networked together to form a memory pool, enabled by recent trends towards richer memory interfaces (e.g., Hybrid Memory Cubes, or HMCs). Such an inter-memory network provides a scalable fabric to expand memory capacity, but still suffers from long multi-hop latency, limited bandwidth, and high power consumption: problems that will only be exacerbated as the gap between interconnect and transistor performance grows. Moreover, inside each memory module, an intra-memory network (NoC) is typically employed to connect different memory partitions. Without careful design, back-pressure inside the memory modules can propagate to the inter-memory network and cause a performance bottleneck. To address these problems, we propose co-optimization of the intra- and inter-memory networks. First, we re-organize the intra-memory network structure, and provide a smart I/O interface to reuse the intra-memory NoC as the network switches for inter-memory communication, thus forming a unified memory network. Based on this architecture, we further optimize the inter-memory network for both higher performance and lower energy, including a distance-aware selective compression scheme to drastically reduce communication burden, and a light-weight power-gating algorithm to turn off under-utilized links while guaranteeing a connected graph and deadlock-free routing. We develop an event-driven simulator to model our proposed architectures. Experimental results based on both synthetic traffic and real big-data workloads show that our unified memory network architecture can achieve 75.1% average memory access latency reduction and 22.1% total memory energy saving.


International Conference on Computer Design | 2016

Scalable memory fabric for silicon interposer-based multi-core systems

Itir Akgun; Jia Zhan; Yuangang Wang; Yuan Xie

Three-dimensional (3D) integration is considered a solution to overcome the capacity, bandwidth, and performance limitations of memories. However, due to thermal challenges and cost issues, industry has embraced 2.5D implementations for integrating die-stacked memories with large-scale designs, enabled by silicon interposer technology that integrates processors and multiple modules of 3D-stacked memories in the same package. Previous work has adopted Network-on-Chip (NoC) concepts for the communication fabric of 3D designs, but the design of a scalable processor-memory interconnect for 2.5D integration remains elusive. Therefore, in this work, we first explore different network topologies for integrating CPUs and memories in a silicon interposer-based multi-core system and reveal that simple point-to-point connections cannot reach the full potential of the memory performance due to bandwidth limitations, especially as more and more memory modules are needed to enable emerging applications with high memory capacity and bandwidth demand, such as in-memory computing. To overcome this scaling problem, we propose a memory network design that directly connects all the memory modules, utilizing the existing routing resources of silicon interposers in 2.5D designs. Observing the unique network traffic in our design, we present a design space exploration that evaluates network topologies and routing algorithms, taking process node and interposer technology design decisions into account. We implement an event-driven simulator to evaluate our proposed memory network in silicon interposer (MemNiSI) design with synthetic traffic as well as real in-memory computing workloads. Our experimental results show that, compared to baseline designs, the MemNiSI topology reduces the average packet latency by up to 15.3%, and the Choose Fastest Path (CFP) algorithm further reduces it by up to 8.0%. Our scheme can utilize the potential of integrated stacked memory effectively while providing better scalability and infrastructure for large-scale silicon interposer-based 2.5D designs.
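The abstract does not spell out the Choose Fastest Path (CFP) algorithm; a plausible reading, assumed here, is that it selects the route with the lowest estimated end-to-end latency among candidates. Below is a minimal sketch using Dijkstra's algorithm over per-link latency estimates; the module names and latency values are hypothetical.

```python
import heapq

def fastest_path(link_latency, src, dst):
    """Return (total_latency, path) for the lowest-latency route.
    link_latency: dict (node_a, node_b) -> estimated latency (bidirectional)."""
    adj = {}
    for (a, b), lat in link_latency.items():
        adj.setdefault(a, []).append((b, lat))
        adj.setdefault(b, []).append((a, lat))
    pq = [(0, src, [src])]   # (latency so far, node, path taken)
    best = {}
    while pq:
        d, node, path = heapq.heappop(pq)
        if node == dst:
            return d, path
        if node in best and best[node] <= d:
            continue
        best[node] = d
        for nxt, lat in adj.get(node, []):
            heapq.heappush(pq, (d + lat, nxt, path + [nxt]))
    return float("inf"), []

lat = {("M0", "M1"): 2, ("M1", "M2"): 2, ("M0", "M2"): 5}
print(fastest_path(lat, "M0", "M2"))  # -> (4, ['M0', 'M1', 'M2'])
```

Note that the two-hop route wins over the direct link because its total estimated latency is lower, which is the behavior a fastest-path policy would exhibit.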


IEEE Transactions on Very Large Scale Integration Systems | 2016

Hybrid Drowsy SRAM and STT-RAM Buffer Designs for Dark-Silicon-Aware NoC

Jia Zhan; Jin Ouyang; Fen Ge; Jishen Zhao; Yuan Xie

The breakdown of Dennard scaling prevents us from powering all transistors simultaneously, leaving a large fraction of dark silicon. This crisis has led to innovative work on power-efficient core and memory architecture designs. However, research on addressing dark silicon challenges in the network-on-chip (NoC), a major contributor to total chip power consumption, is largely unexplored. In this paper, we comprehensively examine the network power consumers and the drawbacks of conventional power-gating techniques. To overcome the dark silicon issue from the NoC's perspective, we propose DimNoC, a dim silicon scheme, which leverages recent drowsy SRAM designs and spin-transfer torque RAM (STT-RAM) technology to replace pure SRAM-based NoC buffers. In particular, we propose two novel hybrid buffer architectures: 1) a hierarchical buffer architecture, which divides the input buffers into a set of levels with different power states, and 2) a banked buffer architecture, which organizes the drowsy SRAM and the STT-RAM in different banks and accesses them in an interleaved fashion to hide the long write latency of STT-RAM. In addition, our hybrid buffer design enables a NoC data retention mechanism by storing packets in drowsy SRAM and nonvolatile STT-RAM in a lossless manner. Combined with flow control schemes, the NoC data retention mechanism can improve network performance and power simultaneously. Our experiments over real workloads show that DimNoC can achieve 30.9% network energy saving, 20.3% energy-delay product reduction, and 7.6% router area reduction compared with a pure SRAM-based NoC design.


IEEE International 3D Systems Integration Conference | 2014

Designing vertical bandwidth reconfigurable 3D NoCs for many core systems

Qiaosha Zou; Jia Zhan; Fen Ge; Matt Poremba; Yuan Xie

As the number of processing elements increases in a single chip, the interconnect backbone becomes more and more stressed when serving frequent memory and cache accesses. Network-on-Chip (NoC) has emerged as a potential solution to provide a flexible and scalable interconnect in a planar platform. In the meantime, three-dimensional (3D) integration technology pushes circuit design beyond Moore's Law and provides short vertical connections between different layers. As a result, an innovative solution that combines 3D integration and NoC design can further enhance system performance. However, due to unpredictable workload characteristics, the NoC may suffer from intermittent congestion and channel overflows, especially when the network bandwidth is limited by the area and energy budget. In this work, we explore the performance bottlenecks in 3D NoCs, and then leverage redundant TSVs, which are conventionally used for fault tolerance only, as vertical links to provide additional channel bandwidth for instant throughput improvement. Moreover, these shared redundant links can be dynamically allocated to stressed routers for congestion alleviation. Experimental results show that our proposed NoC design can provide up to 40% performance improvement, with less than 1.5% area overhead.
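The dynamic allocation of shared redundant TSVs can be sketched as a greedy loop that repeatedly grants a spare vertical link to the currently most congested router. The occupancy numbers and the halving heuristic are illustrative assumptions, not the paper's allocation policy.

```python
def allocate_spares(occupancy, n_spares):
    """Grant each spare vertical link to the currently most congested router.

    occupancy: dict router -> buffer occupancy (congestion estimate)
    n_spares:  number of shared redundant TSV links available
    """
    grants = {r: 0 for r in occupancy}
    load = dict(occupancy)
    for _ in range(n_spares):
        hottest = max(load, key=load.get)
        grants[hottest] += 1
        load[hottest] /= 2  # assume extra bandwidth roughly halves its pressure
    return grants

# Two spares: the first relieves router A, after which B becomes the hotspot.
print(allocate_spares({"A": 8, "B": 5, "C": 2}, 2))  # -> {'A': 1, 'B': 1, 'C': 0}
```

Re-estimating load after each grant is what lets the spares spread across hotspots instead of piling onto one router.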

Collaboration


Dive into Jia Zhan's collaborations.

Top Co-Authors

Yuan Xie
University of California

Jishen Zhao
University of California

Fen Ge
Nanjing University of Aeronautics and Astronautics

Itir Akgun
University of California