Publication


Featured research published by Zhiyi Yu.


IEEE Journal of Solid-State Circuits | 2008

AsAP: An Asynchronous Array of Simple Processors

Zhiyi Yu; Michael J. Meeuwsen; Ryan W. Apperson; Omar Sattari; Michael Lai; Jeremy W. Webb; Eric W. Work; Dean Truong; Tinoosh Mohsenin; Bevan M. Baas

An array of simple programmable processors is implemented in 0.18 μm CMOS and contains 36 asynchronously clocked independent processors. Each processor occupies 0.66 mm² and is fully functional at a clock rate of 520-540 MHz at 1.8 V and over 600 MHz at 2.0 V. Processors dissipate an average of 32 mW under typical conditions at 1.8 V and 475 MHz, and 2.4 mW at 0.9 V and 116 MHz while executing applications such as a JPEG encoder core and a fully compliant IEEE 802.11 a/g wireless LAN baseband transmitter.


Symposium on VLSI Circuits | 2008

A 167-processor 65 nm computational platform with per-processor dynamic supply voltage and dynamic clock frequency scaling

Dean Truong; Wayne Cheng; Tinoosh Mohsenin; Zhiyi Yu; Toney Jacobson; Gouri Landge; Michael J. Meeuwsen; Christine Watnik; Paul Mejia; Anh T. Tran; Jeremy W. Webb; Eric W. Work; Zhibin Xiao; Bevan M. Baas

A 167-processor 65 nm computational platform well suited for DSP, communication, and multimedia workloads contains 164 programmable processors with dynamic supply voltage and dynamic clock frequency circuits, three algorithm-specific processors, and three 16 KB shared memories, all clocked by independent oscillators and connected by configurable long-distance-capable links.


International Solid-State Circuits Conference | 2006

An asynchronous array of simple processors for DSP applications

Zhiyi Yu; Michael J. Meeuwsen; Ryan W. Apperson; Omar Sattari; Michael A. Lai; Jeremy W. Webb; Eric W. Work; Tinoosh Mohsenin; M. Singh; Bevan M. Baas

An array of simple programmable processors designed for DSP applications is implemented in 0.18 μm CMOS and contains 36 asynchronously clocked independent processors. The processors operate at 475 MHz, and each processor has a maximum power of 144 mW at 1.8 V and occupies 0.66 mm².


IEEE Transactions on Very Large Scale Integration Systems | 2007

A Scalable Dual-Clock FIFO for Data Transfers Between Arbitrary and Haltable Clock Domains

Ryan W. Apperson; Zhiyi Yu; Michael J. Meeuwsen; Tinoosh Mohsenin; Bevan M. Baas

A robust, scalable, and power-efficient dual-clock first-in first-out (FIFO) architecture, useful for transferring data between modules operating in different clock domains, is presented. The architecture supports correct operation in applications where multiple clock cycles of latency exist between the data producer, FIFO, and the data consumer; and with arbitrary clock frequency changes, halting, and restarting in either or both clock domains. The architecture is demonstrated in both a 0.18-μm CMOS full-custom design and a 0.18-μm CMOS standard cell design used in a globally asynchronous locally synchronous array processor. It achieves 580-MHz operation and 10.3-mW power dissipation while performing simultaneous FIFO read and write operations at 1.8 V.


International Solid-State Circuits Conference | 2013

A 65nm 39GOPS/W 24-core processor with 11Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array

Peng Ou; Jiajie Zhang; Heng Quan; Yi Li; Maofei He; Zheng Yu; Xueqiu Yu; Shile Cui; Jie Feng; Shikai Zhu; Jie Lin; Ming'e Jing; Xiaoyang Zeng; Zhiyi Yu

With the increasing complexity and variety of applications, programmable multi-core processors are drawing attention due to their high flexibility and low implementation cost, yet their performance and energy efficiency still cannot fulfill the demands of many compute-intensive applications. This paper describes a high-performance energy-efficient 24-core processor for multimedia and communication applications, with the following key features: (1) a packet-controlled circuit-switched double-layer network-on-chip (NoC) which provides 11 Tb/s/W energy efficiency with 435 Gb/s bisection bandwidth; (2) a cluster-shared NoC-connected heterogeneous reconfigurable execution array, which can improve the performance of frequently used computations in multimedia and communication applications by over 6×; (3) memory hierarchy improvements, including a multi-page foreground and background register file, and memory splitting and sharing. The processor, implemented in TSMC 65 nm CMOS LP and occupying 18.8 mm² (Fig. 3.6.7), operates at 850 MHz at 1.2 V, with 523 mW power dissipation and 39 GOPS/W (26 pJ/operation) energy efficiency, which is 1.75× better than our former 16-core processor [3].
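The two energy figures quoted in this abstract, 39 GOPS/W and 26 pJ/operation, are the same quantity in different units. A minimal Python sketch of the conversion follows. The function names are hypothetical and are not taken from the paper. The implied throughput assumes that the efficiency and power figures were measured at the same operating point, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract above.
EFFICIENCY_GOPS_PER_W = 39.0  # 39 GOPS/W
POWER_W = 0.523               # 523 mW


def energy_per_op_pj(gops_per_watt: float) -> float:
    """Convert an efficiency in GOPS/W to energy per operation in picojoules."""
    # 1 GOPS/W is 1e9 operations per joule, so joules per operation is
    # 1 / (gops_per_watt * 1e9). Multiplying by 1e12 converts joules to pJ.
    return 1e12 / (gops_per_watt * 1e9)


def throughput_gops(gops_per_watt: float, power_w: float) -> float:
    """Implied throughput in GOPS, ASSUMING both figures share one operating point."""
    return gops_per_watt * power_w


if __name__ == "__main__":
    print(f"{energy_per_op_pj(EFFICIENCY_GOPS_PER_W):.1f} pJ/operation")
    print(f"{throughput_gops(EFFICIENCY_GOPS_PER_W, POWER_W):.1f} GOPS (implied)")
```

The conversion gives roughly 25.6 pJ per operation, which rounds to the 26 pJ/operation stated in the abstract.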


International Solid-State Circuits Conference | 2012

An 800MHz 320mW 16-core processor with message-passing and shared-memory inter-core communication mechanisms

Zhiyi Yu; Kaidi You; Ruijin Xiao; Heng Quan; Peng Ou; Yan Ying; Haofan Yang; Ming'e Jing; Xiaoyang Zeng

Almost all multicore processors use a shared-memory architecture due to its simple programming model. Recently, however, the message-passing mechanism is also drawing attention due to its potentially better scalability. In this work, we demonstrate that a hybrid communication mechanism supporting both message passing and shared memory can provide both higher performance and energy efficiency. This 16-core processor has 3 key features: (1) A cluster-based hierarchical architecture supporting both shared-memory and message-passing communication. (2) A cache-free memory hierarchy with an extended register file, small private memory and moderate shared memory to avoid complex cache coherence issues and achieve high energy efficiency by keeping data accesses local. (3) A hardware-aided mailbox mechanism to accelerate the synchronization procedure between different processor nodes. With these techniques, our multicore processor can provide high performance for many applications. Chip test results show that its maximum clock frequency is 800MHz and typical power consumption is 320mW, when running basic applications with clock gating at 1.2V at room temperature.


International Symposium on Microarchitecture | 2007

AsAP: A Fine-Grained Many-Core Platform for DSP Applications

Bevan M. Baas; Zhiyi Yu; Michael J. Meeuwsen; Omar Sattari; Ryan W. Apperson; Eric W. Work; Jeremy W. Webb; Michael A. Lai; Tinoosh Mohsenin; Dean Truong; Jason Cheung

Many emerging and future applications require significant levels of complex digital signal processing and operate within limited power budgets. Moreover, dramatically rising VLSI fabrication and design costs make programmable and reconfigurable solutions increasingly attractive. The AsAP project addresses these challenges with a chip multiprocessor composed of simple processors with small memories, achieving high energy efficiency and throughput in a small chip area.


IEEE Transactions on Very Large Scale Integration Systems | 2009

High Performance, Energy Efficiency, and Scalability With GALS Chip Multiprocessors

Zhiyi Yu; Bevan M. Baas

Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to use clock and voltage scaling jointly in system submodules to achieve high energy efficiencies, and can also result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show it has a less than 1% performance (throughput) reduction on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides an approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported greater than 10% performance (throughput) reduction and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general-purpose applications.


International Conference on Computer Design | 2006

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles

Zhiyi Yu; Bevan M. Baas

This paper investigates implementation techniques for tile-based chip multiprocessors with Globally Asynchronous Locally Synchronous (GALS) clocking styles. These architectures can simplify the physical design flow since they allow focusing on a single processor when designing an entire chip. However, they also introduce challenges to maintain system robustness and scalability. We propose a physical design flow for these architectures, investigate timing issues for robust implementations, and propose methods to take full advantage of their potential scalability. As a design example, we present data from a recently implemented single-chip 6 x 6 tile-based GALS processing array.


IEEE Transactions on Very Large Scale Integration Systems | 2010

A Low-Area Multi-Link Interconnect Architecture for GALS Chip Multiprocessors

Zhiyi Yu; Bevan M. Baas

A new inter-processor communication architecture for chip multiprocessors is proposed which has a low area cost, flexible routing capability, and supports globally asynchronous locally synchronous (GALS) clocking styles. To achieve a low area cost, the proposed statically-configurable asymmetric architecture assigns large buffer resources to only the nearest neighbor interconnect and much smaller buffer resources for long distance interconnect. To maintain flexible routing capability, each neighboring processor pair has multiple connecting links. The architecture supports long distance communication in GALS systems by transferring the source clock with the data signals along the entire path for write synchronization. Compared to a traditional dynamically-configurable interconnect architecture with symmetric buffer allocation and single links between neighboring processor pairs, this implementation has approximately two times smaller communication circuitry area with a similar routing capability. Area and speed estimates are obtained with the physical design of seven chips in 0.18-μm CMOS.

Collaboration


Dive into Zhiyi Yu's collaborations.

Top Co-Authors

Bevan M. Baas

University of California
