Shouyi Yin | Researchain

Publication

Featured research published by Shouyi Yin.

Design Automation Conference | 2013

Polyhedral model based mapping optimization of loop nests for CGRAs

Dajiang Liu; Shouyi Yin; Leibo Liu; Shaojun Wei

The coarse-grained reconfigurable architecture (CGRA) is a promising platform that provides both high performance and high power efficiency. The compute-intensive portions of an application (e.g. loops) are often mapped onto a CGRA for acceleration. To optimize the mapping of loop nests to CGRAs, this paper makes two contributions: i) establishing a precise CGRA performance model and formulating loop nest mapping as a nonlinear optimization problem based on the polyhedral model, and ii) deriving an efficient heuristic loop transformation and mapping algorithm (PolyMAP) to improve mapping performance. Experimental results on most kernels of PolyBench and on real-life applications show that the proposed approach improves kernel performance by 21% on average compared with EPIMap, one of the best existing mapping algorithms. The runtime complexity of PolyMAP is also acceptable.

International Symposium on Circuits and Systems | 2010

A reconfigurable multi-processor SoC for media applications

Min Zhu; Leibo Liu; Shouyi Yin; Yansheng Wang; Wenjie Wang; Shaojun Wei

This paper proposes a reconfigurable multi-processor SoC for media applications called REMUS (REconfigurable Multi-media System), which consists of 512 processing engines and two ARM processors. The processing engines are divided into two dynamic configuration groups, which can be easily tailored and extended. The processing engines, data buffering interfaces (DBIs) and context interfaces form a high-throughput computing system with thread parallelism, algorithm parallelism and data parallelism. Different algorithms can be mapped at the same time. REMUS is suitable for applications such as media decoding and baseband processing. Simulation results show that, at 200 MHz, REMUS can decode H.264 high-profile streams at 1920×1088 @ 30 fps in real time.
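The decoding target quoted above implies a cycle budget that is easy to work out. The sketch below is back-of-the-envelope arithmetic only: it uses the resolution, frame rate and clock from the abstract plus the standard 16×16 H.264 macroblock size, and the function names are illustrative rather than anything from the paper.

```python
# Clock-cycle budget per macroblock implied by the quoted figures.
WIDTH, HEIGHT, FPS = 1920, 1088, 30   # resolution and frame rate from the abstract
CLOCK_HZ = 200_000_000                # 200 MHz from the abstract
MB_SIZE = 16                          # H.264 macroblocks cover 16x16 luma samples


def macroblocks_per_frame(width, height, mb_size=MB_SIZE):
    """Number of whole macroblocks in one frame."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(width, height, fps, clock_hz):
    """Clock cycles available per macroblock for real-time decoding."""
    mbs_per_second = macroblocks_per_frame(width, height) * fps
    return clock_hz / mbs_per_second


mbs = macroblocks_per_frame(WIDTH, HEIGHT)
budget = cycles_per_macroblock(WIDTH, HEIGHT, FPS, CLOCK_HZ)
print(f"{mbs} macroblocks/frame, about {budget:.0f} cycles per macroblock")
```

The budget counts clock cycles of the whole chip, not of a single processing engine, so work done by many engines in parallel within those cycles is not captured by this figure.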

Custom Integrated Circuits Conference | 2013

An energy-efficient coarse-grained dynamically reconfigurable fabric for multiple-standard video decoding applications

Leibo Liu; Chenchen Deng; Dong Wang; Min Zhu; Shouyi Yin; Peng Cao; Shaojun Wei

In this paper, we introduce a coarse-grained dynamically reconfigurable fabric, named the Reconfigurable Processing Unit (RPU), which is implemented on 5.4×3.1 mm² of silicon with TSMC 65 nm LP1P8M technology. This fabric consists of 16×16 multi-functional Processing Elements (PEs) interconnected by an area-efficient Line-Switched Mesh Connect (LSMC) routing. A Hierarchical Configuration Context (HCC) organization scheme is proposed to reduce the scale of the context memory and enhance configuration efficiency. Two reconfigurable processors are then designed and fabricated to verify the proposed techniques. One processor (called REMUS_HPP) integrates two RPUs, targeting high-performance applications. REMUS_HPP can decode 1920×1080 @ 30 fps H.264 streams with 280 mW at 200 MHz, achieving a performance gain of 1.81× and a 14.3× energy efficiency improvement over XPP-III. The other processor (called REMUS_LPP) integrates only one RPU, targeting low-power applications. REMUS_LPP can decode 720×480 @ 35 fps H.264 streams with 24.81 mW at 75 MHz, achieving a 76% power reduction and a 3.96× energy efficiency improvement compared with ADRES. More importantly, the RPU is not limited to video decoding applications. It can also be used to process other computation-intensive applications, and the corresponding analysis is given in this paper as well.
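The power and frame-rate figures quoted above can be turned into energy per decoded frame and per pixel. This is a minimal sketch of that arithmetic, assuming the quoted power is the average power during decoding; the helper names are illustrative and not taken from the paper.

```python
def energy_per_frame_mj(power_mw, fps):
    """Energy per decoded frame in millijoules: mW / (frames/s) = mJ/frame."""
    return power_mw / fps


def energy_per_pixel_nj(power_mw, fps, width, height):
    """Energy per decoded pixel in nanojoules (1 mJ = 1e6 nJ)."""
    return energy_per_frame_mj(power_mw, fps) * 1e6 / (width * height)


# Figures quoted in the abstract.
hpp_frame = energy_per_frame_mj(280.0, 30)                # REMUS_HPP, 1920x1080 @ 30 fps
lpp_frame = energy_per_frame_mj(24.81, 35)                # REMUS_LPP, 720x480 @ 35 fps
hpp_pixel = energy_per_pixel_nj(280.0, 30, 1920, 1080)
lpp_pixel = energy_per_pixel_nj(24.81, 35, 720, 480)

print(f"REMUS_HPP: {hpp_frame:.2f} mJ/frame, {hpp_pixel:.2f} nJ/pixel")
print(f"REMUS_LPP: {lpp_frame:.3f} mJ/frame, {lpp_pixel:.2f} nJ/pixel")
```

The two processors decode different resolutions, so the per-pixel figure is the fairer of the two for a rough comparison.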

IEEE Transactions on Very Large Scale Integration Systems | 2014

SimRPU: A Simulation Environment for Reconfigurable Architecture Exploration

Leibo Liu; Dong Wang; Shouyi Yin; Yingjie Victor Chen; Min Zhu; Shaojun Wei

To assist system architects with fast exploration and performance evaluation of reconfigurable software/hardware architectures, this paper presents a system-level simulator, named SimRPU, for the reconfigurable processing unit (RPU), which is the major computing engine in a reconfigurable processor. The proposed simulator consists of a simulation kernel, a software compiler, a system profiler providing performance, area and power information for the desired architectures, and a system debugger supporting inspection and modification of the internal state of the RPU. Object-oriented hierarchical and parameterized architecture modeling techniques are proposed to satisfy the requirements for a fast and comprehensive evaluation. Cycle-accurate simulation mechanisms are developed to improve the accuracy of the profiled performance data. Compared with the traditional register transfer level (RTL) based simulation scheme, the proposed simulator achieves an average speedup of 18.5× with only a 3.5% reduction in performance estimation accuracy. One reconfigurable processor targeted at high-definition multimedia decoding applications (such as H.264, MPEG2 and AVS) is implemented in a Taiwan Semiconductor Manufacturing Company 65-nm process using the proposed exploration and design flow. The measured results show that the implemented architecture has clear advantages in both performance and power consumption over the reference designs in multimedia decoding applications.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2015

An Efficient Application Mapping Approach for the Co-Optimization of Reliability, Energy, and Performance in Reconfigurable NoC Architectures

Chen Wu; Chenchen Deng; Leibo Liu; Jie Han; Jiqiang Chen; Shouyi Yin; Shaojun Wei

In this paper, an efficient application mapping approach is proposed for the co-optimization of reliability, communication energy, and performance (CoREP) in network-on-chip (NoC)-based reconfigurable architectures. A cost model for the CoREP is developed to evaluate the overall cost of a mapping. In this model, communication energy and latency (as a measure of performance) are first combined in an energy latency product (ELP), and then ELP is co-optimized with reliability by a weight parameter that defines the optimization priority. Both transient and intermittent errors in the NoC are modeled in CoREP. Based on CoREP, a mapping approach, referred to as priority and ratio oriented branch and bound (PRBB), is proposed to derive the best mapping by enumerating all the candidate mappings organized in a search tree. Two techniques, branch node priority recognition and partial cost ratio utilization, are adopted to improve the search efficiency. Experimental results show that the proposed approach achieves significant improvements in reliability, energy, and performance. Compared with the state-of-the-art methods in the same scope, the proposed approach has the following distinctive advantages: 1) CoREP is highly flexible to address various NoC topologies and routing algorithms, while others are limited to some specific topologies and/or routing algorithms; 2) general quantitative evaluations of reliability, energy, and performance are made separately before being integrated into a unified cost model in a general context, while other similar models only touch upon two of them; and 3) CoREP-based PRBB attains a competitive processing speed, which is faster than other mapping approaches.
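The idea of weighting an energy latency product against reliability can be sketched as follows. The abstract does not give the actual CoREP formula, so the weighted sum below, the normalization by the largest ELP, and every name in the code are assumptions made for illustration. The selection step is a plain exhaustive scan over a tiny candidate list, not the PRBB search described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    name: str
    energy: float       # communication energy, arbitrary units
    latency: float      # communication latency, arbitrary units
    reliability: float  # probability in [0, 1] that communication succeeds


def elp(m):
    """Energy latency product of one candidate mapping."""
    return m.energy * m.latency


def cost(m, weight, elp_ref):
    # ASSUMED form: weight = 1 optimizes reliability only, weight = 0 optimizes ELP only.
    return weight * (1.0 - m.reliability) + (1.0 - weight) * (elp(m) / elp_ref)


def best_mapping(candidates, weight):
    """Pick the lowest-cost candidate; ELP is normalized by the largest ELP seen."""
    elp_ref = max(elp(m) for m in candidates)
    return min(candidates, key=lambda m: cost(m, weight, elp_ref))


candidates = [
    Mapping("A", energy=10.0, latency=4.0, reliability=0.90),  # cheaper, less reliable
    Mapping("B", energy=12.0, latency=5.0, reliability=0.99),  # costlier, more reliable
]

for w in (0.0, 0.5, 1.0):
    print(w, best_mapping(candidates, w).name)
```

Sweeping the weight shows how the preferred mapping shifts from the low-ELP candidate to the high-reliability one as the optimization priority changes.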

Sensors | 2015

Fast Traffic Sign Recognition with a Rotation Invariant Binary Pattern Based Feature

Shouyi Yin; Peng Ouyang; Leibo Liu; Yike Guo; Shaojun Wei

Robust and fast traffic sign recognition is very important but difficult for safe driving assistance systems. This study addresses fast and robust traffic sign recognition to enhance driving safety. The proposed method includes three stages. First, a typical Hough transformation is adopted to implement coarse-grained location of the candidate regions of traffic signs. Second, a RIBP (Rotation Invariant Binary Pattern) based feature in the affine and Gaussian space is proposed to reduce the time of traffic sign detection and achieve robust traffic sign detection in terms of scale, rotation, and illumination. Third, ANN (Artificial Neural Network) based feature dimension reduction and classification techniques are designed to reduce the traffic sign recognition time. Compared with existing work, the experimental results on public datasets show that this work achieves robustness in traffic sign recognition with comparable recognition accuracy and faster processing speed, including training speed and recognition speed.
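To give a feel for what a rotation-invariant binary pattern is, the sketch below implements the classic rotation-invariant local binary pattern on a 3×3 patch. This is a well-known textbook construction swapped in for illustration; it is not the RIBP feature in the affine and Gaussian space proposed in the paper, and all function names are hypothetical.

```python
# Neighbour offsets (row, col) in clockwise order, starting at the top-left.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch):
    """8-bit local binary pattern of a 3x3 patch given as a list of 3 rows."""
    center = patch[1][1]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if patch[1 + dr][1 + dc] >= center:
            code |= 1 << bit
    return code


def rotation_invariant(code, bits=8):
    """Smallest value among all circular bit rotations of the code."""
    mask = (1 << bits) - 1
    return min(((code >> k) | (code << (bits - k))) & mask for k in range(bits))


def rotate90(patch):
    """Rotate a square patch clockwise by 90 degrees."""
    return [list(row) for row in zip(*patch[::-1])]


patch = [[9, 1, 1],
         [1, 5, 1],
         [1, 1, 8]]
rotated = rotate90(patch)

# The raw codes differ, but their rotation-invariant forms agree.
print(lbp_code(patch), lbp_code(rotated))
print(rotation_invariant(lbp_code(patch)), rotation_invariant(lbp_code(rotated)))
```

With eight neighbours on a 3×3 grid the invariance is exact only for rotations in 90-degree steps; finer rotations need interpolated sampling on a circle.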

IEEE Transactions on Very Large Scale Integration Systems | 2014

On-Chip Memory Hierarchy in One Coarse-Grained Reconfigurable Architecture to Compress Memory Space and to Reduce Reconfiguration Time and Data-Reference Time

Yansheng Wang; Leibo Liu; Shouyi Yin; Min Zhu; Peng Cao; Jun Yang; Shaojun Wei

The coarse-grained reconfigurable architecture (CGRA) has proven to be energy efficient in several specific domains. In CGRAs, the on-chip memory hierarchy, which contains the context memory and the data memory organizations, should be carefully designed to achieve appropriate tradeoffs among three aspects: 1) performance; 2) area; and 3) power. In this paper, two techniques, the hierarchical configuration context (HCC) and the lifetime-based data-memory organization (LDO), which focus on the context memory and the data memory organizations respectively, are proposed to compress the on-chip memory space and to reduce the reconfiguration time and the data-reference time. In the HCC, the contexts are constructed in a hierarchical fashion to completely eliminate the repetitive portions of the contexts, not only reducing the overall context storage, but also alleviating the context transportation overhead. A fast context-indexing mechanism in the HCC is proposed to achieve fast reconfiguration, as the hierarchically organized contexts can be located and accessed conveniently. In the LDO, the on-chip data are classified into two types, based on the lifetime of data. The short-lifetime data are stored in a first-in, first-out (FIFO) buffer to increase the reuse ratio of memory space automatically, whereas the long-lifetime data are stored in random access memory (RAM) so that they can be referenced several times. The HCC and the LDO are used in a CGRA core called the reconfigurable processing unit (RPU). Two RPUs are integrated in a reconfigurable computing processor (RCP) called the REconfigurable MUlti-media System, High-Performance Processor (REMUS_HPP). Because of the HCC, compared with a traditional nonhierarchical system, the total context storage required in H.264 decoding is reduced by 77%. Because of the LDO, the normalized on-chip data memory size at the same performance level in the REMUS_HPP is only 23.8% and 14.8% of those in XPP-III (a high-performance RCP) and ADRES (a low-power RCP), respectively. REMUS_HPP is implemented on 48.9 mm² of silicon with TSMC 65-nm technology, using a 200-MHz working frequency to achieve 1920 × 1088 at 30 fps H.264 high-profile decoding. Compared with XPP-III, the performance of the REMUS_HPP is 1.81× higher, and the energy efficiency is 4.75× higher.
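The storage saving that comes from removing repeated context portions can be illustrated with a simple two-level table. The abstract does not describe the real HCC layout or its indexing mechanism, so the fixed block length, the shared block table, and all names below are assumptions; the sketch only shows why sharing repeated blocks shrinks total storage.

```python
def build_hierarchy(contexts, block_len):
    """Split each context into fixed-size blocks and store each distinct block once."""
    table = []      # lower level: unique blocks
    index_of = {}   # block -> position in table
    indexed = []    # upper level: one list of block indices per context
    for words in contexts:
        refs = []
        for start in range(0, len(words), block_len):
            block = tuple(words[start:start + block_len])
            if block not in index_of:
                index_of[block] = len(table)
                table.append(block)
            refs.append(index_of[block])
        indexed.append(refs)
    return table, indexed


def rebuild(table, refs):
    """Expand one context back into its flat word list."""
    return [word for i in refs for word in table[i]]


def storage_words(table, indexed):
    """Total storage, counting each index as one word (a simplification)."""
    return sum(len(block) for block in table) + sum(len(refs) for refs in indexed)


# Three made-up contexts of eight configuration words that share blocks.
contexts = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 9, 9, 9, 9],
    [5, 6, 7, 8, 9, 9, 9, 9],
]
table, indexed = build_hierarchy(contexts, block_len=4)
flat = sum(len(c) for c in contexts)
print(f"flat: {flat} words, hierarchical: {storage_words(table, indexed)} words")
```

The indices also act as a lookup key, which hints at why a hierarchical layout can make locating a context cheap, although the paper's actual indexing scheme may differ.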

Design Automation Conference | 2015

Efficient memory partitioning for parallel data access in multidimensional arrays

Chenyue Meng; Shouyi Yin; Peng Ouyang; Leibo Liu; Shaojun Wei

Memory bandwidth bottlenecks severely restrict parallel access of data from memory arrays. To increase bandwidth, memory partitioning algorithms have been proposed to access multiple memory banks simultaneously. However, previous partitioning schemes rely on complex partitioning algorithms, which lead to non-optimal memory bank space utilization and unnecessary storage overhead. In this paper, we develop an efficient memory partitioning strategy with low time complexity and low storage overhead for data access in multidimensional arrays. Experimental results show that our memory partitioning algorithm saves up to 93.7% of arithmetic operations, 96.9% of execution time and 31.1% of storage overhead, compared to the state-of-the-art approach.
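The goal of memory partitioning can be made concrete with a generic cyclic scheme in which element x of a multidimensional array goes to bank (a · x) mod N. The sketch below searches by brute force for the smallest N and a coefficient vector a such that all elements of an access pattern land in different banks. It is an illustration of the conflict-free requirement only, not the strategy proposed in the paper, and the exhaustive search is exactly the kind of cost a low-complexity algorithm avoids. All names are hypothetical.

```python
from itertools import product


def bank_of(index, coeffs, n_banks):
    """Bank holding the array element at a multidimensional index."""
    return sum(a * x for a, x in zip(coeffs, index)) % n_banks


def is_conflict_free(pattern, coeffs, n_banks):
    """True if every offset in the pattern maps to a different bank."""
    banks = {bank_of(offset, coeffs, n_banks) for offset in pattern}
    return len(banks) == len(pattern)


def find_partition(pattern, max_banks=32):
    """Smallest bank count and first coefficient vector giving conflict-free access."""
    dims = len(pattern[0])
    for n_banks in range(len(pattern), max_banks + 1):
        for coeffs in product(range(n_banks), repeat=dims):
            if is_conflict_free(pattern, coeffs, n_banks):
                return n_banks, coeffs
    return None


# Offsets read together by a 5-point stencil around the current element.
stencil = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
n_banks, coeffs = find_partition(stencil)
print(n_banks, coeffs)

# The mapping is linear, so the result holds for any base position.
base = (3, 7)
accessed = [(base[0] + di, base[1] + dj) for di, dj in stencil]
print(sorted(bank_of(p, coeffs, n_banks) for p in accessed))
```

Because shifting the base position adds the same constant to every bank number modulo N, checking the offsets alone is enough to guarantee conflict-free access everywhere in the array.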

Journal of Systems Architecture | 2013

A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration

Yu Ren; Leibo Liu; Shouyi Yin; Jie Han; Qinghua Wu; Shaojun Wei

Network-on-Chip (NoC) is widely used as a communication scheme in modern many-core systems. To guarantee the reliability of communication, effective fault tolerant techniques are critical for an NoC. In this paper, a novel fault tolerant architecture employing redundant routers is proposed to maintain the functionality of a network in the presence of failures. This architecture consists of a mesh of 2×2 router blocks with a spare router placed in the center of each block. This spare router provides a viable alternative when a router fails in a block. The proposed fault-tolerant architecture is therefore referred to as a quad-spare mesh. The quad-spare mesh can be dynamically reconfigured by changing control signals without altering the underlying topology. This dynamic reconfiguration and its corresponding routing algorithm are demonstrated in detail. Since the topology after reconfiguration is consistent with the original error-free 2D mesh, the proposed design is transparent to operating systems and application software. Experimental results show that the proposed design achieves significant improvements in reliability compared with those reported in the literature. With a single router failure, throughput decreases by only 5.19% and latency increases by 2.40% relative to the error-free system, at the cost of about 45.9% hardware redundancy.
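The block structure described above can be sketched as a small lookup from logical mesh coordinates to the physical router that serves them. The sketch assumes that each spare is itself healthy and can stand in for at most one failed router in its 2×2 block, which the abstract implies but does not state. The control signals and routing algorithm of the paper are not modeled, and all names are illustrative.

```python
def block_of(x, y):
    """Coordinates of the 2x2 block that contains router (x, y)."""
    return (x // 2, y // 2)


def resolve(x, y, faulty):
    """Physical router serving logical position (x, y), or None if it cannot be repaired.

    ASSUMPTION: one healthy spare per block, covering at most one failure in that block.
    """
    if (x, y) not in faulty:
        return ("router", x, y)
    block = block_of(x, y)
    failures_in_block = [r for r in faulty if block_of(*r) == block]
    if len(failures_in_block) == 1:
        return ("spare",) + block
    return None


# A 4x4 mesh with one failure in block (0, 0) and two failures in block (1, 1).
faulty = {(1, 0), (2, 2), (3, 3)}
print(resolve(0, 0, faulty))  # healthy router, used directly
print(resolve(1, 0, faulty))  # single failure in its block, replaced by the spare
print(resolve(2, 2, faulty))  # two failures share one spare, not repairable here
```

Because the lookup keeps logical coordinates unchanged, software addressing the mesh sees the same 2D grid before and after a repair, which matches the transparency claim in the abstract.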

International Symposium on Circuits and Systems | 2010

A VLSI design of sensor node for wireless image sensor network

Renyan Zhou; Leibo Liu; Shouyi Yin; Ao Luo; Xinkai Chen; Shaojun Wei

This paper presents a single-chip VLSI architecture of a wireless image sensor node, which consists of an enhanced embedded 8051 microcontroller, a CMOS camera interface and hardware accelerators. The algorithms and control flows of the IEEE 802.15.4 MAC layer are accelerated by hardware, resulting in 45% smaller code size compared with the conventional software stack. An innovative CFA preprocessing algorithm and a JPEG-LS compression method are adopted and implemented in hardware, achieving a minimum PSNR of 46.3 dB, an average compression ratio of about 3.0 bit/pixel and approximately 5 fps at a 16 MHz system clock. Furthermore, low-power design techniques are employed to extend battery life, resulting in a maximum system power consumption of 60 mW in a 0.18 µm CMOS process when the SoC is in full working mode (i.e. processor, image processing and wireless communication are active simultaneously).
