Yinhe Han
Chinese Academy of Sciences
Publications
Featured research published by Yinhe Han.
IEEE Transactions on Very Large Scale Integration Systems | 2009
Lei Zhang; Yinhe Han; Qiang Xu; Xiaowei Li; Huawei Li
Homogeneous manycore systems are emerging for tera-scale computation and typically utilize a Network-on-Chip (NoC) as the communication scheme between embedded cores. Effective defect tolerance techniques are essential to improve the yield of such complex integrated circuits. We propose to achieve fault tolerance by employing redundancy at the core level instead of at the microarchitecture level. When faulty cores exist on-chip in this architecture, however, the physical topologies of individual manufactured chips can differ significantly. How to reconfigure the system with the most effective NoC topology is a relevant research problem. In this paper, we first show that this problem is an instance of a well-known NP-complete problem. We then present novel solutions that not only maximize the performance of the on-chip communication scheme, but also provide a unified topology to the operating system and application software running on the processor. Experimental results show the effectiveness of the proposed techniques.
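To make the reconfiguration problem concrete, here is a minimal Python sketch (not the algorithm from the paper): it maps a small virtual mesh onto the healthy cores of a physical mesh containing one hypothetical faulty core, choosing the assignment that minimizes total hop distance over the virtual links. The mesh sizes and fault location are invented, and the exhaustive search is only feasible at toy scale, consistent with the problem being NP-complete.
```python
# Illustrative sketch only, not the paper's algorithm: map a virtual 4x2 mesh
# onto the healthy cores of a physical 3x3 mesh with one assumed faulty core,
# minimizing total physical hop distance over the virtual links.
from itertools import permutations

MESH_W, MESH_H = 3, 3                       # hypothetical physical mesh
FAULTY = {(2, 2)}                           # hypothetical faulty core
healthy = [(x, y) for x in range(MESH_W) for y in range(MESH_H)
           if (x, y) not in FAULTY]

V_W, V_H = 4, 2                             # unified virtual topology: 4x2 mesh
v_nodes = [(x, y) for x in range(V_W) for y in range(V_H)]
v_edges = [(a, b) for a in v_nodes for b in v_nodes
           if a < b and abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1]

def hops(p, q):
    # Manhattan (XY-routing) hop count on the physical mesh.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def cost(mapping):
    # Total physical hop count summed over all virtual links.
    return sum(hops(mapping[a], mapping[b]) for a, b in v_edges)

# Exhaustive search is only feasible at toy sizes; the general problem is
# NP-complete, which is why heuristic solutions are needed.
best = min((dict(zip(v_nodes, perm))
            for perm in permutations(healthy, len(v_nodes))),
           key=cost)
print("best total hop cost:", cost(best))
```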
Design, Automation, and Test in Europe | 2008
Lei Zhang; Yinhe Han; Qiang Xu; Xiaowei Li
Homogeneous manycore processors are emerging for tera-scale computation. Effective defect tolerance techniques are essential to improve the yield of such complex integrated circuits. In this paper, we propose to achieve fault tolerance by employing redundancy at the core level instead of at the microarchitecture level. When faulty cores exist on-chip in this architecture, how to reconfigure the processor with the most effective topology is a relevant research problem. We present novel solutions for this problem, which not only maximize the performance of the manycore processor, but also provide a unified topology to the operating system and application software running on the processor. Experimental results show the effectiveness of the proposed techniques.
Design Automation Conference | 2011
Jianbo Dong; Lei Zhang; Yinhe Han; Ying Wang; Xiaowei Li
The limited write endurance of phase change random access memory (PRAM) is one of the major obstacles for PRAM-based main memory. Wear leveling techniques have been proposed to extend its lifetime by balancing write traffic. Another important concern that needs to be considered is endurance variation in PRAM chips. When different PRAM cells have distinct endurance, balanced writes will result in lifetime degradation due to the weakest cells. Instead of balancing write traffic, in this paper we propose wear rate leveling (WRL), a variant of wear leveling, to balance the wear rates (i.e., write traffic/endurance) of cells across the PRAM chip. After investigating the write behavior of applications and endurance variation, we propose an architecture-level WRL mechanism. Moreover, there is an important tradeoff between endurance improvement and the volume of swapped data. To co-optimize endurance and swapping, a novel algorithm, Max Hyper-weight Rematching, is proposed to maximize PRAM lifetime and minimize performance degradation. Experimental results show a 19x endurance improvement over prior wear leveling techniques.
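The following Python sketch illustrates the wear-rate-leveling idea only; it is not the paper's Max Hyper-weight Rematching algorithm. It periodically remaps the page on the most-worn frame (highest writes/endurance ratio) onto the least-worn frame, so frames with weak cells absorb less traffic. The endurance values, write intensities, and epoch structure are all hypothetical.
```python
# Illustrative sketch only, not the paper's Max Hyper-weight Rematching:
# balance wear *rates* (writes/endurance) rather than raw write counts, so
# weak cells absorb less traffic.  All numbers here are invented.
import random
random.seed(0)

N = 8
endurance = [random.randint(5_000, 20_000) for _ in range(N)]  # per-frame limit
writes = [0] * N                       # accumulated writes per frame
page_of_frame = list(range(N))         # frame i currently holds this page
hot_pages = {0, 1}                     # pages attracting most write traffic

def wear_rate(frame):
    return writes[frame] / endurance[frame]

for epoch in range(200):
    # One epoch of traffic: hot pages get 100 writes, the rest get 5.
    for frame, page in enumerate(page_of_frame):
        writes[frame] += 100 if page in hot_pages else 5
    # Rate-leveling step: swap the page on the most-worn frame with the
    # page on the least-worn frame (swap cost ignored in this sketch).
    hot, cool = max(range(N), key=wear_rate), min(range(N), key=wear_rate)
    page_of_frame[hot], page_of_frame[cool] = \
        page_of_frame[cool], page_of_frame[hot]

rates = [wear_rate(f) for f in range(N)]
print("max-min wear-rate spread:", round(max(rates) - min(rates), 4))
```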
High-Performance Computer Architecture | 2012
Guihai Yan; Yingmin Li; Yinhe Han; Xiaowei Li; Minyi Guo; Xiaoyao Liang
The widening gap between the fast-increasing transistor budget and the slow-growing power delivery and system cooling capability calls for novel architectural solutions to boost energy efficiency. Leveraging the surging “dark silicon” area, we propose a hybrid scheme, called “AgileRegulator”, that uses both on-chip and off-chip voltage regulators in a multicore system to exploit both coarse-grain and fine-grain power phases. We present two complementary algorithms, Sensitivity-Aware Application Scheduling (SAAS) and Responsiveness-Aware Application Scheduling (RAAS), to maximally realize the energy saving potential of the hybrid regulator scheme. Experimental results show that the hybrid scheme achieves performance-energy efficiency close to per-core DVFS without imposing much design cost. Meanwhile, the silicon overhead of the scheme is well contained within the “dark silicon”. Unlike application-specific schemes based on accelerators, the proposed scheme is a simple and universal solution for chip area and energy trade-offs.
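A minimal sketch of the scheduling intuition follows; it is not the paper's SAAS or RAAS algorithm. Applications with rapidly changing power phases are assigned to the few cores behind fast on-chip regulators, while slow-phase applications run on the off-chip regulator domain. The phase durations, core count, and responsiveness cutoff are assumptions made for the example.
```python
# Illustrative sketch only, not the paper's SAAS/RAAS algorithms: route
# fast-phase applications to cores behind on-chip regulators and slow-phase
# applications to the off-chip regulator domain.  All values are assumed.

apps = {               # app -> average power-phase duration in microseconds
    "fft":      5,     # rapidly alternating compute/memory phases
    "stream":   2,
    "encode": 800,     # long, stable phases
    "idlebg": 5000,
}
ON_CHIP_CORES = 2      # cores with fast per-core on-chip regulators
PHASE_CUTOFF = 50      # assumed regulator-responsiveness threshold (us)

# Shortest phases first: they gain the most from fine-grain voltage scaling.
ranked = sorted(apps, key=apps.get)
on_chip = [a for a in ranked if apps[a] < PHASE_CUTOFF][:ON_CHIP_CORES]
off_chip = [a for a in ranked if a not in on_chip]

print("on-chip regulator cores  :", on_chip)
print("off-chip regulator domain:", off_chip)
```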
IEEE Transactions on Very Large Scale Integration Systems | 2013
Yuanqing Cheng; Lei Zhang; Yinhe Han; Xiaowei Li
3-D technology that stacks silicon dies with through silicon vias (TSVs) is a promising solution to overcome the interconnect scaling problem in giga-scale integrated circuits (ICs). Thermal dissipation is a major challenge for 3-D integration, and prior thermal-balanced task scheduling methods for 3-D multiprocessor system-on-chips (MPSoCs) typically balance the power gradient across vertical stacks based on the assumption of strong thermal correlation among processing cores within a stack. On the other hand, 3-D MPSoCs typically employ a network-on-chip (NoC) as the communication infrastructure, which consumes a large portion of the energy budget. Because TSVs consume much less energy than horizontal links when transmitting the same amount of data, due to the reduced interconnect distance between vertically adjacent cores, it is attractive to allocate heavily communicating tasks within the same vertical stack as much as possible, restricting traffic to the third dimension and reducing interconnect energy. However, aggregating active tasks within the same stack may exacerbate power density and result in hot spots. In this paper, we explore the tradeoff between thermal constraints and interconnect energy when allocating tasks in 3-D homogeneous MPSoCs, and propose an efficient heuristic. Experimental results show that the proposed technique can reduce interconnect energy by more than 25% on average with almost the same peak temperature when compared with prior thermal-balanced solutions.
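The sketch below is a simplified stand-in for the paper's heuristic: it greedily co-locates the heaviest-communicating task pairs in one vertical stack to exploit cheap TSV links, but caps each stack's power as a crude thermal proxy. Task powers, traffic volumes, and the power cap are all invented.
```python
# Illustrative sketch only, not the paper's heuristic: co-locate heavily
# communicating task pairs in one vertical stack (cheap TSV links) unless the
# stack's power budget -- a crude thermal proxy -- would be exceeded.

STACKS, LAYERS = 4, 4              # 4 vertical stacks, 4 cores per stack
POWER_CAP = 9.0                    # per-stack power budget (W)

tasks = {f"t{i}": 2.0 + 0.2 * i for i in range(8)}      # task -> power (W)
traffic = {("t0", "t1"): 90, ("t2", "t3"): 70,          # pair -> comm volume
           ("t4", "t5"): 40, ("t6", "t7"): 30,
           ("t0", "t2"): 10, ("t1", "t4"): 5}

stack_of = {}
stack_power = [0.0] * STACKS
stack_load = [0] * STACKS

def fits(task, stack):
    return (stack_load[stack] < LAYERS and
            stack_power[stack] + tasks[task] <= POWER_CAP)

def place(task, stack):
    stack_of[task] = stack
    stack_power[stack] += tasks[task]
    stack_load[stack] += 1

# Heaviest-traffic pairs first: keep them in one stack when thermally safe.
for (a, b), _vol in sorted(traffic.items(), key=lambda kv: -kv[1]):
    for t in (a, b):
        if t in stack_of:
            continue
        partner = b if t == a else a
        want = stack_of.get(partner)
        if want is not None and fits(t, want):
            place(t, want)                   # co-locate on vertical TSV links
        else:                                # otherwise pick the coolest stack
            place(t, min((s for s in range(STACKS) if fits(t, s)),
                         key=lambda s: stack_power[s]))

print({s: sorted(t for t in stack_of if stack_of[t] == s)
       for s in range(STACKS)})
```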
Design Automation Conference | 2016
Ying Wang; Jie Xu; Yinhe Han; Huawei Li; Xiaowei Li
Recent advances in Neural Networks (NNs) are enabling more and more innovative applications. As an energy-efficient hardware solution, machine learning accelerators for CNNs or traditional ANNs are also gaining popularity in embedded vision, robotics, and cyber-physical systems. However, the design parameters of NN models vary significantly from application to application. Hence, it is hard to provide one general and highly efficient hardware solution to accommodate all of them, and it is also impractical for domain-specific developers to customize their own hardware targeting a specific NN model. To deal with this dilemma, this study proposes a design automation tool, DeepBurning, allowing application developers to build from scratch learning accelerators that target their specific NN models with custom configurations and optimized performance. DeepBurning includes an RTL-level accelerator generator and a coordinated compiler that generates the control flow and data layout under the user-specified constraints. The results can be used to implement an FPGA-based NN accelerator or to help generate chip designs at an early design stage. In general, DeepBurning supports a large family of NN models and greatly simplifies the design flow of NN accelerators for machine learning or AI application developers. The evaluation shows that the generated learning accelerators burnt to our FPGA board exhibit great power efficiency compared to state-of-the-art FPGA-based solutions.
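To convey the flavor of the flow (DeepBurning itself emits RTL and compiler artifacts, which this sketch does not attempt), the toy Python generator below derives a few accelerator parameters from an NN description under user resource constraints. The model, constraints, and sizing rules are all hypothetical.
```python
# Illustrative sketch only -- DeepBurning emits RTL and a coordinated
# compiler output; this toy just shows the flow's flavor: derive accelerator
# parameters from an NN description under user-specified constraints.

layers = [                    # (name, output channels, kernel w, kernel h)
    ("conv1", 32, 3, 3),
    ("conv2", 64, 3, 3),
    ("fc1", 256, 1, 1),
]
MAX_MACS = 512                # user constraint: multiply-accumulate units
MAX_BUF_KB = 64               # user constraint: on-chip buffer budget (KB)

# Size the PE array for the widest layer, clipped to the MAC budget.
widest = max(out_ch * kw * kh for _name, out_ch, kw, kh in layers)
pe_count = min(MAX_MACS, widest)

# Size the weight buffer to hold one kernel per PE (16-bit weights assumed).
max_kernel = max(kw * kh for _name, _oc, kw, kh in layers)
buf_kb = min(MAX_BUF_KB, (pe_count * max_kernel * 2 + 1023) // 1024)

print("generated accelerator config (toy):")
print("  PE array     :", pe_count, "MACs")
print("  weight buffer:", buf_kb, "KB")
```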
Design Automation Conference | 2016
Lili Song; Ying Wang; Yinhe Han; Xin Zhao; Bosheng Liu; Xiaowei Li
Convolutional neural network (CNN) accelerators have been proposed as an efficient hardware solution for deep learning based applications, which are known to be both compute- and memory-intensive. Although the most advanced CNN accelerators can deliver high computational throughput, their performance is highly unstable. Once changed to accommodate a new network with different parameters such as layers and kernel size, the fixed hardware structure may no longer match the data flows well. Consequently, the accelerator will fail to deliver high performance due to underutilization of either logic resources or memory bandwidth. To overcome this problem, we propose a novel deep learning accelerator that offers multiple types of data-level parallelism: inter-kernel, intra-kernel, and hybrid. Our design can adaptively switch among the three types of parallelism and the corresponding data tiling schemes to dynamically match different networks or even different layers of a single network. No matter how the hardware configurations or network types change, the proposed network mapping strategy ensures optimal performance and energy efficiency. Compared with previous state-of-the-art NN accelerators, it is possible to achieve a speedup of 4.0x-8.3x for some layers of well-known large-scale CNNs. For the whole network forward-propagation phase, our design achieves 28.04% PE energy saving and 90.3% on-chip memory energy saving on average.
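The Python sketch below illustrates the per-layer selection idea only, not the paper's actual mapper: for each layer it estimates PE-array utilization under inter-kernel, intra-kernel, and a simple hybrid parallelism, and picks the best. The layer shapes, PE count, and utilization model are invented for the example.
```python
# Illustrative sketch only, not the paper's mapper: per layer, estimate PE
# utilization under inter-kernel, intra-kernel, and a simple hybrid
# parallelism, then pick the best.  Shapes and the model are invented.
import math

PES = 64                                  # fixed processing-element count

def utilization(work_units):
    # Fraction of PEs kept busy when work_units are spread across rounds.
    rounds = math.ceil(work_units / PES)
    return work_units / (rounds * PES)

layers = [                                # (name, #kernels, MACs per kernel)
    ("conv1", 16, 27 * 224 * 224),
    ("conv3", 384, 9 * 13 * 13),
    ("fc6", 4096, 1),
]

for name, kernels, macs_per_kernel in layers:
    candidates = {
        "inter-kernel": utilization(kernels),          # one kernel per PE
        "intra-kernel": utilization(macs_per_kernel),  # split one kernel
        # Hybrid: split each kernel four ways when it is large enough.
        "hybrid": utilization(kernels * 4 if macs_per_kernel >= 4 else kernels),
    }
    best = max(candidates, key=candidates.get)
    print(f"{name}: {best} (utilization {candidates[best]:.2f})")
```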
IEEE Transactions on Very Large Scale Integration Systems | 2007
Yinhe Han; Yu Hu; Xiaowei Li; Huawei Li; Anshuman Chandra
An embedded test stimulus decompressor is presented for test pattern decompression, which can reduce the required channels and vector memory of automatic test equipment (ATE) for complex processor circuits. The proposed decompressor mainly consists of a periodically alterable MUX network, which has multiple configurations to decode the input information flexibly and efficiently. In order to reduce the number of test patterns and configurations, a test pattern compaction algorithm using CI-Graph merging is proposed. With the proposed periodically alterable MUX network and the pattern compaction algorithm, a smaller test data volume and fewer required external pins can be achieved compared to previous techniques.
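A minimal Python sketch of the decompression idea follows (the real design is a hardware MUX network, and the configurations here are made up): a few ATE channels fan out to many scan chains, with the channel-to-chain routing cycling through a small set of configurations every few shift cycles.
```python
# Illustrative sketch only, not the paper's hardware design: a few ATE
# channels fan out to many scan chains, and the channel-to-chain routing
# cycles through a small set of configurations.  All data here is made up.

CHANNELS = 2                   # external ATE channels
CHAINS = 6                     # internal scan chains to feed
PERIOD = 4                     # switch configurations every 4 shift cycles

# Each configuration maps every scan chain to one input channel.
configs = [
    [0, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
]

def decompress(channel_stream):
    # Expand per-cycle channel bits into per-cycle scan-chain bits.
    expanded = []
    for cycle, channel_bits in enumerate(channel_stream):
        cfg = configs[(cycle // PERIOD) % len(configs)]
        expanded.append([channel_bits[cfg[chain]] for chain in range(CHAINS)])
    return expanded

# 8 shift cycles of compressed data on 2 channels -> 8 cycles on 6 chains.
stream = [[(c + t) % 2 for c in range(CHANNELS)] for t in range(8)]
for row in decompress(stream):
    print(row)
```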
Asian Test Symposium | 2003
Yinhe Han; Yongjun Xu; Huawei Li; Xiaowei Li; Anshuman Chandra
This paper presents a test resource partitioning technique based on an efficient single-output response compaction design called the quotient compactor (q-Compactor). Several design theorems for the quotient compactor are presented to achieve full diagnostic ability, minimize error cancellation, and handle the X bits in the outputs of the CUT. The quotient compactor can also be moved to the load-board to reduce the number of ATE channels required. Our experimental results on the ISCAS'89 benchmark circuits and an MPEG-2 decoder SoC show that the proposed compaction scheme is very efficient.
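As a rough illustration of what a quotient compactor computes, the Python sketch below performs serial polynomial division over GF(2) and keeps the quotient bit stream as the compacted response; the characteristic polynomial and response bits are arbitrary examples, not values from the paper.
```python
# Illustrative sketch only: a quotient compactor keeps the *quotient* of
# dividing the response stream by a characteristic polynomial over GF(2),
# i.e., an LFSR-style divider whose serial output is retained instead of
# the remainder.  The polynomial and response bits are arbitrary examples.

POLY = [1, 0, 1, 1]            # x^3 + x + 1, coefficients MSB first

def quotient_compact(response):
    # Serial GF(2) division: one quotient bit emitted per input bit.
    state = [0] * (len(POLY) - 1)          # divider register, MSB at index 0
    out = []
    for bit in response:
        q = state[0]                       # quotient bit leaving the register
        out.append(q)
        state = state[1:] + [bit]          # shift in the next response bit
        if q:                              # subtract (XOR) the divisor
            state = [s ^ p for s, p in zip(state, POLY[1:])]
    return out

resp = [1, 0, 1, 1, 0, 0, 1, 0]
print("quotient signature:", quotient_compact(resp))
```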
International Test Conference | 2010
Huawei Li; Dawen Xu; Yinhe Han; Kwang-Ting Cheng; Xiaowei Li
We present nGFSIM, a GPU-based fault simulator for stuck-at faults, which can report the fault coverage of one- to n-detection for any specified integer n using only a single run of fault simulation. nGFSIM, which exploits the massive parallelism of the GPU architecture and optimizes memory access and usage, enables accelerated fault simulation without the need for fault dropping. We show that nGFSIM offers a 25x speedup in comparison with a commercial tool and enables new applications in test selection.
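The sketch below shows, in Python rather than on a GPU, the kind of bookkeeping that one- to n-detection reporting requires: pattern-parallel stuck-at simulation with no fault dropping, so every fault retains a full per-pattern detection count. The toy circuit, fault list, and bit-packing scheme are illustrative assumptions, not nGFSIM's implementation.
```python
# Illustrative sketch only, not nGFSIM: pattern-parallel stuck-at simulation
# with *no fault dropping*, so every fault keeps a full per-pattern detection
# count -- the data needed to report 1- to n-detection coverage.  The toy
# circuit c = (a AND b) XOR b, fault list, and bit packing are assumptions.

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]
W = len(PATTERNS)
ALL = (1 << W) - 1
# Pack each input's values across all patterns into one integer bit-vector.
A = sum(a << i for i, (a, _b) in enumerate(PATTERNS))
B = sum(b << i for i, (_a, b) in enumerate(PATTERNS))

def simulate(a, b, fault=None):
    # Evaluate c = (a & b) ^ b with an optional stuck-at fault on the AND net.
    n_and = a & b
    if fault == ("and", 0):
        n_and = 0
    elif fault == ("and", 1):
        n_and = ALL
    return n_and ^ b

good = simulate(A, B)
for fault in [("and", 0), ("and", 1)]:      # no fault dropping: keep them all
    diff = simulate(A, B, fault) ^ good     # bit i set => pattern i detects it
    print(f"stuck-at-{fault[1]} on AND output: "
          f"detected by {bin(diff).count('1')} of {W} patterns")
```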