Y Yifan He
Eindhoven University of Technology
Publication
Featured research published by Y Yifan He.
international conference on embedded computer systems: architectures, modeling, and simulation | 2011
Y Yifan He; D Dongrui She; B Bart Mesman; Henk Corporaal
Transport Triggered Architectures (TTAs) possess many advantages, such as modularity, flexibility, and scalability. As an exposed datapath architecture, a TTA can effectively reduce register file (RF) pressure, both in the number of accesses and in the number of RF ports. However, conventional TTAs also have some evident disadvantages, such as relatively low code density, wasted dynamic power due to separate scheduling of source operands, and inefficient support for variant immediate values. To preserve the merits of conventional TTAs while solving these issues, we propose MOVE-Pro, a novel low-power, high-code-density TTA architecture. With optimizations at the instruction set architecture (ISA), architecture, circuit, and compiler levels, the low-power potential of TTAs is fully exploited. Moreover, with much denser code, TTA performance also improves accordingly. In a head-to-head comparison between a two-issue MOVE-Pro processor and its RISC counterpart, we show that RF accesses can be reduced by up to 80%, and that the reduction in RF power is successfully transferred to a total core power saving. Our MOVE-Pro processor achieves up to an 11% reduction in total core power, while its code density is almost the same as that of its RISC counterpart.
design, automation, and test in europe | 2012
D Dongrui She; Y Yifan He; B Bart Mesman; Henk Corporaal
In modern processor architectures, the register file (RF) consumes a considerable amount of the processor power. It is well known that by giving software explicit fine-grained control over the datapath, transport-triggered architectures (TTAs) can substantially reduce RF traffic, thereby minimizing RF energy. However, it is important to make sure that the gain in the RF is not cancelled out by the overhead of fine-grained datapath control, in particular the deterioration of code density in conventional TTAs. In this paper, we analyze the potential of minimizing RF energy in MOVE-Pro, a TTA-based processor framework. We present a flexible compiler backend, which performs energy-aware instruction scheduling to push the limit of RF energy reduction. The experimental results show that with the proposed energy-aware compiler backend, MOVE-Pro is able to reduce RF energy by up to 80% compared to its RISC/VLIW counterparts. Meanwhile, the code density of MOVE-Pro remains at the same level as its RISC/VLIW counterparts, allowing the energy saving in the RF to be successfully transferred to a total energy saving.
advanced concepts for intelligent vision systems | 2011
Y Yifan He; Z Zhenyu Ye; D Dongrui She; B Bart Mesman; Henk Corporaal
Visual servoing has been proven to obtain better performance than mechanical encoders for position acquisition. However, the often computationally intensive vision algorithms and the ever-growing demand for higher frame rates make its realization very challenging. This work performs a case study on a typical industrial application, organic light emitting diode (OLED) screen printing, and demonstrates the feasibility of achieving ultra-high frame rate visual servoing on both field programmable gate array (FPGA) and single instruction multiple data (SIMD) processors. We optimize the existing vision processing algorithm and propose a scalable FPGA implementation, which processes a frame within 102 µs. Though a dedicated FPGA implementation is extremely efficient, its lack of flexibility and considerable implementation time are two clear drawbacks. As an alternative, we propose a reconfigurable wide SIMD processor, which balances efficiency, flexibility, and implementation effort. For input frames of 120 × 45 resolution, our SIMD processor can process a frame within 232 µs, sufficient to provide a throughput of 1000 fps with less than 1 ms latency for the whole visual servoing system. Compared to the reference realization on MicroBlaze, the proposed SIMD processor achieves a 21× performance improvement.
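The timing claim above can be checked with simple arithmetic. The following sketch only restates the reported per-frame processing times (102 µs and 232 µs) as an upper bound on frame rate and as the slack left within a 1000 fps frame period; the function names are illustrative, and the model assumes frames are processed back to back with no other overhead.

```python
def max_frame_rate(frame_time_us: float) -> float:
    """Upper bound on frames per second if processing is the only cost."""
    return 1e6 / frame_time_us


def slack_us(frame_time_us: float, target_fps: float) -> float:
    """Time left in each frame period after vision processing, in microseconds."""
    return 1e6 / target_fps - frame_time_us


fpga_fps = max_frame_rate(102.0)   # FPGA implementation
simd_fps = max_frame_rate(232.0)   # wide SIMD processor
simd_slack = slack_us(232.0, 1000.0)
print(round(fpga_fps), round(simd_fps), simd_slack)
```

Under these assumptions the SIMD processor uses about a quarter of each 1 ms frame period, leaving the remainder for image acquisition, data transfer, and control.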
IEEE Transactions on Circuits and Systems for Video Technology | 2011
Y Yu Pu; Y Yifan He; Z Zhenyu Ye; S Sebastian Moreno Londono; Anteneh A. Abbo; Richard P. Kleihorst; Henk Corporaal
Looking forward to the next generation of mobile streaming computing, the demanded energy efficiency of end-user terminals will become ever more stringent. The Xetal-Pro processor, which is the latest member of the Xetal low-power single-instruction multiple data (SIMD) processor family from Philips, is presented in this paper. The predecessor of Xetal-Pro, known as Xetal-II, already ranks as one of the most computationally efficient [in terms of giga operations per second (GOPS)/Watt] processors available today; however, it cannot yet achieve the demanded energy efficiency (less than 1 pJ per operation). Unlike Xetal-II, Xetal-Pro supports ultrawide supply voltage (Vdd) scaling from the nominal supply to the subthreshold region. Although aggressive Vdd scaling causes severe throughput degradation, this can be partly compensated for by the massive parallelism in the Xetal family. Xetal-II includes a large on-chip frame memory (FM), which does not scale well to an ultralow Vdd and is therefore a big obstacle to increasing energy efficiency. Therefore, we investigate both different FM realizations and memory organization alternatives. A hybrid memory system (HMS), which reduces the non-local memory traffic and enables further Vdd scaling, is proposed. For design space exploration of the right number of scratchpad memory (SM) entries, the corresponding data locality analysis is also provided. Moreover, some unique circuit implementation issues of Xetal-Pro, such as the customized level-shifter, are discussed. Compared to Xetal-II operating at the nominal voltage, Xetal-Pro provides up to a twofold energy efficiency improvement even without Vdd scaling (essentially a consequence of data localization in the SM) when delivering the same ultrahigh throughput. With Vdd scaling into the sub/near threshold region, Xetal-Pro could gain more than a tenfold energy reduction while still delivering a high throughput of 0.69 GOPS (counting multiply and add operations only). The new insight of Xetal-Pro sheds light on the direction of future ultralow-energy SIMD processors.
rapid system prototyping | 2013
Shakith Fernando; Fm Firew Siyoum; Y Yifan He; Akash Kumar; Henk Corporaal
Heterogeneous Multiprocessor System-on-Chips (HMPSoCs) are becoming popular as a means of meeting the energy efficiency requirements of modern embedded systems. However, as these HMPSoCs also run multimedia applications, they need to meet real-time requirements as well. Designing such predictable HMPSoCs is a key challenge, as the current design methods for these platforms are either semi-automated, non-predictable, or limited in heterogeneity. In this paper, we propose a design framework to generate and program HMPSoC designs in a rapid and predictable manner. It takes the application specifications and the architecture model as input and generates the entire HMPSoC, for FPGA prototyping, that meets the throughput constraints. The experimental results show that our framework can provide a conservative bound on the worst-case throughput of the FPGA implementation. We also present results of a case study that computes the area-power trade-offs of an industrial vision application. The entire design space exploration of all configurations was completed in 8 hours. A tool-chain targeting the Xilinx Zynq FPGA is also presented.
compilers, architecture, and synthesis for embedded systems | 2012
D Dongrui She; Y Yifan He; Henk Corporaal
The use of special instructions that execute complex operation patterns is a common approach in application-specific processor design to improve performance and efficiency. However, in an embedded generic processor with a compact instruction set architecture (ISA), such instructions may lead to large overhead, as: i) more bits are needed to encode the extra opcodes and operands, resulting in wider instructions; ii) more register file (RF) ports are required to provide the extra operands to the function units. Such overhead may increase energy consumption considerably. In this paper, we propose to support flexible operation pair patterns in a processor with a compact 24-bit RISC-like ISA using: i) a partially reconfigurable decoder that exploits the locality of patterns to reduce the requirement for opcode space; ii) a software-controlled bypass network to reduce the requirement for operand encoding and RF ports. We also propose an energy-aware compiler backend design for the proposed architecture that performs pattern selection and bypass-aware scheduling to generate energy-efficient code. Though the proposed design imposes extra constraints on the operation patterns, the experimental results show that the average dynamic instruction count is reduced by over 25%, which is only about 2% less than the architecture without such constraints. Due to the low overhead, the total energy of the proposed architecture is reduced by an average of 15.8% compared to the RISC baseline, while the one without constraints achieves almost no energy improvement.
international conference on distributed smart cameras | 2008
Y Yifan He; Zoran Zivkovic; Richard P. Kleihorst; Alexander Danilin; Henk Corporaal
Implementation of the Hough transform (HT) for line detection requires massive computation, large memory space, and high bandwidth. Without parallel processing on a proper platform, it can hardly be implemented in real time, especially with high accuracy on high-resolution images. This paper proposes several efficient methods for implementing the HT on an SIMD (single-instruction, multiple-data) architecture. All lines in an image/frame can be detected in real time with the proposed voting method, the novel Hough space (HS) structures, and the efficient image “rotation” mechanism. With the suggested refinement and tracking approach, we can capture and follow the target lines with very high accuracy. Analysis and comparison are elaborated with a real-time implementation on our wireless smart camera (WiCa) platform, which is a powerful image/video processing platform developed by NXP Semiconductors. Moreover, two different applications, lane detection and gesture control, are also developed on this platform.
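For readers unfamiliar with why HT voting is so compute- and memory-hungry, here is a minimal sketch of the textbook (rho, theta) accumulator in plain Python. It is not the SIMD voting method, Hough space structure, or rotation mechanism proposed in the paper; the function names, the 1-degree angular resolution, and the 1-pixel rho resolution are illustrative assumptions.

```python
import math


def hough_accumulate(points, width, height, n_theta=180):
    """Vote in (rho, theta) space for a list of (x, y) edge pixels."""
    max_rho = int(math.ceil(math.hypot(width, height)))
    # rho lies in [-max_rho, max_rho]; shift by max_rho to get a list index.
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    for x, y in points:
        # Each edge pixel casts one vote per angle: this inner loop is the
        # part that dominates computation and memory bandwidth.
        for t, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1
    return acc, max_rho


def strongest_line(acc, max_rho, n_theta=180):
    """Return (votes, rho, theta_degrees) of the accumulator cell with most votes."""
    votes, r, t = max(
        (acc[r][t], r, t) for r in range(len(acc)) for t in range(n_theta)
    )
    return votes, r - max_rho, 180.0 * t / n_theta


# Ten edge pixels on the horizontal line y = 5 in a 10 x 10 image.
edge_pixels = [(x, 5) for x in range(10)]
acc, max_rho = hough_accumulate(edge_pixels, width=10, height=10)
votes, rho, theta_deg = strongest_line(acc, max_rho)
print(votes, rho, theta_deg)
```

Because neighbouring angle bins can collect the same number of votes for a short line, the reported angle is only accurate to a few degrees here; refinement steps such as the one the abstract mentions address exactly this kind of ambiguity.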
international conference on embedded computer systems architectures modeling and simulation | 2013
Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He
Low energy consumption has become one of the most important topics in computing. With single CPUs consuming as much as 115 W, engineers have been looking for ways to reduce energy consumption while maintaining high computational performance. Wide SIMD architectures are often used to achieve this, exploiting data parallelism to keep the required clock frequency low for a given compute constraint. In this paper, we propose a wide SIMD architecture with an explicit datapath to further optimize energy efficiency without sacrificing computation power. For a detailed comparison, both the proposed wide SIMD architecture and its transparent bypassing counterpart are implemented in HDL and synthesized with a TSMC 40 nm low-power library. The power estimation is derived from actual toggle rates generated by post-synthesis simulation. Our experimental results show that with explicit bypassing the overall energy consumption can be reduced by up to 44% compared to the corresponding SIMD architecture with transparent bypassing.
international conference on embedded computer systems architectures modeling and simulation | 2013
D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal
Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes explicit datapath to achieve high energy efficiency. To efficiently program the proposed architecture with a standard parallel programming language, we introduce a tool flow that can compile and map OpenCL programs onto it. The compiler in the proposed tool flow is able to analyze the static access patterns in OpenCL kernels and generate efficient mapping and code that utilizes the explicit datapath. Experimental results show that the proposed architecture is efficient. In a 128-PE processor, the proposed architecture is able to achieve over 200 times speed-up and reduce the energy consumption of register file and memory by over 90% compared to a RISC processor.
signal processing systems | 2015
Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He
Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.