Is this you? Create Your Porfile

D Dongrui She

Eindhoven University of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where D Dongrui She is active.

Explore More

Publication

Featured researches published by D Dongrui She.

international conference on embedded computer systems: architectures, modeling, and simulation | 2011

MOVE-Pro: A low power and high code density TTA architecture

Y Yifan He; D Dongrui She; B Bart Mesman; Henk Corporaal

Transport Triggered Architectures (TTAs) possess many advantageous, such as modularity, flexibility, and scalability. As an exposed datapath architecture, TTAs can effectively reduce the register file (RF) pressure in both number of accesses and number of RF ports. However, the conventional TTAs also have some evident disadvantages, such as relative low code density, dynamic-power wasting due to separate scheduling of source operands, and inefficient support for variant immediate values. In order to preserve the merit of conventional TTAs, while solving these aforementioned issues, we propose, MOVE-Pro, a novel low power and high code density TTA architecture. With optimizations at instruction set architecture (ISA), architecture, circuit, and compiler levels, the low-power potential of TTAs is fully exploited. Moreover, with a much denser code size, TTAs performance is also improved accordingly. In a head-to-head comparison between a two-issue MOVE-Pro processor and its RISC counterpart, we shown that up to 80% of RF accesses can be reduced, and the reduction in RF power is successfully transferred to the total core power saving. Up to 11% reduction of the total core power is achieved by our MOVE-Pro processor, while the code density is almost the same as its RISC counterpart.

design, automation, and test in europe | 2012

Scheduling for register file energy minimization in explicit datapath architectures

D Dongrui She; Y Yifan He; B Bart Mesman; Henk Corporaal

In modern processor architectures, the register file (RF) consumes considerable amount of the processor power. It is well known that by allowing software to have explicit fine-grained control over the datapath, the transport-triggered architectures (TTAs) can substantially reduce the RF traffic, thereby minimizing the RF energy. However, it is important to make sure that the gain in RF is not cancelled out by the overhead due to the fine-grained datapath control, in particular, the deterioration of code density in conventional TTAs. In this paper, we analyze the potential of minimizing RF energy in MOVE-Pro, a TTA-based processor framework. We present a flexible compiler backend, which performs energy-aware instruction scheduling to push the limit of RF energy reduction. The experimental results show that with the proposed energy-aware compiler backend, MOVE-Pro is able to significantly reduce RF energy compared to its RISC/VLIW counterparts, by up to 80%. Meanwhile the code density of MOVE-Pro remains at the same level as its RISC/VLIW counterparts, allowing the energy saving in RF to be successfully transferred to total energy saving.

advanced concepts for intelligent vision systems | 2011

Feasibility analysis of ultra high frame rate visual servoing on FPGA and SIMD processor

Y Yifan He; Z Zhenyu Ye; D Dongrui She; B Bart Mesman; Henk Corporaal

Visual servoing has been proven to obtain better performance than mechanical encoders for position acquisition. However, the often computationally intensive vision algorithms and the ever growing demands for higher frame rate make its realization very challenging. This work performs a case study on a typical industrial application, organic light emitting diode (OLED) screen printing, and demonstrates the feasibility of achieving ultra high frame rate visual servoing applications on both field programmable gate array (FPGA) and single instruction multiple data (SIMD) processors. We optimize the existing vision processing algorithm and propose a scalable FPGA implementation, which processes a frame within 102 µs. Though a dedicated FPGA implementation is extremely efficient, lack of flexibility and considerable amount of implementation time are two of its clear drawbacks. As an alternative, we propose a reconfigurable wide SIMD processor, which balances among efficiency, flexibility, and implementation effort. For input frames of 120 × 45 resolution, our SIMD can process a frame within 232 µs, sufficient to provide a throughput of 1000fps with less than 1ms latency for the whole vision servoing system. Compared to the reference realization on MicroBlaze, the proposed SIMD processor achieves a 21× performance improvement.

compilers, architecture, and synthesis for embedded systems | 2012

Energy efficient special instruction support in an embedded processor with compact isa

D Dongrui She; Y Yifan He; Henk Corporaal

The use of special instructions that execute complex operation patterns is a common approach in application specific processor design to improve performance and efficiency. However, in an embedded generic processor with compact instruction set architecture (ISA), such instructions may lead to large overhead as: i) more bits are needed to encode the extra opcodes and operands, resulting in wider instructions; ii) more register file (RF) ports are required to provide the extra operands to the function units. Such overhead may increase energy consumption considerably. In this paper, we propose to support flexible operation pair patterns in a processor with a compact 24-bit RISC-like ISA using: i) a partially reconfigurable decoder that exploits the locality of patterns to reduce the requirement for opcode space; ii) a software controlled bypass network to reduce the requirement for operand encoding and RF ports. We also propose an energy-aware compiler backend design for the proposed architecture that performs pattern selection and bypass-aware scheduling to generate energy efficient codes. Though proposed design imposes extra constraints on the operation patterns, the experimental results show that the average dynamic instruction count is reduced by over 25%, which is only about 2% less than the architecture without such constraints. Due to the low overhead, the total energy of the proposed architecture reduces by an average of 15.8% compared to the RISC baseline, while the one without constraints achieves almost no energy improvement.

international conference on embedded computer systems architectures modeling and simulation | 2013

SIMD made explicit

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

Low energy consumption has become one of the most important topics in computing. With single CPUs consuming as much as 115 Watt, engineers have been looking for ways to reduce energy consumption while maintaining high computational performance. Often wide SIMD architectures are used to achieve this, exploiting data parallelism to keep the required clock frequency low for a given compute constraint. In this paper, we propose a wide SIMD architecture with explicit datapath to further optimize energy efficiency without sacrificing computation power. To have a detailed comparison, both the proposed wide SIMD architecture and its transparent bypassing counterpart are implemented in HDL and synthesized with a TSMC 40nm low power library. The power estimation is derived from actual toggle rates generated by post-synthesis simulation. Our experimental results show that with explicit bypassing the overall energy consumption can be reduced up to 44% compared to the corresponding SIMD architecture with transparent bypassing.

international conference on embedded computer systems architectures modeling and simulation | 2013

OpenCL code generation for low energy wide SIMD architectures with explicit datapath

D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal

Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes explicit datapath to achieve high energy efficiency. To efficiently program the proposed architecture with a standard parallel programming language, we introduce a tool flow that can compile and map OpenCL programs onto it. The compiler in the proposed tool flow is able to analyze the static access patterns in OpenCL kernels and generate efficient mapping and code that utilizes the explicit datapath. Experimental results show that the proposed architecture is efficient. In a 128-PE processor, the proposed architecture is able to achieve over 200 times speed-up and reduce the energy consumption of register file and memory by over 90% compared to a RISC processor.

signal processing systems | 2015

A Low-Energy Wide SIMD Architecture with Explicit Datapath

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.

signal processing systems | 2015

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal

Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.

ACM Transactions on Architecture and Code Optimization | 2013

An energy-efficient method of supporting flexible special instructions in an embedded processor with compact ISA

D Dongrui She; Y Yifan He; Henk Corporaal

In application-specific processor design, a common approach to improve performance and efficiency is to use special instructions that execute complex operation patterns. However, in a generic embedded processor with compact Instruction Set Architecture (ISA), these special instructions may lead to large overhead such as: (i) more bits are needed to encode the extra opcodes and operands, resulting in wider instructions; (ii) more Register File (RF) ports are required to provide the extra operands to the function units. Such overhead may increase energy consumption considerably. In this article, we propose to support flexible operation pair patterns in a processor with a compact 24-bit RISC-like ISA using: (i) a partially reconfigurable decoder that exploits the pattern locality to reduce opcode space requirement; (ii) a software-controlled bypass network to reduce operand encoding bit and RF port requirement. An energy-aware compiler backend is designed for the proposed architecture that performs pattern selection and bypass-aware scheduling to generate energy-efficient codes. Though the proposed design imposes extra constraints on the operation patterns, the experimental results show that for benchmark applications from different domains, the average dynamic instruction count is reduced by over 25%, which is only about 2% less than the architecture without such constraints. The proposed architecture reduces total energy by an average of 15.8% compared to the RISC baseline, while the one without constraints achieves almost no improvement due to its high overhead. When high performance is required, the proposed architecture is able to achieve a speedup of 13.8% with 13.1% energy reduction compared to the baseline by introducing multicycle SFU operations.

design automation conference | 2014

Reduction Operator for Wide-SIMDs Reconsidered

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. There are many applications that require reduction such as: partial histogram merging, matrix multiplication and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach has much better generality, flexibility and no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.

Explore More