Ljw Luc Waeijen
Eindhoven University of Technology
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Ljw Luc Waeijen.
international conference on embedded computer systems architectures modeling and simulation | 2013
Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He
Low energy consumption has become one of the most important topics in computing. With single CPUs consuming as much as 115 Watt, engineers have been looking for ways to reduce energy consumption while maintaining high computational performance. Often wide SIMD architectures are used to achieve this, exploiting data parallelism to keep the required clock frequency low for a given compute constraint. In this paper, we propose a wide SIMD architecture with explicit datapath to further optimize energy efficiency without sacrificing computation power. To have a detailed comparison, both the proposed wide SIMD architecture and its transparent bypassing counterpart are implemented in HDL and synthesized with a TSMC 40nm low power library. The power estimation is derived from actual toggle rates generated by post-synthesis simulation. Our experimental results show that with explicit bypassing the overall energy consumption can be reduced up to 44% compared to the corresponding SIMD architecture with transparent bypassing.
international conference on embedded computer systems architectures modeling and simulation | 2013
D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal
Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes explicit datapath to achieve high energy efficiency. To efficiently program the proposed architecture with a standard parallel programming language, we introduce a tool flow that can compile and map OpenCL programs onto it. The compiler in the proposed tool flow is able to analyze the static access patterns in OpenCL kernels and generate efficient mapping and code that utilizes the explicit datapath. Experimental results show that the proposed architecture is efficient. In a 128-PE processor, the proposed architecture is able to achieve over 200 times speed-up and reduce the energy consumption of register file and memory by over 90% compared to a RISC processor.
signal processing systems | 2015
Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He
Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.
international conference on embedded computer systems architectures modeling and simulation | 2016
M Mark Wijtvliet; Ljw Luc Waeijen; Henk Corporaal
Reconfigurable architectures become more popular now general purpose compute performance does not increase as rapidly as before. Field programmable gate arrays are slowly moving into the direction of Coarse Grain Reconfigurable Architectures (CGRA) by adding DSP and other coarse grained IP blocks, general purpose processors become more heterogeneous and include sub-word parallelism and even some reconfigurable logic. In the past 25 years, several CGRAs have been published. In this paper an overview and classification of these architectures is presented. This work also provides a clear definition of CGRAs and identifies topics for future research which are key to unlock the full potential of CGRAs.
digital systems design | 2016
T Tong Geng; Ljw Luc Waeijen; Mcj Maurice Peemen; Henk Corporaal; Y Yifan He
Single-Instruction-Multiple-Data (SIMD) architectures, which exploit data-level parallelism (DLP), are widely used to achieve high-performance and low-power computing. In most of streaming applications, such as CNN-based detection and recognition, color space conversion and various kinds of filters, multiply-accumulate is one of the most important and expensive operations to be executed. In this paper, we propose a high-performance low-power SIMD architecture with advanced multiply accumulator (MAC) support (MacSim) to improve the computational efficiency. In addition, a smart loop tiling scheme is proposed. To support this tiling even further, the MAC unit is equipped with multiple accumulator registers. According to the Design Space Exploration (DSE) of the proposed MAC unit, a MAC instance with four accumulator registers (MAC4reg) is selected as a good choice for target kernels. In this paper, a 64-PE 16-bit (processing element) SIMD instance without MAC support is taken as the baseline. For a head-to-head comparison, a 64-PE 16-bit SIMD with MAC4reg (MacSim4) and the baseline SIMD are all implemented in HDL and synthesized with a TSMC 40nm low-power library. Five streaming application kernels are mapped to both architectures. Our experimental results show with MAC4reg the runtime and energy consumption are reduced up to 38% and 42% respectively. Besides, a 4-layer CNN-based detection application is also fully mapped onto the proposed MacSim4. Working at 950MHz, MacSim4 reaches a throughput of 62.4 GOPS, which meets the requirement of real-time (720P HD, 30fps) detection. The energy consumption per PE per operation is very low, 4.7pJ/Op excluding SRAM (Static Random Access Memory) and 4.8pJ/Op including a 2k-entry SRAM bank. As a prototype, the proposed SIMD is mapped into an FPGA and can run all the kernels.
signal processing systems | 2015
D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal
Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.
digital systems design | 2016
M Michaël Adriaansen; M Mark Wijtvliet; R Roel Jordans; Ljw Luc Waeijen; Henk Corporaal
Good tool support is essential for computing platforms because they increase the programmability of the platform. This is especially the case for reconfigurable architectures because an application needs to be mapped on the architecture for each configuration individually. This paper investigates how the LLVM framework can be used to generate code for a Coarse Grained Reconfigurable Array (CGRA). A CGRA compiler must be retargetable to support all possible architecture configurations. The explicit bypassing capabilities of the hardware should be utilized. Utilizing the hardware features requires the compiler to support software pipelining, multiple register files and operation based scheduling. This paper evaluates the potential of the LLVM framework and identifies missing features for the support of reconfigurable explicit datapath architectures.
design automation conference | 2014
Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He
It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. There are many applications that require reduction such as: partial histogram merging, matrix multiplication and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach has much better generality, flexibility and no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
international conference on embedded computer systems architectures modeling and simulation | 2016
Y Yifan He; Mcj Maurice Peemen; Ljw Luc Waeijen; Erkan Diken; M Mattia Fiumara; Gerard K. Rauwerda; Henk Corporaal; T Tong Geng
The use of a wide Single-Instruction-Multiple-Data (SIMD) architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, based on our design framework for low-power SIMD processors, we propose a multiply-accumulate (MAC) unit with variable number of accumulator registers. The proposed MAC unit exploits both the merits of merged operation and register tiling. A Convolutional Neural Network (CNN) is a popular learning based algorithm due to its flexibility and high accuracy. However, a CNN-based application is often computationally intensive as it applies convolution operations extensively on a large data set. In this work, a CNN-based intelligent learning application is analyzed and mapped in the context of SIMD architectures. Experimental results show that the proposed architecture is efficient. In a 64-PE instance, the proposed SIMD processor with MAC4reg achieves an effective performance of 63.2 GOPS. Compared to the two baseline SIMD processors without MAC4reg, the proposed design brings 54.0% and 32.1% reduction in execution time, and 20.5% and 35.1% reduction in energy consumption respectively.
digital systems design | 2016
Spa Stefan Louwers; Ljw Luc Waeijen; M Mark Wijtvliet; Rpj Ruud Koolen; Henk Corporaal
Mismatch between operand width and hardware operation width is a source of energy inefficiency. This work proposes multi-granular arithmetic, which can adapt the hardware operation width to the application, preventing energy being wasted. In particular multi-granular arithmetic in the context of coarse-grain reconfigurable architectures is considered for the operations of addition, accumulation, multiplication, and multiply-accumulation. Using a silicon synthesis-toolflow it is shown that the multi-granular designs can perform narrow width operations, e.g. an 8-by-8 multiplication, much more efficiently than standard full-width circuits. For multiplication the required energy is reduced by up to 15 times under realistic conditions when compared to a full-width 32x32 multiplier.