Ljw Luc Waeijen

Eindhoven University of Technology

Publication

Featured research published by Ljw Luc Waeijen.

international conference on embedded computer systems architectures modeling and simulation | 2013

SIMD made explicit

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

Low energy consumption has become one of the most important topics in computing. With single CPUs consuming as much as 115 watts, engineers have been looking for ways to reduce energy consumption while maintaining high computational performance. Often wide SIMD architectures are used to achieve this, exploiting data parallelism to keep the required clock frequency low for a given compute constraint. In this paper, we propose a wide SIMD architecture with an explicit datapath to further optimize energy efficiency without sacrificing computation power. To have a detailed comparison, both the proposed wide SIMD architecture and its transparent bypassing counterpart are implemented in HDL and synthesized with a TSMC 40nm low power library. The power estimation is derived from actual toggle rates generated by post-synthesis simulation. Our experimental results show that with explicit bypassing the overall energy consumption can be reduced by up to 44% compared to the corresponding SIMD architecture with transparent bypassing.

international conference on embedded computer systems architectures modeling and simulation | 2013

OpenCL code generation for low energy wide SIMD architectures with explicit datapath

D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal

Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes explicit datapath to achieve high energy efficiency. To efficiently program the proposed architecture with a standard parallel programming language, we introduce a tool flow that can compile and map OpenCL programs onto it. The compiler in the proposed tool flow is able to analyze the static access patterns in OpenCL kernels and generate efficient mapping and code that utilizes the explicit datapath. Experimental results show that the proposed architecture is efficient. In a 128-PE processor, the proposed architecture is able to achieve over 200 times speed-up and reduce the energy consumption of register file and memory by over 90% compared to a RISC processor.

signal processing systems | 2015

A Low-Energy Wide SIMD Architecture with Explicit Datapath

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.
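The register file accesses that bypassing avoids can be illustrated with a toy model. The sketch below is not the architecture or compiler from the paper: the instruction format, the function name count_rf_reads, and the bypass_depth parameter are all assumptions made for illustration. It counts how many source operands were produced by one of the last few instructions and could therefore be taken from a bypass path instead of being read from the register file.

```python
def count_rf_reads(program, bypass_depth=1):
    """Count source reads and how many could be served by bypassing.

    program is a list of (dst, srcs) tuples; srcs is a tuple of register names.
    A source counts as bypassable when one of the previous bypass_depth
    instructions wrote it (a simplified, hypothetical model).
    Returns (total_reads, bypassed_reads).
    """
    total_reads = 0
    bypassed_reads = 0
    for idx, (_dst, srcs) in enumerate(program):
        # Registers written by the most recent instructions are still
        # available on the (modelled) bypass network.
        recent = {dst for dst, _ in program[max(0, idx - bypass_depth):idx]}
        for src in srcs:
            total_reads += 1
            if src in recent:
                bypassed_reads += 1
    return total_reads, bypassed_reads


# A short dependent chain: each instruction consumes the previous result.
example_program = [
    ("r1", ("r0", "r0")),
    ("r2", ("r1", "r0")),
    ("r3", ("r2", "r1")),
]
total, bypassed = count_rf_reads(example_program, bypass_depth=1)
```

With explicit bypassing, the decision is made in the instruction stream by the compiler rather than detected by hardware at run time; the count above only shows how many reads are candidates under this simplified model.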

international conference on embedded computer systems architectures modeling and simulation | 2016

Coarse grained reconfigurable architectures in the past 25 years: Overview and classification

M Mark Wijtvliet; Ljw Luc Waeijen; Henk Corporaal

Reconfigurable architectures are becoming more popular now that general purpose compute performance no longer increases as rapidly as before. Field programmable gate arrays are slowly moving in the direction of Coarse Grained Reconfigurable Architectures (CGRAs) by adding DSP and other coarse grained IP blocks, while general purpose processors are becoming more heterogeneous and include sub-word parallelism and even some reconfigurable logic. In the past 25 years, several CGRAs have been published. In this paper, an overview and classification of these architectures is presented. This work also provides a clear definition of CGRAs and identifies topics for future research which are key to unlocking the full potential of CGRAs.

digital systems design | 2016

MacSim: A MAC-Enabled High-Performance Low-Power SIMD Architecture

T Tong Geng; Ljw Luc Waeijen; Mcj Maurice Peemen; Henk Corporaal; Y Yifan He

Single-Instruction-Multiple-Data (SIMD) architectures, which exploit data-level parallelism (DLP), are widely used to achieve high-performance and low-power computing. In most streaming applications, such as CNN-based detection and recognition, color space conversion and various kinds of filters, multiply-accumulate is one of the most important and expensive operations to be executed. In this paper, we propose a high-performance low-power SIMD architecture with advanced multiply accumulator (MAC) support (MacSim) to improve the computational efficiency. In addition, a smart loop tiling scheme is proposed. To support this tiling even further, the MAC unit is equipped with multiple accumulator registers. According to the Design Space Exploration (DSE) of the proposed MAC unit, a MAC instance with four accumulator registers (MAC4reg) is selected as a good choice for target kernels. In this paper, a 64-PE (processing element) 16-bit SIMD instance without MAC support is taken as the baseline. For a head-to-head comparison, a 64-PE 16-bit SIMD with MAC4reg (MacSim4) and the baseline SIMD are both implemented in HDL and synthesized with a TSMC 40nm low-power library. Five streaming application kernels are mapped to both architectures. Our experimental results show that with MAC4reg the runtime and energy consumption are reduced by up to 38% and 42%, respectively. In addition, a 4-layer CNN-based detection application is fully mapped onto the proposed MacSim4. Working at 950MHz, MacSim4 reaches a throughput of 62.4 GOPS, which meets the requirement of real-time (720P HD, 30fps) detection. The energy consumption per PE per operation is very low: 4.7pJ/Op excluding SRAM (Static Random Access Memory) and 4.8pJ/Op including a 2k-entry SRAM bank. As a prototype, the proposed SIMD is mapped onto an FPGA and can run all the kernels.
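The interplay between loop tiling and multiple accumulator registers can be sketched in software. The abstract does not describe the tiling scheme in detail, so the following is a generic register-tiling example and not the MacSim scheme: the function name fir_tiled and the num_acc parameter are assumptions. Each tile keeps num_acc partial sums live, so a filter coefficient is fetched once per tile and reused for every accumulator.

```python
def fir_tiled(signal, coeffs, num_acc=4):
    """1D filter (valid region only) with outputs tiled over num_acc accumulators.

    Returns (outputs, coeff_loads), where coeff_loads counts how often a
    coefficient has to be fetched in this simplified model.
    """
    taps = len(coeffs)
    n_out = len(signal) - taps + 1
    outputs = []
    coeff_loads = 0
    for base in range(0, n_out, num_acc):
        tile = min(num_acc, n_out - base)
        acc = [0] * tile  # models the accumulator registers of the MAC unit
        for j, c in enumerate(coeffs):
            coeff_loads += 1  # one fetch serves the whole tile
            for t in range(tile):
                acc[t] += c * signal[base + t + j]  # one multiply-accumulate
        outputs.extend(acc)
    return outputs, coeff_loads


samples = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]
tiled_out, tiled_loads = fir_tiled(samples, kernel, num_acc=4)
plain_out, plain_loads = fir_tiled(samples, kernel, num_acc=1)
```

Both calls produce the same outputs; only the number of coefficient fetches differs, which is the reuse that additional accumulator registers make possible in this model.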

signal processing systems | 2015

A Co-Design Framework with OpenCL Support for Low-Energy Wide SIMD Processor

D Dongrui She; Y Yifan He; Ljw Luc Waeijen; Henk Corporaal

Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.

digital systems design | 2016

Code Generation for Reconfigurable Explicit Datapath Architectures with LLVM

M Michaël Adriaansen; M Mark Wijtvliet; R Roel Jordans; Ljw Luc Waeijen; Henk Corporaal

Good tool support is essential for computing platforms because it increases the programmability of the platform. This is especially the case for reconfigurable architectures because an application needs to be mapped on the architecture for each configuration individually. This paper investigates how the LLVM framework can be used to generate code for a Coarse Grained Reconfigurable Array (CGRA). A CGRA compiler must be retargetable to support all possible architecture configurations. The explicit bypassing capabilities of the hardware should be utilized. Utilizing the hardware features requires the compiler to support software pipelining, multiple register files, and operation-based scheduling. This paper evaluates the potential of the LLVM framework and identifies missing features for the support of reconfigurable explicit datapath architectures.

design automation conference | 2014

Reduction Operator for Wide-SIMDs Reconsidered

Ljw Luc Waeijen; D Dongrui She; Henk Corporaal; Y Yifan He

It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. Many applications require reduction, such as partial histogram merging, matrix multiplication, and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach offers much better generality and flexibility at no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
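To make the cost of reduction on a minimal interconnect concrete, here is a small simulation. It does not reproduce either of the two algorithms proposed in the paper, which the abstract does not describe. It is a textbook log-step reduction under the assumption that each PE can only read from its immediate neighbour, so a shift over a distance of s takes s hops. The names shift_left and reduce_neighbour_only are hypothetical.

```python
from operator import add


def shift_left(values, fill):
    """One neighbour hop: every PE takes the value of its right neighbour."""
    return values[1:] + [fill]


def reduce_neighbour_only(values, op=add, identity=0):
    """Reduce all PE values into PE 0 using only single-neighbour hops.

    op must be associative and identity must be its neutral element.
    Returns (result, hops), where hops counts the neighbour transfers.
    """
    acc = list(values)
    hops = 0
    stride = 1
    while stride < len(acc):
        moved = list(acc)
        for _ in range(stride):  # a shift by 'stride' costs 'stride' hops
            moved = shift_left(moved, identity)
            hops += 1
        # Every PE combines its partial result with the shifted one.
        acc = [op(a, b) for a, b in zip(acc, moved)]
        stride *= 2
    return acc[0], hops


total, hops = reduce_neighbour_only(list(range(1, 9)))
largest, _ = reduce_neighbour_only([3, 1, 4, 1, 5], op=max, identity=float("-inf"))
```

The number of combine steps grows logarithmically with the number of PEs, but in this model the hop count grows linearly, which is the communication cost that reduction schemes for neighbour-only interconnects have to manage.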

international conference on embedded computer systems architectures modeling and simulation | 2016

A configurable SIMD architecture with explicit datapath for intelligent learning

Y Yifan He; Mcj Maurice Peemen; Ljw Luc Waeijen; Erkan Diken; M Mattia Fiumara; Gerard K. Rauwerda; Henk Corporaal; T Tong Geng

The use of a wide Single-Instruction-Multiple-Data (SIMD) architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, based on our design framework for low-power SIMD processors, we propose a multiply-accumulate (MAC) unit with a variable number of accumulator registers. The proposed MAC unit exploits the merits of both merged operation and register tiling. A Convolutional Neural Network (CNN) is a popular learning-based algorithm due to its flexibility and high accuracy. However, a CNN-based application is often computationally intensive as it applies convolution operations extensively on a large data set. In this work, a CNN-based intelligent learning application is analyzed and mapped in the context of SIMD architectures. Experimental results show that the proposed architecture is efficient. In a 64-PE instance, the proposed SIMD processor with MAC4reg achieves an effective performance of 63.2 GOPS. Compared to the two baseline SIMD processors without MAC4reg, the proposed design brings 54.0% and 32.1% reduction in execution time, and 20.5% and 35.1% reduction in energy consumption, respectively.

digital systems design | 2016

Multi-granular Arithmetic in a Coarse-Grain Reconfigurable Architecture

Spa Stefan Louwers; Ljw Luc Waeijen; M Mark Wijtvliet; Rpj Ruud Koolen; Henk Corporaal

Mismatch between operand width and hardware operation width is a source of energy inefficiency. This work proposes multi-granular arithmetic, which can adapt the hardware operation width to the application, preventing energy from being wasted. In particular, multi-granular arithmetic in the context of coarse-grain reconfigurable architectures is considered for the operations of addition, accumulation, multiplication, and multiply-accumulation. Using a silicon synthesis tool flow, it is shown that the multi-granular designs can perform narrow-width operations, e.g. an 8-by-8 multiplication, much more efficiently than standard full-width circuits. For multiplication, the required energy is reduced by up to a factor of 15 under realistic conditions when compared to a full-width 32x32 multiplier.
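The idea of matching operation width to operand width can be illustrated with a software model of a multiplier built from narrow blocks. This is a sketch under assumptions, not the circuit from the paper: the function name multi_granular_multiply, the 8-bit block size, and the block count as a stand-in for hardware activity are all illustrative choices. A wide unsigned product is composed from 8-by-8 partial products, and the model reports how many such blocks are active for a given operand width.

```python
def multi_granular_multiply(a, b, operand_bits, block_bits=8):
    """Unsigned multiply composed from block_bits x block_bits partial products.

    Returns (product, active_blocks). active_blocks is only a rough proxy
    for hardware activity; it is not an energy figure.
    """
    if operand_bits % block_bits != 0:
        raise ValueError("operand_bits must be a multiple of block_bits")
    limit = 1 << operand_bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands do not fit in operand_bits")

    blocks = operand_bits // block_bits
    mask = (1 << block_bits) - 1
    product = 0
    active_blocks = 0
    for i in range(blocks):
        a_part = (a >> (i * block_bits)) & mask
        for j in range(blocks):
            b_part = (b >> (j * block_bits)) & mask
            # Each narrow partial product is shifted to its weight and summed.
            product += (a_part * b_part) << ((i + j) * block_bits)
            active_blocks += 1
    return product, active_blocks


narrow = multi_granular_multiply(200, 100, operand_bits=8)
wide = multi_granular_multiply(0xDEADBEEF, 0x12345678, operand_bits=32)
```

In this model an 8-by-8 multiplication activates a single block, whereas a full 32x32 multiplication activates sixteen, which is the kind of mismatch the abstract identifies as a source of wasted energy.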
