
Publications


Featured research published by Mcj Maurice Peemen.


International Conference on Computer Design | 2013

Memory-centric accelerator design for Convolutional Neural Networks

Mcj Maurice Peemen; Aaa Setio; B Bart Mesman; Henk Corporaal

In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should run locally on embedded computer platforms with strict performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNNs) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem in the design of efficient accelerators is the limited external memory bandwidth. We show that the effects of this memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.
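The tiling-for-locality idea behind the scheduler can be sketched in plain code. The following is an illustrative software model, not the paper's hardware design flow: a naive 2D convolution loop nest and a tiled variant in which each output tile touches only a small input region, the kind of region a hardware scheduler would stage in an on-chip buffer. All sizes are hypothetical examples.

```python
def conv2d_naive(image, kernel):
    """Plain 2D convolution: the loop nest streams over the whole image,
    so input rows are refetched for every output row that needs them."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = [[0] * (W - K + 1) for _ in range(H - K + 1)]
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            for ky in range(K):
                for kx in range(K):
                    out[y][x] += image[y + ky][x + kx] * kernel[ky][kx]
    return out


def conv2d_tiled(image, kernel, ty=4, tx=4):
    """Same computation, but outputs are produced tile by tile, so each
    (ty + K - 1) x (tx + K - 1) input region could be served from a small
    on-chip scratchpad instead of external memory."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    oh, ow = H - K + 1, W - K + 1
    out = [[0] * ow for _ in range(oh)]
    for y0 in range(0, oh, ty):          # iterate over output tiles
        for x0 in range(0, ow, tx):
            # In hardware, this tile's input footprint would live on-chip.
            for y in range(y0, min(y0 + ty, oh)):
                for x in range(x0, min(x0 + tx, ow)):
                    acc = 0
                    for ky in range(K):
                        for kx in range(K):
                            acc += image[y + ky][x + kx] * kernel[ky][kx]
                    out[y][x] = acc
    return out
```

Both variants compute identical results; only the order of the accesses, and hence the locality, differs.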


Advanced Concepts for Intelligent Vision Systems | 2011

Efficiency optimization of trainable feature extractors for a consumer platform

Mcj Maurice Peemen; B Bart Mesman; Henk Corporaal

This paper proposes an algorithmic optimization of the feature extractors of biologically inspired Convolutional Neural Networks (CNNs). CNNs are successfully used for various visual pattern recognition applications, such as OCR, face detection, and object classification. These applications require complex networks exceeding 100,000 interconnected computational nodes. To reduce the computational complexity, a modified algorithm is proposed; real benchmarks show a 65-83% reduction, with equal or even better recognition accuracy. Exploiting the available parallelism in CNNs is essential to mitigate the computational scaling problems. Therefore, the modified version of the algorithm is implemented and evaluated on a GPU platform to demonstrate its suitability for a cost-effective parallel platform. A speedup of 2.5× with respect to the standard algorithm is achieved.


Design, Automation, and Test in Europe | 2015

Inter-tile reuse optimization applied to bandwidth constrained embedded accelerators

Mcj Maurice Peemen; B Bart Mesman; Henk Corporaal

The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by an accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on the interconnect. In this paper we drastically reduce accelerator communication by exploring computation reordering and local buffer usage. We present a new analytical methodology to optimize nested loops for inter-tile data reuse with loop transformations such as interchange and tiling. We focus on embedded accelerators that can be used in a multi-accelerator System on Chip (SoC), so performance, area, and energy are key in this exploration. 1) On three common embedded applications in the image/video processing domain (demosaicing, block matching, object detection), we show that our methodology reduces data movement by up to 2.1× compared to the best case of intra-tile optimization. 2) We demonstrate that our small accelerators (1-3% of FPGA resources) can boost a simple MicroBlaze soft-core to the performance level of a high-end Intel i7 processor.
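The gain from inter-tile reuse can be illustrated with a simplified transfer count, in the same analytical spirit as the paper's methodology (this is our own toy model, not the authors' cost functions; the tile width and the K-wide convolution window are example values). With intra-tile optimization only, neighbouring tiles refetch the overlapping halo; with inter-tile reuse, the overlap stays in the local buffer and only the new columns cross the external interface.

```python
def transfers_intra_tile(width, tile_w, K):
    """External words fetched when each tile loads its full input
    footprint, including the (K - 1)-wide halo shared with neighbours."""
    n_tiles = -(-width // tile_w)  # ceiling division
    return n_tiles * (tile_w + K - 1)


def transfers_inter_tile(width, tile_w, K):
    """External words fetched when the halo is kept in the local buffer:
    the first tile loads its full footprint, later tiles only new data."""
    n_tiles = -(-width // tile_w)
    return (tile_w + K - 1) + (n_tiles - 1) * tile_w
```

For example, with `width=64`, `tile_w=8`, `K=3`, the intra-tile count is 80 words versus 66 with inter-tile reuse; the gap widens as the halo grows relative to the tile.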


Concurrency and Computation: Practice and Experience | 2015

Demystifying the 16 × 16 thread-block for stencils on the GPU

Siham Tabik; Mcj Maurice Peemen; Nicolás Guil; Henk Corporaal

Stencil computation is of paramount importance in many fields, including image processing, structural biology, and biomedicine. There is a permanent demand for maximizing the performance of stencils on state-of-the-art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is the selection of the thread-block configuration that maximizes overall performance. Usually, programmers look for the optimal thread-block configuration in a reduced space of square configurations, or simply use the best configurations reported in previous works, usually 16 × 16. This paper provides a better understanding of the impact of thread-block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations are within the space that provides: (1) a small number of global memory communications; (2) good shared memory utilization with a small number of conflicts; (3) good streaming multiprocessor utilization; and (4) high efficiency of the threads within a thread-block. The model determines the set of optimal thread-block configurations without the need to execute the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve at least 75% of the throughput of the best configuration, and the space is guaranteed to include the best configurations.
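The flavour of such a pruning model can be sketched as follows. This is our own simplification, not the authors' model: it filters candidate (bx, by) thread-block shapes using constraints in the same spirit, namely full warps, a bounded thread count, and a bounded shared-memory halo overhead. The hardware limits and thresholds below are example values.

```python
WARP = 32                     # threads per warp (example hardware constant)
MAX_THREADS_PER_BLOCK = 1024  # example per-block limit

def candidate_blocks(halo, max_halo_overhead=0.5):
    """Return (bx, by) thread-block shapes whose thread count is a whole
    number of warps and whose shared-memory tile wastes at most
    `max_halo_overhead` of its space on halo cells."""
    shapes = []
    for bx in (16, 32, 64, 128, 256):
        for by in (1, 2, 4, 8, 16, 32):
            threads = bx * by
            if threads > MAX_THREADS_PER_BLOCK or threads % WARP:
                continue  # partial warps waste SIMT lanes
            # The shared tile is (bx + 2*halo) x (by + 2*halo); the halo
            # cells are loaded but produce no output.
            tile = (bx + 2 * halo) * (by + 2 * halo)
            overhead = (tile - bx * by) / tile
            if overhead <= max_halo_overhead:
                shapes.append((bx, by))
    return shapes
```

With `halo=1`, a wide shape such as (32, 8) survives, while a narrow (16, 2) block is pruned because more than half of its shared tile is halo, illustrating why the best configurations are often not square.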


Digital Systems Design | 2016

MacSim: A MAC-Enabled High-Performance Low-Power SIMD Architecture

T Tong Geng; Ljw Luc Waeijen; Mcj Maurice Peemen; Henk Corporaal; Y Yifan He

Single-Instruction-Multiple-Data (SIMD) architectures, which exploit data-level parallelism (DLP), are widely used to achieve high-performance, low-power computing. In most streaming applications, such as CNN-based detection and recognition, color space conversion, and various kinds of filters, multiply-accumulate is one of the most important and most expensive operations. In this paper, we propose a high-performance, low-power SIMD architecture with advanced multiply-accumulator (MAC) support, MacSim, to improve computational efficiency. In addition, a smart loop tiling scheme is proposed. To support this tiling even further, the MAC unit is equipped with multiple accumulator registers. Based on a Design Space Exploration (DSE) of the proposed MAC unit, a MAC instance with four accumulator registers (MAC4reg) is selected as a good choice for the target kernels. A 64-PE (processing element) 16-bit SIMD instance without MAC support is taken as the baseline. For a head-to-head comparison, a 64-PE 16-bit SIMD with MAC4reg (MacSim4) and the baseline SIMD are both implemented in HDL and synthesized with a TSMC 40 nm low-power library. Five streaming application kernels are mapped to both architectures. Our experimental results show that with MAC4reg the runtime and energy consumption are reduced by up to 38% and 42%, respectively. In addition, a 4-layer CNN-based detection application is fully mapped onto the proposed MacSim4. Working at 950 MHz, MacSim4 reaches a throughput of 62.4 GOPS, which meets the requirement of real-time (720p HD, 30 fps) detection. The energy consumption per PE per operation is very low: 4.7 pJ/Op excluding SRAM (Static Random Access Memory) and 4.8 pJ/Op including a 2k-entry SRAM bank. As a prototype, the proposed SIMD is mapped onto an FPGA and can run all the kernels.
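Why multiple accumulator registers help can be shown with a small behavioural sketch (our own illustration of the general principle, not MacSim's datapath): with four accumulators, one streamed pass over the input feeds four running sums at once, so each loaded value is multiplied into several outputs before being discarded, which is exactly the reuse that register tiling exploits.

```python
def mac4_dot_rows(weights, inputs):
    """Compute four dot products that share a single streamed pass over
    `inputs`. `weights` holds four equal-length coefficient rows."""
    acc = [0, 0, 0, 0]                  # four accumulator registers
    for i, x in enumerate(inputs):      # each input value is loaded once...
        for r in range(4):              # ...and reused by all four sums
            acc[r] += weights[r][i] * x
    return acc
```

A single-accumulator MAC would have to stream `inputs` four times for the same result, quadrupling the load traffic.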


Software and Compilers for Embedded Systems | 2015

VLIW Code Generation for a Convolutional Network Accelerator

Mcj Maurice Peemen; W Wisnu Pramadi; B Bart Mesman; Henk Corporaal

This paper presents a compiler flow to map Deep Convolutional Networks (ConvNets) to a highly specialized VLIW accelerator core targeting the low-power embedded market. Earlier works have focused on energy-efficient accelerators for this class of algorithms, but none of them provides a complete and practical programming model. Due to the large parameter set of a ConvNet, it is essential that the user can abstract from the accelerator architecture and does not have to rely on an error-prone and ad hoc assembly programming model. By using modulo scheduling for software pipelining, we demonstrate that our automatically generated code achieves hardware utilization equal to, or within 5-20% of, code written manually by experts. Our compiler removes the huge manual workload of efficiently mapping ConvNets to an energy-efficient core for the next generation of mobile and wearable devices.


Digital Systems Design | 2015

A Locality Aware Convolutional Neural Networks Accelerator

Runbin Shi; Zheng Xu; Zhihao Sun; Mcj Maurice Peemen; A Ang Li; Henk Corporaal; D Di Wu

The advantages of Convolutional Neural Networks (CNNs) over traditional methods for visual pattern recognition have changed the field of machine vision. The main issue hindering broad adoption of this technique is the massive computing workload of a CNN, which prevents real-time implementation on low-power embedded platforms. Recently, several dedicated solutions have been proposed to improve energy efficiency and throughput; nevertheless, the huge amount of data transfer involved in the processing is still a challenging issue. This work proposes a new CNN accelerator exploiting a novel memory access scheme that significantly improves data locality in CNN-related processing. With this scheme, external memory access is reduced by 50% while achieving similar or even better throughput. The accelerator is implemented in 28 nm CMOS technology. Implementation results show that the accelerator achieves a performance of 102 GOp/s at 800 MHz while occupying 0.303 mm² of silicon area. Power simulation shows that the dynamic power of the accelerator is 68 mW. Its flexibility is demonstrated by running various CNN benchmarks.


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2016

A configurable SIMD architecture with explicit datapath for intelligent learning

Y Yifan He; Mcj Maurice Peemen; Ljw Luc Waeijen; Erkan Diken; M Mattia Fiumara; Gerard K. Rauwerda; Henk Corporaal; T Tong Geng

The use of a wide Single-Instruction-Multiple-Data (SIMD) architecture is a promising approach to building energy-efficient, high-performance embedded processors. In this paper, based on our design framework for low-power SIMD processors, we propose a multiply-accumulate (MAC) unit with a variable number of accumulator registers. The proposed MAC unit exploits the merits of both merged operation and register tiling. The Convolutional Neural Network (CNN) is a popular learning-based algorithm due to its flexibility and high accuracy. However, a CNN-based application is often computationally intensive, as it applies convolution operations extensively to a large data set. In this work, a CNN-based intelligent learning application is analyzed and mapped in the context of SIMD architectures. Experimental results show that the proposed architecture is efficient: in a 64-PE instance, the proposed SIMD processor with MAC4reg achieves an effective performance of 63.2 GOPS. Compared to the two baseline SIMD processors without MAC4reg, the proposed design brings 54.0% and 32.1% reductions in execution time, and 20.5% and 35.1% reductions in energy consumption, respectively.


Archive | 2011

Speed sign detection and recognition by convolutional neural networks

Mcj Maurice Peemen; B Bart Mesman; Henk Corporaal


Archive | 2014

A data-reuse aware accelerator for large-scale convolutional networks

Mcj Maurice Peemen; B Bart Mesman; Henk Corporaal

Collaboration


Dive into Mcj Maurice Peemen's collaborations.

Top Co-Authors

Henk Corporaal, Eindhoven University of Technology
B Bart Mesman, Eindhoven University of Technology
Ljw Luc Waeijen, Eindhoven University of Technology
T Tong Geng, Eindhoven University of Technology
Y Yifan He, Eindhoven University of Technology
A Ang Li, Eindhoven University of Technology
Aaa Setio, Eindhoven University of Technology
Erkan Diken, Eindhoven University of Technology