Bart Mesman
Eindhoven University of Technology
Publications
Featured research published by Bart Mesman.
international conference on computer design | 2013
Maurice Peemen; A. A. A. Setio; Bart Mesman; Henk Corporaal
In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should run locally on embedded platforms with tight performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNNs) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem in the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of this memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex-6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.
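The data-locality idea behind the scheduler can be sketched in plain Python. This is an illustrative toy, not the paper's HLS design flow: the tiled loop visits the output tile by tile so that each input tile can stay in a small on-chip buffer while all of its outputs are computed, instead of being re-fetched from external memory.

```python
# Illustrative sketch of loop tiling for data locality (not the paper's flow).

def conv2d_naive(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            for ky in range(kh):
                for kx in range(kw):
                    out[y][x] += image[y + ky][x + kx] * kernel[ky][kx]
    return out

def conv2d_tiled(image, kernel, tile=4):
    # Same arithmetic as above, but outputs are produced tile by tile,
    # so each image tile is reused while it is "hot" on chip.
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for ty in range(0, oh, tile):
        for tx in range(0, ow, tile):
            for y in range(ty, min(ty + tile, oh)):
                for x in range(tx, min(tx + tile, ow)):
                    acc = 0
                    for ky in range(kh):
                        for kx in range(kw):
                            acc += image[y + ky][x + kx] * kernel[ky][kx]
                    out[y][x] = acc
    return out
```

Both versions compute identical results; only the memory access order differs, which is what the on-chip buffers exploit.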
compilers, architecture, and synthesis for embedded systems | 2003
P. Poplavko; Twan Basten; Marco Bekooij; J. van Meerbergen; Bart Mesman
We consider a dynamic application running on a multiprocessor network-on-chip as a set of independent jobs, each job possibly running on multiple processors. To provide guaranteed quality and performance, the scheduling of jobs, the jobs themselves, and the hardware must be amenable to timing analysis. For a certain class of applications and multiprocessor architectures, we propose exact timing models that effectively co-model both the computation and communication of a job. The models are based on interprocessor communication (IPC) graphs [4]. Our main contribution is a precise model of network-on-chip communication, including buffer models. We use a JPEG-decoder job as an example to demonstrate that our models can be used in practice to derive upper bounds on the job execution time and to reason about optimal buffer sizes.
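The flavor of such an upper-bound computation can be illustrated with a toy model. This is an assumed simplification, not the paper's IPC-graph analysis (real IPC graphs are cyclic and also model buffer capacities): a job is a task DAG with worst-case compute times on nodes and worst-case network-on-chip transfer delays on edges, and the execution-time bound is the longest weighted path.

```python
# Toy upper bound on job execution time via longest path in a task DAG.
# Assumed simplification of IPC-graph analysis, for illustration only.
from collections import defaultdict, deque

def job_execution_bound(comp, edges):
    # comp:  {task: worst-case compute time}
    # edges: [(src, dst, worst-case communication delay)] forming a DAG
    succ = defaultdict(list)
    indeg = {t: 0 for t in comp}
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
    start = {t: 0 for t in comp}
    finish = {}
    ready = deque(t for t in comp if indeg[t] == 0)
    while ready:                      # Kahn-style topological traversal
        u = ready.popleft()
        finish[u] = start[u] + comp[u]
        for v, d in succ[u]:
            # a task can start only after all inputs have arrived
            start[v] = max(start[v], finish[u] + d)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(finish.values())
```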
ACM Transactions on Design Automation of Electronic Systems | 2008
Akash Kumar; Shakith Fernando; Yajun Ha; Bart Mesman; Henk Corporaal
Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies, which are semiautomated, time-consuming, and error-prone. In this article, we present a design methodology to generate multiprocessor systems in a systematic and fully automated way for multiple use-cases. Techniques are presented to merge multiple use-cases into one hardware design to minimize cost and design time, making it well suited for fast design-space exploration (DSE) in MPSoC systems. Heuristics to partition use-cases are also presented such that each partition can fit in an FPGA, and all use-cases can be catered for. The proposed methodology is implemented into a tool for Xilinx FPGAs for evaluation. The tool is also made available online for the benefit of the research community and is used to carry out a DSE case study with multiple use-cases of real-life applications: H.263 and JPEG decoders. The generation of the entire design takes about 100 ms, and the whole DSE was completed in 45 minutes, including FPGA mapping and synthesis. The heuristics used for use-case partitioning reduce the design-exploration time elevenfold in a case study with mobile-phone applications.
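The core of use-case merging can be sketched in a few lines. This is a hedged simplification of the idea, not the tool's actual algorithm: if each use-case is described by how many instances of each resource it needs, a super-set design that supports every use-case takes the per-resource maximum (the resource names below are made up for illustration).

```python
# Hedged sketch of merging multiple use-cases into one super-set design:
# the merged hardware provides the per-resource maximum over all use-cases.

def merge_use_cases(use_cases):
    # use_cases: list of {resource name: required instance count}
    merged = {}
    for uc in use_cases:
        for resource, count in uc.items():
            merged[resource] = max(merged.get(resource, 0), count)
    return merged
```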
general purpose processing on graphics processing units | 2011
Cedric Nugteren; Gert-Jan van den Braak; Henk Corporaal; Bart Mesman
Graphics Processing Units (GPUs) are suitable for highly data parallel algorithms such as image processing, due to their massive parallel processing power. Many image processing applications use the histogramming algorithm, which fills a set of bins according to the frequency of occurrence of pixel values taken from an input image. Histogramming has been mapped on a GPU prior to this work. Although significant research effort has been spent in optimizing the mapping, we show that the performance and performance predictability of existing methods can still be improved. In this paper, we present two novel histogramming methods, both achieving a higher performance and predictability than existing methods. We discuss performance limitations for both novel methods by exploring algorithm trade-offs. Both the novel and the existing histogramming methods are evaluated for performance. The first novel method gives an average performance increase of 33% over existing methods for non-synthetic benchmarks. The second novel method gives an average performance increase of 56% over existing methods and is guaranteed to be fully data-independent. While the second method is specifically designed for newer GPU architectures, the first method is also suitable for older architectures.
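A common building block in parallel histogramming, sketched here sequentially for illustration (this is the generic privatization technique, not the paper's specific GPU kernels): each "thread" fills a private sub-histogram over its chunk of the input, avoiding contention on shared bins, and the copies are reduced afterwards.

```python
# Generic privatized histogramming, simulated sequentially over "threads".
# Illustrative only; not the kernels proposed in the paper.

def histogram_privatized(pixels, bins, threads=4):
    chunk = (len(pixels) + threads - 1) // threads
    local_hists = [[0] * bins for _ in range(threads)]
    for t in range(threads):
        for p in pixels[t * chunk:(t + 1) * chunk]:
            local_hists[t][p] += 1   # private bins: no atomics needed
    # reduction step: sum the per-thread sub-histograms bin by bin
    return [sum(h[b] for h in local_hists) for b in range(bins)]
```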
embedded software | 2010
Ahsan Shabbir; Akash Kumar; Sander Stuijk; Bart Mesman; Henk Corporaal
Future embedded systems demand multiprocessor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design automation challenges are designing systems for these use-cases and fast exploration of software and hardware implementation alternatives with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies, which are semi-automated, time-consuming, and error-prone. In this paper, we present a fully automated design flow to generate communication-assist (CA) based multiprocessor systems (CA-MPSoC). A worst-case performance model of our CA is proposed so that the performance of the CA-based platform can be analyzed before its implementation. The design flow provides performance estimates and timing guarantees for both hard real-time and soft real-time applications, provided the task-to-processor mappings are given by the user. The flow automatically generates a super-set hardware that can be used in all use-cases of the applications. The software for each of these use-cases is also generated, including the configuration of the communication architecture and the interfacing with application tasks. CA-MPSoC has been implemented on Xilinx FPGAs for evaluation. Further, it is made available online for the benefit of the research community, and in this paper it is used for performance analysis of two real-life applications, Sobel and JPEG encoder, executing concurrently. The CA-based platform generated by our design flow records a maximum error of 3.4% between analyzed and measured periods. Our tool can also merge use-cases to generate a super-set hardware, which accelerates the evaluation of these use-cases. In a case study with six applications, the use-case merging results in a speed-up of 18× compared to the case where each use-case is evaluated individually.
international conference on embedded computer systems: architectures, modeling, and simulation | 2011
Yifan He; Dongrui She; Bart Mesman; Henk Corporaal
Transport Triggered Architectures (TTAs) possess many advantages, such as modularity, flexibility, and scalability. As an exposed-datapath architecture, TTAs can effectively reduce register file (RF) pressure in both the number of accesses and the number of RF ports. However, conventional TTAs also have some evident disadvantages, such as relatively low code density, dynamic power wasted on the separate scheduling of source operands, and inefficient support for variable immediate values. In order to preserve the merits of conventional TTAs while solving these issues, we propose MOVE-Pro, a novel low-power, high-code-density TTA architecture. With optimizations at the instruction set architecture (ISA), architecture, circuit, and compiler levels, the low-power potential of TTAs is fully exploited. Moreover, with a much denser code size, TTA performance is improved accordingly. In a head-to-head comparison between a two-issue MOVE-Pro processor and its RISC counterpart, we show that up to 80% of RF accesses can be eliminated, and the reduction in RF power is successfully transferred into total core power savings. Up to 11% reduction in total core power is achieved by our MOVE-Pro processor, while its code density is almost the same as that of its RISC counterpart.
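Why an exposed datapath cuts RF traffic can be shown with a back-of-the-envelope count for the expression (a + b) * c. The numbers below are illustrative assumptions, not measurements from MOVE-Pro: in a RISC, every operand is read from and every result written to the RF, while a TTA can move an FU result straight to the next FU's operand port, bypassing the RF.

```python
# Toy RF-access count for (a + b) * c; illustrative assumptions only.

def rf_accesses_risc():
    # add r3, r1, r2  -> 2 RF reads + 1 RF write
    # mul r4, r3, r0  -> 2 RF reads + 1 RF write
    return (2 + 1) + (2 + 1)

def rf_accesses_tta():
    # a -> add.operand; b -> add.trigger   : 2 RF reads
    # add.result -> mul.operand            : FU-to-FU move, RF bypassed
    # c -> mul.trigger                     : 1 RF read
    # mul.result -> r4                     : 1 RF write
    return 2 + 1 + 1
```

The intermediate value a + b never touches the RF in the TTA schedule, which is the mechanism behind the access reduction the abstract reports.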
embedded software | 2008
Akash Kumar; Bart Mesman; Bart Theelen; Henk Corporaal; Yajun Ha
Modern-day applications require the use of multiprocessor systems for reasons of performance, scalability, and power efficiency. As more and more applications are integrated in a single system, mapping and analyzing them on a multiprocessor platform becomes a multi-dimensional problem. Each possible set of applications that can be concurrently active leads to a different use-case (also referred to as a scenario) that the system has to be verified and tested for. Analyzing the feasibility and resource utilization of all possible use-cases becomes very demanding and often infeasible. Therefore, in this paper, we highlight this issue of being able to analyze applications in isolation while still being able to reason about their overall behavior, also called composability. We make a number of novel observations about how arbitration plays an important role in system behavior. We compare two commonly used arbitration mechanisms and highlight the properties that are important for such analysis. We conclude that neither of these arbitration mechanisms provides the necessary features for analysis. They either suffer from scalability problems or provide unreasonable estimates of performance, leading to waste of resources and/or undesirable performance. We further propose to use a Resource Manager (RM) to ensure applications meet their performance requirements. The basic functionalities of such a component are introduced. A high-level simulation model is developed to study the performance of the RM, and a case study is performed for a system running an H.263 and a JPEG decoder. The case study illustrates at what granularity of control a resource manager can effectively regulate the progress of applications such that they meet their performance requirements.
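The control idea behind such a resource manager can be sketched as follows. This is an assumed simplification, not the paper's simulation model: at each sample point the RM compares each application's achieved progress (e.g., decoded frames) against the progress required by its rate, and suspends applications that are ahead of schedule so the ones behind can catch up.

```python
# Hedged sketch of rate-based progress regulation by a resource manager.

def rm_decisions(progress, required_rate, t):
    # progress:      {app: iterations completed so far}
    # required_rate: {app: iterations required per time unit}
    # t:             current time
    decisions = {}
    for app, done in progress.items():
        target = required_rate[app] * t
        # suspend applications running ahead of their required progress
        decisions[app] = 'suspend' if done > target else 'run'
    return decisions
```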
advanced concepts for intelligent vision systems | 2011
Gert-Jan van den Braak; Cedric Nugteren; Bart Mesman; Henk Corporaal
The Hough transform is a commonly used algorithm to detect lines and other features in images. It is robust to noise and occlusion, but has a large computational cost. This paper introduces two new implementations of the Hough transform for lines on a GPU. One focuses on minimizing processing time, while the other has an input-data independent processing time. Our results show that optimizing the GPU code for speed can achieve a speed-up over naive GPU code of about 10×. The implementation which focuses on processing speed is the faster one for most images, but the implementation which achieves a constant processing time is quicker for about 20% of the images.
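For reference, the textbook Hough transform for lines looks as follows in Python (a minimal sequential version, not the paper's GPU implementations): each edge point votes for every (theta, rho) pair it could lie on, using the normal form rho = x·cos(theta) + y·sin(theta), and peaks in the accumulator correspond to lines.

```python
# Minimal sequential Hough transform for lines; illustrative only.
import math

def hough_lines(points, rho_max, thetas=180):
    # accumulator: thetas rows (1-degree steps), 2*rho_max+1 rho bins
    acc = [[0] * (2 * rho_max + 1) for _ in range(thetas)]
    for x, y in points:
        for t in range(thetas):
            theta = math.pi * t / thetas
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[t][rho + rho_max] += 1   # shift rho to a non-negative index
    return acc
```

Points on the line y = x all share the parameters theta = 135 degrees, rho = 0, so that accumulator cell collects one vote per point.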
embedded systems for real-time multimedia | 2006
Akash Kumar; Bart Mesman; Bart Theelen; Henk Corporaal; Yajun Ha
Increasingly, MPSoC platforms are being developed to meet the rising demands of concurrently executing applications. These systems are often heterogeneous, with dedicated IP blocks and application-domain-specific processors. While a host of research has been done to provide good performance guarantees and to analyze applications for preemptive uniprocessor systems, the field of heterogeneous, non-preemptive MPSoCs is mostly unexplored territory. In this paper, we propose to use a resource manager (RM) to improve the resource utilization of these systems. The basic functionalities of such a component are introduced. A high-level simulation model of such a system is developed to study the performance of the RM, and a case study is performed for a system running an H.263 and a JPEG decoder. The case study illustrates at what control granularity a resource manager can effectively regulate the progress of applications such that they meet their performance requirements.
ACM Transactions on Design Automation of Electronic Systems | 2000
Koen van Eijk; Bart Mesman; Carlos A. Alba Pinto; Qin Zhao; Marco Jan Gerrit Bekooij; Jef L. van Meerbergen; Jochen A. G. Jess
Code generation methods for digital signal processors are increasingly hampered by the combination of tight timing constraints imposed by signal processing applications and resource constraints implied by the processor architecture. In particular, limited resource availability (e.g., registers) poses a problem for traditional methods that perform code generation in separate stages (e.g., scheduling followed by register binding). This separation often results in suboptimality (or even infeasibility) of the generated solutions because it ignores the problem of phase coupling (e.g., since value lifetimes are a result of scheduling, scheduling affects the solution space for register binding). As a result, traditional methods need an increasing amount of help from the programmer (or designer) to arrive at a feasible solution. Because this requires an excessive amount of design time and extensive knowledge of the processor architecture, there is a need for automated techniques that can cope with the different kinds of constraints during scheduling. By exploiting these constraints to prune the schedule search space, the scheduler is often prevented from making a decision that inevitably violates one or more constraints. FACTS is a research tool developed for this purpose. In this paper we elucidate the philosophy and concepts of FACTS and demonstrate them on a number of examples.
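The phase-coupling problem and constraint-based pruning can be illustrated with a toy (FACTS itself is far more sophisticated): value lifetimes depend on the schedule, so among all precedence-respecting orders of a few operations, only those whose peak number of simultaneously live values fits the register file are feasible. A constraint-aware scheduler discards the others instead of discovering the conflict at register-binding time.

```python
# Toy schedule-space filter under a register constraint; illustrative only.
from itertools import permutations

def feasible_schedules(ops, deps, defs, uses, num_regs):
    # deps: [(u, v)] meaning u must be scheduled before v
    # defs: {op: value it defines, or None}
    # uses: {op: list of values it reads}
    ok = []
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        if any(pos[u] > pos[v] for u, v in deps):
            continue                       # precedence violated: prune
        last_use = {}
        for op in ops:                     # lifetimes depend on this order
            for v in uses[op]:
                last_use[v] = max(last_use.get(v, -1), pos[op])
        peak = 0
        for i in range(len(order)):
            # values defined at or before slot i and still needed after it
            live = sum(1 for op in ops if defs[op] is not None
                       and pos[op] <= i and last_use.get(defs[op], -1) > i)
            peak = max(peak, live)
        if peak <= num_regs:               # register constraint satisfied
            ok.append(order)
    return ok
```

With two independent producer/consumer pairs and a single register, only the orders that interleave each producer with its own consumer survive; scheduling both producers first would need two registers at once.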