Henk Corporaal
Eindhoven University of Technology
Publications
Featured research published by Henk Corporaal.
design automation conference | 2007
Sander Stuijk; Twan Basten; Marc Geilen; Henk Corporaal
Embedded multimedia systems often run multiple time-constrained applications simultaneously. These systems use multiprocessor systems-on-chip, and it must be guaranteed that enough resources are available for each application to meet its throughput constraints. This requires a task binding and scheduling mechanism that provides timing guarantees for each application independent of other applications, while taking into account the available processor space, memory and communication bandwidth. Synchronous dataflow graphs (SDFGs) are used to model time-constrained multimedia applications. They allow modeling of cyclic, multi-rate dependencies between tasks. However, existing resource allocation techniques can only deal with acyclic and/or single-rate dependencies. Dependencies in an SDFG can be expressed in single-rate form, but then the problem size may increase exponentially, making resource allocation infeasible. This paper presents a new resource allocation strategy which works directly on SDFGs, building on an efficient technique to calculate the throughput of a bound and scheduled SDFG. Experimental results show that the strategy is effective in terms of run-time and allocated resources.
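To make the multi-rate/single-rate distinction concrete, the sketch below shows one common way to represent an SDF graph and derive its repetition vector from the balance equations; the sum of its entries is the number of actors in the equivalent single-rate (HSDF) expansion, which is the blow-up the abstract warns about. This is a generic illustration, not the paper's allocation strategy, and the three-actor graph is invented.

```python
# Minimal SDF graph sketch: derive the repetition vector from the balance
# equations q[src] * prod == q[dst] * cons.  Assumes a connected, consistent
# graph.  Actor names and rates are hypothetical.
from fractions import Fraction
from math import lcm
from collections import defaultdict, deque

def repetition_vector(actors, channels):
    """channels: list of (src, dst, prod_rate, cons_rate)."""
    adj = defaultdict(list)
    for src, dst, p, c in channels:
        adj[src].append((dst, Fraction(p, c)))   # q[dst] = q[src] * p / c
        adj[dst].append((src, Fraction(c, p)))   # q[src] = q[dst] * c / p
    q = {actors[0]: Fraction(1)}
    todo = deque([actors[0]])
    while todo:
        a = todo.popleft()
        for b, ratio in adj[a]:
            if b not in q:
                q[b] = q[a] * ratio
                todo.append(b)
            elif q[b] != q[a] * ratio:
                raise ValueError("inconsistent rates: no valid schedule exists")
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# Hypothetical multi-rate graph: src -> filter -> sink
actors = ["src", "filter", "sink"]
channels = [("src", "filter", 2, 3), ("filter", "sink", 4, 1)]
q = repetition_vector(actors, channels)
print(q)                                   # {'src': 3, 'filter': 2, 'sink': 8}
print(sum(q.values()), "actors in the single-rate expansion")
```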
ACM Transactions on Design Automation of Electronic Systems | 2009
Stefan Valentin Gheorghita; Martin Palkovic; Juan Hamers; Arnout Vandecappelle; Stelios Mamagkakis; Twan Basten; Lieven Eeckhout; Henk Corporaal; Francky Catthoor; Frederik Vandeputte; Koen De Bosschere
In the past decade, real-time embedded systems have become much more complex due to the introduction of much new functionality in a single application, and due to running multiple applications concurrently. This increases the dynamic nature of today's applications and systems, and tightens their constraints in terms of deadlines and energy consumption. State-of-the-art design methodologies try to cope with these novel issues by identifying the most frequently occurring cases and dealing with them separately, reducing the newly introduced complexity. This article presents a generic and systematic design-time/run-time methodology for handling the dynamic nature of modern embedded systems, which can be utilized by existing design methodologies to increase their efficiency. It is based on the concept of system scenarios, which group system behaviors that are similar from a multidimensional cost perspective (such as resource requirements, delay, and energy consumption) in such a way that the system can be configured to exploit this cost similarity. At design time, these scenarios are individually optimized. Mechanisms for predicting the current scenario at run time, and for switching between scenarios, are also derived. This design trajectory is augmented with a run-time calibration mechanism, which allows the system to learn on the fly during its execution and to adapt itself to the current input stimuli by extending the scenario set, changing the scenario definitions, and adapting both the prediction and switching mechanisms. To show the generality of our methodology, we show how it has been applied to four very different real-life design problems. In all presented case studies, substantial energy reductions were obtained by exploiting scenarios.
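The run-time side of such a scenario-based approach can be pictured with a small sketch: predict the scenario from an observable parameter, switch the platform configuration only when the scenario changes, and calibrate the scenario set on the fly when the backup scenario is used too often. The sketch below is purely illustrative (not the authors' framework); the observed parameter, the scenario bounds and the configurations are invented.

```python
# Illustrative run-time scenario manager: predict -> switch -> calibrate.
class ScenarioManager:
    def __init__(self, scenarios, backup):
        # scenarios: list of (upper bound on frame_bits, configuration)
        self.scenarios = sorted(scenarios)
        self.backup = backup              # safe configuration for unseen behavior
        self.current = None
        self.misses = []                  # frames that matched no known scenario

    def predict(self, frame_bits):
        for bound, config in self.scenarios:
            if frame_bits <= bound:
                return config, True
        return self.backup, False

    def process(self, frame_bits, decode):
        config, known = self.predict(frame_bits)
        if config != self.current:        # switch only on a scenario change,
            apply_config(config)          # since switching itself has a cost
            self.current = config
        if not known:
            self.misses.append(frame_bits)
        decode(frame_bits)

    def calibrate(self):
        # On-the-fly extension of the scenario set: if the backup is used
        # often, promote the observed behavior to a new scenario.
        if len(self.misses) > 100:
            self.scenarios.append((max(self.misses), self.backup))
            self.scenarios.sort()
            self.misses.clear()

def apply_config(config):
    print("switching to", config)          # e.g. a DVFS operating point

mgr = ScenarioManager(scenarios=[(5_000, "low-power"), (20_000, "medium")],
                      backup="full-speed")
for bits in (3_000, 4_500, 18_000, 35_000):
    mgr.process(bits, decode=lambda b: None)
mgr.calibrate()
```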
conference on high performance computing (supercomputing) | 1991
Henk Corporaal; Hans Mulder
No abstract available
international conference on computer design | 2013
Maurice Peemen; A. A. A. Setio; Bart Mesman; Henk Corporaal
In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be performed locally on embedded computing platforms, under tight performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNNs) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workloads. The efficiency of the on-chip memories is maximized by our scheduler, which uses tiling to optimize for data locality. Our design flow ensures that the on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated with a High-Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories, the FPGA resources can be reduced by up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used, our accelerators are up to 11× faster.
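The tiling mentioned above can be illustrated with a small sketch that computes a convolutional layer tile by tile, so that only an input window and one output tile form the working set that would have to fit in on-chip buffers. This is a generic illustration of loop tiling for data locality, not the paper's design flow or scheduler; the layer sizes and tile factor are made up.

```python
# Loop tiling for a convolutional layer: each tile iteration touches only a
# small input window plus one output tile (the "on-chip" working set).
import numpy as np

def conv_layer_tiled(inp, weights, tile=8):
    """inp: (C, H, W), weights: (M, C, K, K) -> out: (M, H-K+1, W-K+1)."""
    C, H, W = inp.shape
    M, _, K, _ = weights.shape
    OH, OW = H - K + 1, W - K + 1
    out = np.zeros((M, OH, OW))
    for oy in range(0, OH, tile):            # tile loops over the output map
        for ox in range(0, OW, tile):
            ty, tx = min(tile, OH - oy), min(tile, OW - ox)
            # working set: one input window and one output tile
            window = inp[:, oy:oy + ty + K - 1, ox:ox + tx + K - 1]
            for m in range(M):
                for y in range(ty):
                    for x in range(tx):
                        out[m, oy + y, ox + x] = np.sum(
                            window[:, y:y + K, x:x + K] * weights[m])
    return out

rng = np.random.default_rng(0)
inp = rng.standard_normal((3, 32, 32))
w = rng.standard_normal((4, 3, 5, 5))
ref = conv_layer_tiled(inp, w, tile=32)      # one big tile == untiled result
assert np.allclose(conv_layer_tiled(inp, w, tile=8), ref)
```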
design, automation, and test in europe | 2003
Erik Brockmeyer; Miguel Miranda; Henk Corporaal; Francky Catthoor
Nearly all platforms use a multi-layer memory hierarchy to bridge the enormous latency gap between the large off-chip memories and local register files. However, most of the previous work on hardware- or software-controlled techniques for layer assignment has focused mainly on performance. As a result, the intermediate layers have been given overly large sizes, leading to energy inefficiency. In this paper, we present a technique that takes advantage of both the temporal locality and the limited lifetime of the arrays of the application to minimize energy consumption under layer size constraints. A prototype tool has been developed and tested using two real-life applications of industrial relevance. Following this approach, we have been able to halve the energy consumed by the memory hierarchy for each of our drivers.
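The kind of trade-off described above can be sketched with a simple greedy assignment that places the most intensively accessed arrays in the small, energy-cheap on-chip layer subject to its size limit. This is only an illustration of the energy/size trade-off, not the prototype tool; it ignores the lifetime-based reuse the paper also exploits, and the arrays, sizes and per-access energies are invented.

```python
# Greedy layer assignment by access density (accesses per byte).  Lifetime
# analysis and in-place reuse, which the paper also uses, are not modeled.
LAYER_ENERGY = {"L1": 1.0, "off_chip": 20.0}     # pJ per access (assumed)
L1_CAPACITY = 16 * 1024                          # bytes (assumed)

def assign_layers(arrays):
    """arrays: list of (name, size_bytes, accesses)."""
    order = sorted(arrays, key=lambda a: a[2] / a[1], reverse=True)
    assignment, used = {}, 0
    for name, size, accesses in order:
        if used + size <= L1_CAPACITY:
            assignment[name] = "L1"
            used += size
        else:
            assignment[name] = "off_chip"
    energy = sum(acc * LAYER_ENERGY[assignment[name]]
                 for name, size, acc in arrays)
    return assignment, energy

arrays = [("coeffs", 2 * 1024, 500_000),   # small and hot: belongs on chip
          ("frame", 300 * 1024, 80_000),   # too large for the on-chip layer
          ("lut", 8 * 1024, 200_000)]
assignment, energy = assign_layers(arrays)
print(assignment, f"{energy / 1e6:.1f} uJ")
```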
Code Generation for Embedded Processors | 1996
Henk Corporaal; Jan Hoogerbrugge
Transport triggered architectures (TTAs) form a class of architectures which are programmed by specifying data transports between function units. As a side effect of these data transports, the function units perform operations. Making these data transports visible at the architectural level contributes to the flexibility and scalability of processors. Furthermore, it enables several additional code-scheduling optimizations. These properties make TTAs very suitable for embedded processors.
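This programming style can be pictured with a toy model in which the program is just a list of moves between ports, and an addition happens as a side effect of writing a function unit's trigger port. This is not a real TTA simulator; the function unit, port names and register file are made up.

```python
# Toy move-programmed machine: operations are side effects of transports.
class AddFU:
    """A function unit with an operand port 'o' and a trigger port 't'."""
    def __init__(self):
        self.o = 0
        self.result = 0

    def write(self, port, value):
        if port == "o":
            self.o = value
        elif port == "t":               # writing the trigger port starts the op
            self.result = self.o + value

regs = {"r1": 40, "r2": 2, "r3": 0}
add = AddFU()

# Program = data transports only; the add fires on the second move and the
# third move fetches the result.
program = [("r1", "add.o"), ("r2", "add.t"), ("add.result", "r3")]

def read(src):
    return getattr(add, src.split(".")[1]) if src.startswith("add.") else regs[src]

for src, dst in program:
    value = read(src)
    if dst.startswith("add."):
        add.write(dst.split(".")[1], value)
    else:
        regs[dst] = value

print(regs["r3"])   # 42
```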
IEEE Journal of Solid-state Circuits | 2010
Yu Pu; J. Pineda de Gyvez; Henk Corporaal; Yajun Ha
We present a design technique for (near) subthreshold operation that achieves ultra-low energy dissipation at throughputs of up to 100 MB/s, suitable for digital consumer electronics applications. Our approach employs i) architecture-level parallelism to compensate for throughput degradation, ii) a configurable VT balancer to mitigate the VT mismatch of nMOS and pMOS transistors operating in the sub/near-threshold regime, and iii) a finger-structured parallel transistor that exploits VT mismatch to improve current drivability. Additionally, we describe the selection procedure of the standard cells and how they were modified for higher reliability in the subthreshold regime. All these concepts are demonstrated using SubJPEG, a 1.4 × 1.4 mm² 65 nm CMOS standard-VT multi-standard JPEG co-processor. Measurement results of the discrete cosine transform (DCT) and quantization processing engines, operating in the subthreshold regime, show an energy dissipation of only 0.75 pJ per cycle with a supply voltage of 0.4 V at 2.5 MHz. This leads to an 8.3× energy reduction compared to using a 1.2 V nominal supply. In the near-threshold regime the energy dissipation is 1.0 pJ per cycle with a 0.45 V supply voltage at 4.5 MHz. The system throughput meets the 15 fps, 640 × 480 pixel VGA standard. Our methodology is largely applicable to the design of other sound/graphics and streaming processors.
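As a back-of-the-envelope check of the reported numbers (not taken from the paper), dynamic energy scales roughly with VDD², so scaling the supply from 1.2 V to 0.4 V gives an ideal 9× reduction, close to the measured 8.3× once leakage and overheads are included; the snippet below also computes the pixel rate implied by the VGA throughput claim.

```python
# Back-of-the-envelope check, assuming dynamic energy ~ C * VDD^2 at constant
# switched capacitance; leakage and architectural overhead are ignored.
v_nominal, v_sub = 1.2, 0.4
ideal_reduction = (v_nominal / v_sub) ** 2
print(f"ideal CV^2 reduction: {ideal_reduction:.1f}x")   # 9.0x vs measured 8.3x

# Pixel rate required for 15 fps at 640 x 480 (the VGA claim in the abstract).
pixels_per_s = 15 * 640 * 480
print(f"{pixels_per_s / 1e6:.2f} Mpixel/s")              # ~4.61 Mpixel/s
```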
design, automation, and test in europe | 2007
Akash Kumar; Andreas Hansson; Jos Huisken; Henk Corporaal
Multi-processor systems on chip (MPSoC) platforms are becoming increasingly heterogeneous and are shifting towards a more communication-centric methodology. Networks on chip (NoC) have emerged as the design paradigm for scalable on-chip communication architectures. As system complexity grows, the problem emerges of how to design and instantiate such a NoC-based MPSoC platform in a systematic and automated way. This paper presents an integrated flow to automatically generate a highly configurable NoC-based MPSoC for FPGA instantiation. The system specification is done at a high level of abstraction, relieving the designer of error-prone and time-consuming work. The flow uses the state-of-the-art Æthereal NoC and Silicon Hive processing cores, both configurable at design time and run time. The authors use this flow to generate a range of sample designs whose functionality has been verified on a Celoxica RC300E development board. The board, equipped with a Xilinx Virtex II 6000, also offers a large number of peripherals, whose insertion into the design is automated for easy debugging and prototyping.
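As an illustration of what "system specification at a high level of abstraction" can look like, the snippet below shows a hypothetical platform description that such a flow could consume. This is not the tool's actual input format; the tile names, NoC parameters and bandwidth numbers are invented.

```python
# Hypothetical high-level platform description for an automated MPSoC flow.
platform_spec = {
    "noc": {"type": "aethereal", "topology": "mesh", "size": (2, 2)},
    "tiles": [
        {"name": "proc0", "core": "silicon_hive", "imem_kb": 32, "dmem_kb": 64},
        {"name": "proc1", "core": "silicon_hive", "imem_kb": 32, "dmem_kb": 64},
        {"name": "mem0",  "core": "sram",         "size_kb": 256},
    ],
    "connections": [
        # (source tile, destination tile, guaranteed bandwidth in MB/s)
        ("proc0", "mem0", 100),
        ("proc1", "mem0", 100),
        ("proc0", "proc1", 50),
    ],
    "target": {"board": "RC300E", "fpga": "xc2v6000"},
}

def validate(spec):
    # A generator flow would first check the spec for consistency.
    names = {t["name"] for t in spec["tiles"]}
    for src, dst, _bw in spec["connections"]:
        assert src in names and dst in names, f"unknown tile in {src}->{dst}"

validate(platform_spec)
```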
international symposium on system-on-chip | 2006
Ch. Ykman-Couvreur; Vincent Nollet; F. Catthoor; Henk Corporaal
Since application complexity is growing and applications can be activated dynamically, the major challenge for heterogeneous multi-processor platforms is to select, at run time, an energy-efficient mapping of these applications. Taking into account that many different implementations per application may be available, and that the selection must meet the application deadlines with the available platform resources, this optimization problem can be modeled as a Multi-dimension Multi-choice Knapsack Problem (MMKP), which is NP-hard. Both exact algorithms and state-of-the-art heuristics for real-time systems are still too slow for run-time management of multi-processor platforms. This paper provides a new greedy heuristic for finding near-optimal solutions of the MMKP that is fast enough for the considered environment. The main contributions of this heuristic are: (1) the derivation of the Pareto sets from the initial MMKP to reduce the search space, (2) the sorting of all Pareto points together in a single two-dimensional search space, where (3) a fast greedy algorithm solves the MMKP. Experiments show that our heuristic finds solutions close to those obtained by the fastest state-of-the-art heuristics (within 0% to 0.4% of the solution value), in just a fraction of the execution time (more than 97.5% gain on a StrongARM processor), and can run in less than 1 ms for multi-processor problem sizes.
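The ingredients listed above (per-application Pareto filtering followed by a fast greedy pass) can be sketched as follows. This is not a reimplementation of the paper's heuristic; it is a generic greedy upgrade scheme over invented implementation points, where each application offers several (resource, energy) operating points and one must be chosen per application within a resource budget.

```python
# Sketch: Pareto-filter each application's points, start from the cheapest
# feasible choice, then greedily apply the upgrade with the best energy
# saving per extra resource unit until the budget is exhausted.
def pareto(points):
    """Keep points not dominated in (resource, energy), sorted by resource."""
    front, best_energy = [], float("inf")
    for res, en in sorted(points):
        if en < best_energy:
            front.append((res, en))
            best_energy = en
    return front

def greedy_mmkp(apps, budget):
    fronts = [pareto(p) for p in apps]
    choice = [0] * len(apps)                 # start at the cheapest resource point
    used = sum(f[0][0] for f in fronts)
    if used > budget:
        return None                          # no feasible mapping
    while True:
        best = None
        for i, f in enumerate(fronts):
            if choice[i] + 1 < len(f):
                dr = f[choice[i] + 1][0] - f[choice[i]][0]
                de = f[choice[i]][1] - f[choice[i] + 1][1]
                if used + dr <= budget and (best is None or de / dr > best[0]):
                    best = (de / dr, i, dr)
        if best is None:
            break
        _, i, dr = best
        choice[i] += 1
        used += dr
    energy = sum(fronts[i][c][1] for i, c in enumerate(choice))
    return choice, used, energy

# Two hypothetical applications, each with (resource, energy) implementations.
apps = [[(2, 10), (4, 6), (8, 5)],
        [(3, 12), (5, 9), (6, 20)]]          # (6, 20) is Pareto-dominated
print(greedy_mmkp(apps, budget=9))           # ([1, 1], 9, 15)
```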
ACM Transactions on Design Automation of Electronic Systems | 2008
Akash Kumar; Shakith Fernando; Yajun Ha; B Bart Mesman; Henk Corporaal
Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. The large number of applications in these systems generates an exponential number of use-cases. The key design-automation challenges are designing systems for these use-cases and quickly exploring software and hardware implementation alternatives, with accurate performance evaluation of these use-cases. These challenges cannot be overcome by current design methodologies, which are semi-automated, time-consuming, and error-prone. In this article, we present a design methodology to generate multiprocessor systems in a systematic and fully automated way for multiple use-cases. Techniques are presented to merge multiple use-cases into one hardware design to minimize cost and design time, making it well suited for fast design-space exploration (DSE) in MPSoC systems. Heuristics to partition use-cases are also presented such that each partition can fit in an FPGA and all use-cases are catered for. The proposed methodology is implemented in a tool for Xilinx FPGAs for evaluation. The tool is also made available online for the benefit of the research community and is used to carry out a DSE case study with multiple use-cases of real-life applications: H.263 and JPEG decoders. The generation of the entire design takes about 100 ms, and the whole DSE was completed in 45 minutes, including FPGA mapping and synthesis. The heuristics used for use-case partitioning reduce the design-exploration time elevenfold in a case study with mobile-phone applications.
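The use-case partitioning idea can be sketched with a first-fit-decreasing style heuristic in which use-cases are grouped as long as the merged hardware for the group, approximated here by the union of the resources the use-cases need, still fits in one FPGA. This is not the tool's actual algorithm; the use-case names, resource sets and FPGA capacity are invented.

```python
# First-fit-decreasing partitioning of use-cases under an FPGA capacity limit.
FPGA_CAPACITY = 5          # e.g. number of processor/FIFO slots (assumed)

def partition_use_cases(use_cases):
    """use_cases: dict name -> set of required hardware resources."""
    # Larger use-cases first, so they seed the partitions.
    order = sorted(use_cases, key=lambda u: len(use_cases[u]), reverse=True)
    partitions = []        # list of (merged resource set, [use-case names])
    for name in order:
        needed = use_cases[name]
        for merged, members in partitions:
            if len(merged | needed) <= FPGA_CAPACITY:
                merged |= needed              # merge into an existing design
                members.append(name)
                break
        else:
            if len(needed) > FPGA_CAPACITY:
                raise ValueError(f"{name} does not fit in a single FPGA")
            partitions.append((set(needed), [name]))
    return partitions

use_cases = {
    "jpeg_decode":  {"proc_a", "proc_b", "fifo0"},
    "h263_decode":  {"proc_a", "proc_c", "fifo1", "fifo2"},
    "jpeg+h263":    {"proc_a", "proc_b", "proc_c", "fifo0", "fifo1"},
    "standby":      {"proc_a"},
}
for merged, members in partition_use_cases(use_cases):
    print(members, "->", sorted(merged))
```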