Tom Vander Aa
Katholieke Universiteit Leuven
Publications
Featured research published by Tom Vander Aa.
Field-Programmable Logic and Applications | 2003
Francisco Barat; Murali Jayapala; Tom Vander Aa; Rudy Lauwereins; Geert Deconinck; Henk Corporaal
Current embedded multimedia applications have stringent time and power constraints. Coarse-grained reconfigurable processors have been shown to achieve the required performance, but little research has examined the power consumption of such processors. In this paper, we present a novel coarse-grained reconfigurable processor and study its power consumption using a power model derived from Wattch. Several processor configurations are evaluated using a set of multimedia applications. Results show that the presented coarse-grained processor achieves, on average, 2.5x the performance of a RISC processor with an 18% increase in energy consumption.
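As a rough illustration of what this kind of activity-based power estimation boils down to, the sketch below sums per-unit access counts times per-access energies to obtain a dynamic energy figure. The unit names and energy numbers are invented for the example; they are not taken from the paper or from Wattch.

/* Minimal activity-based energy estimate: total dynamic energy is the sum of
 * per-unit access counts times per-access energy. All figures are illustrative. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double accesses;          /* activity count reported by the simulator */
    double energy_per_access; /* characterized per-access energy, in pJ   */
} unit_activity;

int main(void)
{
    unit_activity units[] = {
        { "reconfigurable array", 1.2e6, 35.0 },
        { "register files",       3.4e6,  4.5 },
        { "instruction memory",   1.0e6, 12.0 },
        { "data memory",          0.8e6, 20.0 },
    };
    double total_pj = 0.0;
    for (size_t i = 0; i < sizeof units / sizeof units[0]; i++) {
        double e = units[i].accesses * units[i].energy_per_access;
        printf("%-22s %12.1f pJ\n", units[i].name, e);
        total_pj += e;
    }
    printf("total dynamic energy  %12.1f pJ\n", total_pj);
    return 0;
}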
Application-Specific Systems, Architectures, and Processors | 2005
Andy Lambrechts; Praveen Raghavan; Anthony Leroy; Guillermo Talavera; Tom Vander Aa; Murali Jayapala; Francky Catthoor; Diederik Verkest; Geert Deconinck; Henk Corporaal; Frédéric Robert; Jordi Carrabina
Users expect future handheld devices to provide extended multimedia functionality and have long battery life. This type of application imposes heavy constraints on performance and power consumption and forces designers to optimize all parts of their platform. Evaluating the overall platform power breakdown is therefore critical to determine where to focus power-optimization efforts. Surprisingly, few studies exist on that topic, and decisions generally rely on common belief. We have performed a complete power breakdown for a realistic platform to identify the major power bottlenecks. This paper presents this power assessment of a realistic heterogeneous network-on-chip platform, including processors, network, and data/instruction memory hierarchy, running a video processing chain from camera to display. Our power breakdown identifies the main bottlenecks in the memory hierarchy and the foreground memory, and shows that the global interconnect is less critical than commonly assumed for a well-optimized application mapping.
Asia and South Pacific Design Automation Conference | 2004
Tom Vander Aa; Murali Jayapala; Francisco Barat; Geert Deconinck; Rudy Lauwereins; Francky Catthoor; Henk Corporaal
For multimedia applications, loop buffering is an efficient mechanism to reduce the power in the instruction memory of embedded processors. In particular, software-controlled clustered loop buffers are energy efficient. However, current compilers for VLIW processors do not fully exploit the potential offered by such a clustered organization. This paper presents an algorithm that explores, for an application or a set of applications, the optimal loop buffer configuration and the optimal way to use it. Results for the MediaBench application suite show an additional 18% reduction (on average) in the energy of the instruction memory hierarchy compared to traditional non-clustered approaches to the loop buffer, without compromising performance.
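The energy advantage of a clustered organization can be pictured with a toy model: a centralized loop buffer fetches the full VLIW instruction word every cycle, whereas with clustered buffers only the clusters whose issue slots are active fetch anything. The cluster count, activity factors, and energy numbers below are assumptions for illustration, not figures from the paper.

/* Toy comparison of centralized vs. clustered loop buffer energy for one loop.
 * active[c] is the fraction of loop cycles in which cluster c issues a useful
 * operation; idle clusters are assumed to be disabled for that cycle. */
#include <stdio.h>

#define NCLUSTERS 4

int main(void)
{
    double iterations      = 1e6;   /* profiled loop iterations               */
    double cycles_per_iter = 8.0;   /* loop body length in cycles             */
    double e_full_fetch    = 16.0;  /* pJ per fetch from a centralized buffer */
    double e_cluster_fetch = 5.0;   /* pJ per fetch from one cluster buffer   */
    double active[NCLUSTERS] = { 1.0, 0.9, 0.4, 0.1 }; /* per-cluster activity */

    double cycles    = iterations * cycles_per_iter;
    double e_central = cycles * e_full_fetch;

    double e_clustered = 0.0;
    for (int c = 0; c < NCLUSTERS; c++)
        e_clustered += cycles * active[c] * e_cluster_fetch;

    printf("centralized buffer : %.3g pJ\n", e_central);
    printf("clustered buffers  : %.3g pJ\n", e_clustered);
    printf("saving             : %.1f %%\n",
           100.0 * (1.0 - e_clustered / e_central));
    return 0;
}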
Languages, Compilers, and Tools for Embedded Systems | 2008
Bjorn De Sutter; Paul Coene; Tom Vander Aa; Bingfeng Mei
DSP architectures often feature multiple register files with sparse connections to a large set of ALUs. For such DSPs, traditional register allocation algorithms suffer from several problems, including a lack of retargetability and phase-ordering issues. This paper studies alternative register allocation techniques based on placement and routing. Different register file models are studied and evaluated on a state-of-the-art coarse-grained reconfigurable array DSP, together with a new post-pass register allocator for rotating register files.
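To give an idea of why rotating register files lend themselves to a post-pass allocator, the sketch below models one in plain C: logical register r maps to physical slot (rrb + r) mod N, and decrementing the base each iteration makes the previous iteration's value visible one logical register higher, so a software-pipelined loop needs no explicit copies between stages. The register count and the two-stage loop are invented for illustration; this models only the register file behaviour the allocator targets, not the paper's allocator itself.

/* Tiny software model of a rotating register file. */
#include <stdio.h>

#define N 8  /* physical registers in the rotating file (illustrative) */

static int phys[N];
static int rrb = 0;  /* rotating register base */

static void rot_write(int logical, int value) { phys[(rrb + logical) % N] = value; }
static int  rot_read(int logical)             { return phys[(rrb + logical) % N]; }
static void rotate(void)                      { rrb = (rrb + N - 1) % N; }

int main(void)
{
    /* pipelined loop: stage A produces into logical r0; stage B consumes the
     * value produced one iteration earlier, now visible as logical r1 */
    for (int i = 0; i < 6; i++) {
        rot_write(0, i * i);                           /* stage A of iteration i    */
        if (i > 0)
            printf("stage B sees %d\n", rot_read(1));  /* value from iteration i-1  */
        rotate();                                      /* advance to next iteration */
    }
    return 0;
}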
Application-Specific Systems, Architectures, and Processors | 2008
Andres Garcia; Mladen Berekovic; Tom Vander Aa
Coarse-grained reconfigurable architectures are emerging as potential candidates to meet the high performance, power efficiency, and flexibility needed by embedded systems. ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) and its DRESC compiler offer a very promising platform for designing embedded systems targeted at different application domains. We present a procedure for mapping the widely used AES cryptographic algorithm onto ADRES. Each optimization performed to better exploit instruction- and loop-level parallelism is explained in detail. A new intrinsic function set is proposed to speed up the processing of the AES algorithm. The simulation results are compared with experiments on the well-known Texas Instruments TI C64x DSP, which is considered state-of-the-art for embedded systems. Our results show that ADRES outperforms the TI C64x DSP, executing the AES algorithm in one quarter of the cycles.
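To illustrate the flavor of such intrinsics (the paper's actual intrinsic set is not reproduced here), the sketch below models a hypothetical fused operation, subbytes_addkey32, that applies the AES S-box to four packed state bytes and XORs in a round-key word in a single step; on an ADRES-like architecture a compiler could map an intrinsic like this to one custom operation instead of a sequence of byte extracts, table lookups, and XORs. Only the first four S-box entries are filled in, so the output is illustrative rather than real AES.

/* Hypothetical composite intrinsic, modeled as a plain C function so the
 * example compiles anywhere. */
#include <stdint.h>
#include <stdio.h>

/* illustrative 4-entry excerpt; a real AES S-box has 256 entries */
static const uint8_t SBOX[256] = { 0x63, 0x7c, 0x77, 0x7b /* ... */ };

/* fused operation: SubBytes on four packed state bytes, then AddRoundKey */
static inline uint32_t subbytes_addkey32(uint32_t state, uint32_t rk)
{
    uint32_t out = 0;
    for (int b = 0; b < 4; b++) {
        uint8_t s = (uint8_t)(state >> (8 * b));
        out |= (uint32_t)SBOX[s] << (8 * b);
    }
    return out ^ rk;
}

int main(void)
{
    uint32_t state = 0x03020100u, round_key = 0xdeadbeefu;
    printf("fused result: 0x%08x\n", subbytes_addkey32(state, round_key));
    return 0;
}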
Symposium on Application Specific Processors | 2011
Tom Vander Aa; Martin Palkovic; Matthias Hartmann; Praveen Raghavan; Antoine Dejonghe; Liesbet Van der Perre
The throughput of wireless communication standards keeps increasing, and the computation requirements for systems implementing those standards increase even more. On battery-operated devices, a low-power implementation is crucial in addition to high performance, and this can only be reached by exploiting parallelism at all levels. The ADRES processor is an embedded coarse-grained reconfigurable baseband processor that already exploits Data Level Parallelism (DLP) and Instruction Level Parallelism (ILP) efficiently. In this paper we present extensions to ADRES that also exploit Task Level Parallelism (TLP) efficiently. We show how we reduce the communication and synchronization overhead between tasks and demonstrate this on a mapping of the 802.11n 300 Mbps standard.
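One common way to keep inter-task communication and synchronization overhead low is a lock-free single-producer/single-consumer FIFO between pipeline stages. Whether the ADRES extensions use exactly this mechanism is not stated here, so the ring buffer below is only a generic sketch of the idea.

/* Generic lock-free single-producer/single-consumer ring buffer: the producer
 * only writes `head`, the consumer only writes `tail`, so no lock is needed. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CAP 8  /* must be a power of two */

typedef struct {
    int buf[CAP];
    atomic_uint head;  /* written by producer */
    atomic_uint tail;  /* written by consumer */
} spsc_fifo;

static bool fifo_push(spsc_fifo *f, int v)
{
    unsigned h = atomic_load_explicit(&f->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (h - t == CAP)
        return false;              /* full */
    f->buf[h % CAP] = v;
    atomic_store_explicit(&f->head, h + 1, memory_order_release);
    return true;
}

static bool fifo_pop(spsc_fifo *f, int *v)
{
    unsigned t = atomic_load_explicit(&f->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&f->head, memory_order_acquire);
    if (t == h)
        return false;              /* empty */
    *v = f->buf[t % CAP];
    atomic_store_explicit(&f->tail, t + 1, memory_order_release);
    return true;
}

int main(void)
{
    spsc_fifo f = { .head = 0, .tail = 0 };
    for (int i = 0; i < 5; i++)
        fifo_push(&f, i);          /* producer task, e.g. demapper output */
    int v;
    while (fifo_pop(&f, &v))       /* consumer task, e.g. decoder input   */
        printf("consumed %d\n", v);
    return 0;
}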
Power and Timing Modeling, Optimization and Simulation | 2005
Tom Vander Aa; Murali Jayapala; Francisco Barat; Geert Deconinck; Rudy Lauwereins; Henk Corporaal; Francky Catthoor
For multimedia applications, loop buffering is an efficient mechanism to reduce the power in the instruction memory of embedded processors. In particular, software-controlled loop buffers are energy efficient. However, current compilers do not fully exploit the possibilities such loop buffers offer. This paper presents an algorithm that explores, for an application or a set of applications, the optimal loop buffer configuration and the optimal way to use it. Results for the MediaBench application suite show an additional 35% reduction (on average) in energy in the instruction memory hierarchy compared to traditional approaches to the loop buffer, without any performance penalty.
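A stripped-down version of the kind of exploration described here could look as follows: for every candidate loop buffer size, loops that fit execute from the loop buffer and the rest from the level-1 instruction memory, and the size with the lowest total energy wins. The loop profile and per-access energies are invented for the example; the paper's actual cost model and search are more involved.

/* Pick the loop buffer size that minimizes instruction-memory energy for a
 * profiled set of loops. Loops whose body fits in the buffer fetch from it;
 * the others fall back to the L1 instruction memory. Figures are illustrative. */
#include <stdio.h>

typedef struct { int body_insns; double exec_insns; } loop_profile;

int main(void)
{
    loop_profile loops[] = {         /* body size, dynamic instruction count */
        {  24, 4.0e6 }, {  64, 2.5e6 }, { 150, 1.0e6 },
    };
    int    nloops    = sizeof loops / sizeof loops[0];
    int    sizes[]   = { 0, 32, 64, 128, 256 };   /* candidate buffer sizes */
    double e_l1      = 20.0;                      /* pJ per L1 fetch        */
    double e_lb_base = 2.0;                       /* pJ per loop-buffer fetch
                                                     at the smallest size   */

    double best_energy = -1.0;
    int    best_size   = 0;
    for (int s = 0; s < (int)(sizeof sizes / sizeof sizes[0]); s++) {
        /* crude model: per-access energy grows with buffer size */
        double e_lb  = e_lb_base * (1.0 + sizes[s] / 128.0);
        double total = 0.0;
        for (int l = 0; l < nloops; l++) {
            int fits = sizes[s] > 0 && loops[l].body_insns <= sizes[s];
            total += loops[l].exec_insns * (fits ? e_lb : e_l1);
        }
        printf("size %3d: %.3g pJ\n", sizes[s], total);
        if (best_energy < 0.0 || total < best_energy) {
            best_energy = total;
            best_size   = sizes[s];
        }
    }
    printf("best loop buffer size: %d instructions\n", best_size);
    return 0;
}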
Real-Time Systems Symposium | 2004
Andy Lambrechts; Tom Vander Aa; Murali Jayapala; Guillermo Talavera; Anthony Leroy; Adelina Shickova; Francisco Barat; Bingfeng Mei; Francky Catthoor; Diederik Verkest; Geert Deconinck; Henk Corporaal; Frédéric Robert; Jordi Carrabina Bordoll
Users expect future handheld devices to provide extended multimedia functionality and have long battery life. This type of application imposes heavy constraints on both (real-time) performance and energy consumption and forces designers to optimise all parts of their platform. In this experiment we focus on the different processor core design options for embedded platforms, including the effect of the instruction memory hierarchy on energy consumption. The results show that significant improvements in energy efficiency and/or performance over currently used RISC or VLIW processors can be achieved. We conclude, based on concrete data for a realistic application, that different processor styles, including both configurable hardware and instruction-set processors, will find their way into heterogeneous platforms, and that designers need to be aware of the trade-offs. Secondly, we show for the same application task that a heavily optimised instruction/configuration memory hierarchy can significantly reduce the energy consumption of this part, making it a crucial element of every energy-aware design.
Signal Processing Systems | 2010
Bjorn De Sutter; Osman Allam; Praveen Raghavan; Roeland Vandebriel; Hans Cappelle; Tom Vander Aa; Bingfeng Mei
This paper presents a memory organization for SDR inner modem baseband processors that focus on exploiting ILP. This memory organization uses power-efficient, single-ported, interleaved scratch-pad memory banks to provide enough bandwidth to a high-ILP processor. A system of queues in the memory interface is used to resolve bank conflicts among the single-ported banks and to spread long bursts of conflicting accesses to the same bank over time. Bank address rotation is used to spread long bursts of conflicting accesses over multiple banks. All proposed techniques have been implemented in hardware and are evaluated for a number of different wireless communication standards. For the 802.11a/n benchmarks, the overhead of stall cycles resulting from unresolved bank conflicts is reduced to below 2% with the proposed organization. For 3GPP-LTE, the most demanding wireless standard we evaluated, the overhead is reduced to less than 0.13%. This is achieved with little energy and area overhead, and without any bank-aware compiler support.
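The bank address rotation mentioned above can be illustrated with a small example. The formula below is a generic skewed-interleaving scheme assumed for illustration, not necessarily the paper's hardware: instead of bank = word % NBANKS, the bank index is rotated by the row number, so a strided access pattern that would repeatedly hit one bank is spread across all of them.

/* Compare plain interleaving with bank address rotation for a strided access
 * pattern. With NBANKS = 4 and stride 4, plain interleaving sends every access
 * to bank 0; rotating the bank index by the row number spreads the accesses. */
#include <stdio.h>

#define NBANKS 4

static int bank_plain(unsigned word)   { return word % NBANKS; }
static int bank_rotated(unsigned word) { return (word + word / NBANKS) % NBANKS; }

int main(void)
{
    printf("word  plain  rotated\n");
    for (unsigned w = 0; w < 8 * NBANKS; w += NBANKS)   /* stride = NBANKS */
        printf("%4u  %5d  %7d\n", w, bank_plain(w), bank_rotated(w));
    return 0;
}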
ACM Transactions on Design Automation of Electronic Systems | 2015
Namita Sharma; Preeti Ranjan Panda; Francky Catthoor; Praveen Raghavan; Tom Vander Aa
Optimizations related to memory accesses and data storage make a significant difference to the performance and energy of a wide range of data-intensive applications. These techniques need to evolve with modern architectures supporting wide memory accesses. We investigate array interleaving, a data layout transformation technique that achieves energy efficiency by combining the storage of data elements from multiple arrays in contiguous locations, in an attempt to exploit spatial locality. The transformation reduces the number of memory accesses by loading the right set of data into vector registers, thereby minimizing redundant memory fetches. We perform a global analysis of array accesses, and account for possibly different array behavior in different loop nests that might ultimately lead to changes in data layout decisions for the same array across program regions. Our technique relies on detailed estimates of the savings due to interleaving, and also the cost of performing the actual data layout modifications. We also account for the vector register widths and the possibility of choosing the appropriate granularity for interleaving. Experiments on several benchmarks show a 6-34% reduction in memory energy due to the strategy.
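The transformation itself can be pictured with a small example (hypothetical arrays, not one of the paper's benchmarks): two arrays that are always accessed together at the same index are merged into one interleaved layout, so a single wide load can bring both elements into a vector register instead of two separate fetches.

/* Array interleaving: a[] and b[] are accessed together in every iteration,
 * so storing them as {a, b} pairs puts both operands of one iteration in
 * adjacent memory locations, allowing a single wide/vector load per iteration. */
#include <stdio.h>

#define N 8

/* original layout: two separate arrays, two memory streams */
static float a[N], b[N];

/* interleaved layout: one array of pairs, one memory stream */
typedef struct { float a, b; } pair;
static pair ab[N];

int main(void)
{
    for (int i = 0; i < N; i++) {          /* set up identical data         */
        a[i] = (float)i;      b[i] = 2.0f * i;
        ab[i].a = (float)i;   ab[i].b = 2.0f * i;
    }

    float sum_split = 0.0f, sum_inter = 0.0f;
    for (int i = 0; i < N; i++)            /* two fetches per iteration     */
        sum_split += a[i] * b[i];
    for (int i = 0; i < N; i++)            /* one contiguous fetch per pair */
        sum_inter += ab[i].a * ab[i].b;

    printf("split: %.1f  interleaved: %.1f\n", sum_split, sum_inter);
    return 0;
}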