Lars Bauer
Karlsruhe Institute of Technology
Publication
Featured research published by Lars Bauer.
design automation conference | 2013
Jörg Henkel; Lars Bauer; Nikil D. Dutt; Puneet Gupta; Sani R. Nassif; Muhammad Shafique; Mehdi Baradaran Tahoori; Norbert Wehn
Reliability concerns due to technology scaling have been a major focus of researchers and designers for several technology nodes. Therefore, many new techniques for enhancing and optimizing reliability have emerged, particularly within the last five to ten years. This perspective paper introduces the most prominent reliability concerns from today's point of view and briefly recapitulates the progress in the community so far. The focus of this paper is on perspective trends, from both the industrial and the academic points of view, that suggest a way of coping with reliability challenges in upcoming technology nodes.
international conference on hardware/software codesign and system synthesis | 2011
Sebastian Kobbe; Lars Bauer; Daniel Lohmann; Jörg Henkel
The trend towards many-core systems comes with various issues, among them their highly dynamic and non-predictable workloads. Hence, new paradigms for managing the resources of many-core systems are of paramount importance. The problem of resource management, e.g. mapping applications to processor cores, is NP-hard though, requiring heuristics especially when performed online. In this paper, we therefore present a novel resource-management scheme that supports so-called malleable applications. These applications can adapt their level of parallelism to the assigned resources. By design, our decentralized scheme is scalable and copes with the computational complexity by focusing on local decision-making. Our simulations show that the mapping quality of our approach stays close to that of state-of-the-art (i.e. centralized) online schemes for malleable applications, but at a reduced overall communication overhead (only about 12.75% on a 1024-core system with a total workload of 32 multi-threaded applications). In addition, our approach is scalable as opposed to a centralized scheme and is therefore practically useful for employment in large many-core systems, as our extensive studies and experiments show.
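As a rough illustration of the local decision-making idea (not the authors' algorithm), the following Python sketch lets a single cluster redistribute its cores among malleable applications by greedily maximizing the marginal speedup gain; the speedup model and all numbers are assumptions made for this example.

# Illustrative sketch (not the authors' code): local, per-cluster core allocation
# for malleable applications. Each cluster redistributes only its own cores,
# so decisions need no global state; speedup(app, n) is a placeholder model.

def speedup(app, n):
    # Hypothetical speedup model for a malleable application:
    # diminishing returns as more cores are assigned (Amdahl-style).
    serial = app["serial_fraction"]
    return 1.0 / (serial + (1.0 - serial) / max(n, 1))

def allocate_cluster(apps, cores_in_cluster):
    """Greedy local decision-making: hand out cores one at a time to the
    application with the largest marginal speedup gain."""
    alloc = {app["name"]: 1 for app in apps}          # every app gets one core
    free = cores_in_cluster - len(apps)
    for _ in range(max(free, 0)):
        best = max(apps, key=lambda a: speedup(a, alloc[a["name"]] + 1)
                                       - speedup(a, alloc[a["name"]]))
        alloc[best["name"]] += 1
    return alloc

# Example: one cluster with 16 cores and three malleable applications.
apps = [{"name": "fft",    "serial_fraction": 0.05},
        {"name": "filter", "serial_fraction": 0.20},
        {"name": "codec",  "serial_fraction": 0.10}]
print(allocate_cluster(apps, 16))

Because every cluster only reasons about its own applications and cores, no global state has to be gathered, which is the property that keeps the communication overhead low in a decentralized scheme.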
asia and south pacific design automation conference | 2012
Jörg Henkel; Andreas Herkersdorf; Lars Bauer; Thomas Wild; Michael Hübner; Ravi Kumar Pujari; Artjom Grudnitsky; Jan Heisswolf; Aurang Zaib; Benjamin Vogel; Vahid Lari; Sebastian Kobbe
This paper introduces a scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasive-specific hardware assist functions is analytically shown and their integration into the overall manycore environment is described.
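For readers unfamiliar with the paradigm, the following Python sketch mimics the invade/infect/retreat idiom commonly associated with invasive computing; the Claim and ResourceManager classes are hypothetical stand-ins and do not represent the platform's actual runtime or OS interfaces.

# Conceptual sketch of the invasive computing idiom (invade / infect / retreat),
# not the actual API of the platform described above. Claim and ResourceManager
# are hypothetical stand-ins for the distributed runtime services.

class Claim:
    """A set of resources (here: core IDs) temporarily owned by an application."""
    def __init__(self, cores):
        self.cores = cores

class ResourceManager:
    def __init__(self, total_cores):
        self.free = set(range(total_cores))

    def invade(self, requested):
        """Claim up to 'requested' cores; may return fewer if the system is loaded."""
        granted = {self.free.pop() for _ in range(min(requested, len(self.free)))}
        return Claim(granted)

    def infect(self, claim, kernel, data):
        """Run a kernel on every core of the claim (sequentially in this sketch)."""
        return [kernel(core, chunk) for core, chunk in zip(sorted(claim.cores), data)]

    def retreat(self, claim):
        """Release the claimed cores back to the system."""
        self.free |= claim.cores
        claim.cores = set()

# Usage: invade 4 cores, run a kernel on them, then retreat.
rm = ResourceManager(total_cores=16)
claim = rm.invade(4)
results = rm.infect(claim, kernel=lambda core, x: x * x, data=[1, 2, 3, 4])
rm.retreat(claim)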
design, automation, and test in europe | 2010
Ralf Koenig; Lars Bauer; Timo Stripf; Muhammad Shafique; Waheed Ahmed; Juergen Becker; Jörg Henkel
Facing the requirements of next-generation applications, current approaches to embedded systems design will soon hit the limit where they may no longer perform efficiently. The unpredictable nature and diverse processing behavior of future applications require going beyond tailor-made, application-/domain-specific embedded system designs. As a consequence, next-generation architectures for embedded systems have to react much more flexibly to unforeseeable run-time scenarios. In this paper we present our innovative processor architecture concept KAHRISMA (KArlsruhes Hypermorphic Reconfigurable-Instruction-Set Multi-grained-Array). It tightly integrates coarse- and fine-grained run-time reconfigurable fabrics that can cooperate to realize hardware acceleration for computationally complex algorithms. Furthermore, the fabrics can be combined to realize different Instruction Set Architectures that may execute in parallel. With the help of an encrypted H.264 en-/decoding case study, we demonstrate that our novel KAHRISMA architecture delivers the flexibility required to design future-proof embedded systems that are not limited to a certain computational domain.
design automation conference | 2008
Lars Bauer; Muhammad Shafique; Jörg Henkel
We present a new concept of an application-specific processor that is capable of transmuting its instruction set according to non-predictive application behavior during run time. In such scenarios, current (extensible) embedded processors are less efficient since they are not run-time adaptive. We have identified the instruction set selection to be a critical step to perform at run time and hence focus this paper on that crucial part. Our paradigm conducts as many steps as possible at compile/design time and as little as necessary at run time, with the constraint of providing sufficient flexibility to react to non-predictive application behavior efficiently. We provide an in-depth analysis of our scheme and achieve a speedup of up to 7.19× (average: 3.63×) compared to state-of-the-art adaptive approaches (like [19]). As an application, we have employed a whole H.264 video encoder, though our scheme is in principle applicable to many other embedded applications. Our results are evaluated by an implementation of the instruction set selection for our transmutable processor on an FPGA platform.
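To make the idea of a run-time instruction set selection concrete, the sketch below applies a generic benefit-per-area greedy heuristic to profiled Special Instruction candidates; it is not the selection algorithm of the paper, and all candidate names and numbers are invented.

# Illustrative sketch of a run-time Special Instruction (SI) selection step:
# greedily pick the SIs with the best profiled speedup per reconfigurable area
# until the fabric budget is exhausted. This is a generic benefit/cost heuristic,
# not the selection algorithm of the paper; all numbers are made up.

def select_special_instructions(candidates, area_budget):
    """candidates: list of (name, expected_cycles_saved, area_cost)."""
    chosen, used = [], 0
    # Rank by cycles saved per unit of reconfigurable area.
    for name, saved, area in sorted(candidates,
                                    key=lambda c: c[1] / c[2], reverse=True):
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen

# Example: hypothetical SI candidates profiled for the current application phase.
candidates = [("sad_16x16", 120_000, 4),   # motion-estimation SAD
              ("dct_4x4",    80_000, 3),
              ("ipred_hdc",  30_000, 2),
              ("cavlc_scan", 15_000, 2)]
print(select_special_instructions(candidates, area_budget=7))

Because only the lightweight ranking and packing step runs online, while profiling data and SI variants are prepared at compile/design time, such a split keeps the run-time overhead small.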
design, automation, and test in europe | 2010
Muhammad Shafique; Lars Bauer; Jörg Henkel
The limited energy resources in portable multimedia devices require a reduction of encoding complexity. The complex Motion Estimation (ME) scheme of H.264/MPEG-4 AVC accounts for a major part of the encoder energy. In this paper we present a Run-Time Adaptive Predictive Energy Budgeting (enBudget) scheme for energy-aware ME that predicts the energy budget for different video frames and different Macroblocks (MBs) in an adaptive manner, considering the run-time changing scenarios of available energy, video frame characteristics, and user-defined coding constraints, while maintaining good video quality. It assigns different Energy-Quality Classes to different video frames and fine-tunes them at MB level depending upon the predicted energy quota in order to cope with the above-mentioned run-time unpredictable scenarios. Compared to UMHexagonS, EPZS, and FastME, our enBudget scheme for energy-aware ME achieves an energy saving of up to 93%, 90%, and 88% (average 88%, 77%, 66%), respectively. It suffers from an average Peak Signal to Noise Ratio (PSNR) loss of 0.29 dB compared to Full Search. We also demonstrate that enBudget is equally beneficial to various other state-of-the-art fast adaptive ME schemes (e.g.). We have evaluated our scheme for ASIC and various FPGAs.
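The following sketch shows, in simplified form, how a per-frame energy budget could be derived from the remaining energy and mapped to an Energy-Quality Class; the thresholds and the complexity weighting are illustrative assumptions, not the enBudget formulas.

# Illustrative sketch of adaptive per-frame energy budgeting for motion
# estimation: split the remaining energy over the remaining frames, weight it
# by a simple motion/complexity estimate, and map the budget to a quality class.
# The class thresholds and the complexity weighting are invented for
# illustration; they are not the enBudget formulas.

def frame_energy_budget(remaining_energy, remaining_frames, complexity_weight):
    """Baseline share scaled by how 'difficult' the frame is expected to be."""
    baseline = remaining_energy / max(remaining_frames, 1)
    return baseline * complexity_weight

def energy_quality_class(budget, nominal_me_energy):
    """Map the budget to a coarse Energy-Quality Class for the ME search."""
    ratio = budget / nominal_me_energy
    if ratio >= 1.0:
        return "HIGH"     # full search range, all prediction modes
    if ratio >= 0.5:
        return "MEDIUM"   # reduced search range
    return "LOW"          # predictors-only, early termination

# Example: 2 J of battery left for 100 frames, a high-motion frame (weight 1.4),
# and a nominal ME cost of 25 mJ per frame.
budget = frame_energy_budget(2.0, 100, 1.4)          # -> 0.028 J
print(budget, energy_quality_class(budget, 0.025))   # -> HIGH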
design automation conference | 2014
Jörg Henkel; Lars Bauer; Hongyan Zhang; Semeen Rehman; Muhammad Shafique
We show in this paper that multi-layer dependability is an indispensable way to cope with the increasing amount of technology-induced dependability problems that threaten further scaling. We introduce the definition of multi-layer dependability and present our design flow within this paradigm, which seamlessly integrates techniques starting at the circuit layer all the way up to the application layer and thereby accounts for ASIC-based architectures as well as for reconfigurable architectures. Finally, we give evidence that the paradigm of multi-layer dependability bears a large potential for significantly increasing dependability at reasonable effort.
field-programmable logic and applications | 2008
Lars Bauer; Muhammad Shafique; Jörg Henkel
Processors with a reconfigurable instruction set combine the performance of dedicated application accelerators with a flexibility that goes beyond that of traditional application-specific instruction set processors (ASIPs). The latter are optimized for certain application domains and thus typically do not provide high performance and/or efficiency when deployed in other domains. State-of-the-art reconfigurable processors, on the other hand, still use the concept of monolithic Special Instructions (SIs, i.e. the application accelerators). In our work, we instead present modular SIs as a hierarchy of elementary data paths and different SI implementations that facilitate a high flexibility and performance. This is a novel concept that achieves a speedup of 26.6x compared to a general purpose processor and 1.24x compared to a state-of-the-art reconfigurable processor (that is statically optimized for the predetermined benchmark situation) when executing an H.264 video encoder. We introduce a novel infrastructure for computation and communication that actually enables the implementation of modular SIs and offers various parameters to match specific requirements. The infrastructure is implemented and tested on an FPGA-based prototype to demonstrate its feasibility.
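A possible way to picture modular SIs as a data structure is sketched below: each Special Instruction is composed of elementary data paths and offers several implementation variants that trade fabric area for latency. The classes and numbers are invented for illustration and do not reflect the paper's actual infrastructure.

# Illustrative data-structure sketch of "modular" Special Instructions: an SI is
# described as a set of elementary data paths, and each SI has several
# implementation variants that trade reconfigurable area for latency. The names
# and numbers are invented; this is not the paper's actual infrastructure.

from dataclasses import dataclass

@dataclass
class DataPath:
    name: str
    area_slices: int      # reconfigurable fabric cost of this elementary data path

@dataclass
class SIImplementation:
    data_path_counts: dict   # how many instances of each DataPath this variant uses
    latency_cycles: int

@dataclass
class SpecialInstruction:
    name: str
    implementations: list

    def fastest_fitting(self, data_paths, free_slices):
        """Pick the fastest implementation whose data paths fit the free fabric."""
        area = {dp.name: dp.area_slices for dp in data_paths}
        fitting = [impl for impl in self.implementations
                   if sum(area[n] * c for n, c in impl.data_path_counts.items())
                      <= free_slices]
        return min(fitting, key=lambda impl: impl.latency_cycles, default=None)

# Example: a SAD-based SI with a small (slow) and a large (fast) variant.
dps = [DataPath("sad_row", 2), DataPath("adder_tree", 1)]
si = SpecialInstruction("sad_16x16", [
    SIImplementation({"sad_row": 1, "adder_tree": 1}, latency_cycles=16),
    SIImplementation({"sad_row": 4, "adder_tree": 2}, latency_cycles=5),
])
print(si.fastest_fitting(dps, free_slices=6))   # small variant fits; the fast one needs 10 slices

The key point of such a hierarchy is that the same elementary data paths can be shared and recombined across SIs, instead of reconfiguring one monolithic accelerator per SI.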
signal processing systems | 2010
Muhammad Shafique; Lars Bauer; Jörg Henkel
The H.264/AVC video coding standard features diverse computational hot spots that need to be accelerated to cope with the significantly increased complexity compared to previous standards. In this paper, we propose an optimized application structure (i.e. the arrangement of functional components of an application determining the data flow properties) for the H.264 encoder which is suitable for application-specific and reconfigurable hardware platforms. Our proposed application structural optimization for the computational reduction of the Motion Compensated Interpolation is independent of the actual hardware platform that is used for execution. For a MIPS processor we achieve an average speedup of approximately 60× for Motion Compensated Interpolation. Our proposed application structure reduces the overhead for Reconfigurable Platforms by distributing the actual hardware requirements amongst the functional blocks. This increases the amount of available reconfigurable hardware per Special Instruction (within a functional block) which leads to a 2.84× performance improvement of the complete encoder when compared to a Benchmark Application with standard optimizations. We evaluate our application structure by means of four different hardware platforms.
compilers architecture and synthesis for embedded systems | 2013
Fazal Hameed; Lars Bauer; Jörg Henkel
Two key parameters that determine the performance of a DRAM cache based multi-core system are the DRAM cache hit latency (HL) and the DRAM cache miss rate (MR), as they strongly influence the average DRAM cache access latency. Recently proposed DRAM set mapping policies are either optimized for HL or for MR; none of these policies provides a good HL and MR at the same time. This paper presents a novel DRAM set mapping policy that simultaneously targets both parameters with the goal of achieving the best of both to reduce the overall DRAM cache access latency. For a 16-core system, our proposed set mapping policy reduces the average DRAM cache access latency (which depends on HL and MR) compared to state-of-the-art DRAM set mapping policies that are optimized for either HL or MR by 29.3% and 12.1%, respectively.
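As background on what a set mapping policy does (the specific policy of the paper is not reproduced here), the sketch below maps a block address to a DRAM row, set, and tag such that consecutive sets share a row, which favors row-buffer hits and thus hit latency; all field widths are assumed values.

# Illustrative sketch of how a DRAM-cache set mapping policy turns a block
# address into a (DRAM row, set, tag) triple. Placing several consecutive sets
# in the same DRAM row favors row-buffer hits (lower hit latency), while the
# number of ways per set affects the miss rate. The field widths below are
# invented; this is not the paper's policy, only the general mechanism.

BLOCK_BITS = 6        # 64-byte cache blocks
SETS_PER_ROW = 4      # consecutive sets co-located in one DRAM row
ROWS = 4096

def map_block(addr):
    block = addr >> BLOCK_BITS
    set_index = block % (ROWS * SETS_PER_ROW)
    row       = set_index // SETS_PER_ROW      # consecutive sets share a row
    tag       = block // (ROWS * SETS_PER_ROW)
    return row, set_index, tag

# Two blocks that are adjacent in memory land in the same DRAM row, so a
# lookup of the second one can hit an already-open row buffer.
print(map_block(0x0000_0000))
print(map_block(0x0000_0040))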