Rainer Buchty
Karlsruhe Institute of Technology
Publications
Featured research published by Rainer Buchty.
automation, robotics and control systems | 2004
Rainer Buchty; Nevin Heintze; Dino Oliva
Cryptographic methods are widely used within networking and digital rights management. Numerous algorithms exist, e.g. for securing VPNs or distributing sensitive data over a shared network infrastructure. While these algorithms can be run with moderate performance on general-purpose processors, such processors do not meet typical embedded-systems requirements (e.g. area, cost, and power consumption). Instead, specialized cores dedicated to one algorithm or a combination of algorithms are typically used. These cores provide very high-bandwidth data transmission and meet the needs of embedded systems. However, with such cores, changing the algorithm is not possible without replacing the hardware. This paper describes a fully programmable processor architecture which has been tailored to the needs of a spectrum of cryptographic algorithms and has been explicitly designed to run at high clock rates while maintaining a significantly better performance/area/power tradeoff than general-purpose processors. Both the architecture and instruction set have been developed to achieve a bits-per-clock rate greater than one, even with complex algorithms. This performance is demonstrated with standard cryptographic algorithms (AES and DES) and a widely used hash algorithm (MD5).
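The bits-per-clock figure of merit used here is easy to make concrete. Below is a minimal C sketch of the metric; the cycle counts are invented placeholders for illustration, not measurements from the paper.

```c
#include <stdio.h>

/* Illustrative only: the bits-per-clock figure of merit, computed from a
 * cipher's block size and a (measured or estimated) cycle count per block.
 * The cycle counts in main() are made-up placeholders. */
static double bits_per_clock(unsigned block_bits, unsigned cycles_per_block) {
    return (double)block_bits / (double)cycles_per_block;
}

int main(void) {
    /* AES processes 128-bit blocks; DES processes 64-bit blocks. */
    printf("AES at 100 cycles/block: %.2f bits/clock\n", bits_per_clock(128, 100));
    printf("DES at  50 cycles/block: %.2f bits/clock\n", bits_per_clock(64, 50));
    return 0;
}
```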
compilers, architecture, and synthesis for embedded systems | 2003
Dino Oliva; Rainer Buchty; Nevin Heintze
CRYPTONITE is a programmable processor tailored to the needs of crypto algorithms. The design of CRYPTONITE was based on an in-depth application analysis in which standard crypto algorithms (AES, DES, MD5, SHA-1, etc.) were distilled down to their core functionality. We describe this methodology and use AES as a central example. Starting with a functional description of AES, we give a high-level account of how to implement AES efficiently in hardware, and present several novel optimizations (which are independent of CRYPTONITE). We then describe the CRYPTONITE architecture, highlighting how AES implementation issues influenced the design of the processor and its instruction set. CRYPTONITE is designed to run at high clock rates and be easy to implement in silicon while providing a significantly better performance/area/power tradeoff than general-purpose processors.
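For a flavor of what "distilling AES down to its core functionality" involves, the sketch below shows the textbook GF(2^8) arithmetic behind the AES MixColumns step in C. This is standard AES material, not the paper's CRYPTONITE-specific optimizations.

```c
#include <stdint.h>
#include <stdio.h>

/* GF(2^8) doubling ("xtime") as used throughout AES. */
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1b));
}

/* MixColumns applied to one 4-byte column (standard formulation). */
static void mix_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t = a0 ^ a1 ^ a2 ^ a3;
    c[0] ^= t ^ xtime(a0 ^ a1);
    c[1] ^= t ^ xtime(a1 ^ a2);
    c[2] ^= t ^ xtime(a2 ^ a3);
    c[3] ^= t ^ xtime(a3 ^ a0);
}

int main(void) {
    /* Well-known test column: db 13 53 45 -> 8e 4d a1 bc. */
    uint8_t col[4] = {0xdb, 0x13, 0x53, 0x45};
    mix_column(col);
    printf("%02x %02x %02x %02x\n", col[0], col[1], col[2], col[3]);
    return 0;
}
```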
high assurance systems engineering | 2012
Boris Motruk; Jonas Diemer; Rainer Buchty; Rolf Ernst; Mladen Berekovic
On a multi- or many-core platform that runs applications of different safety criticality (mixed-criticality), all applications have to be certified to the highest level of criticality, unless they are sufficiently isolated. Isolation enables individual certification of applications and cost-efficient re-certification of single applications after an update. We introduce a parameterizable and synthesizable many-core platform with a fast and scalable monitoring and control mechanism that supports safe sharing of resources. Our contribution is a step towards exploiting the benefits of multi- and many-core platforms for mixed-critical applications.
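To make the isolation requirement concrete, here is a hedged C sketch of a generic budget-based monitor; the names, fields, and policy are assumptions for illustration, not the mechanism the paper introduces. It bounds the shared-resource transactions a tile may issue per accounting period, so a misbehaving low-criticality application cannot starve a high-criticality one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-tile budget monitor (not the paper's mechanism). */
typedef struct {
    uint32_t budget;    /* transactions allowed per accounting period */
    uint32_t used;      /* transactions issued in the current period */
} tile_monitor_t;

/* Called for every transaction the tile issues; returns false if the
 * transaction must be stalled until the next period. */
bool monitor_grant(tile_monitor_t *m) {
    if (m->used >= m->budget)
        return false;       /* budget exhausted: contain the offender */
    m->used++;
    return true;
}

/* Called by a periodic timer to start a new accounting period. */
void monitor_replenish(tile_monitor_t *m) {
    m->used = 0;
}
```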
high performance embedded architectures and compilers | 2011
Mario Kicherer; Rainer Buchty; Wolfgang Karl
Today's approaches towards heterogeneous computing rely on either the programmer or dedicated programming models to efficiently integrate heterogeneous components. In this work, we propose an adaptive, cost-aware function-migration mechanism built on top of a light-weight hardware abstraction layer. With this mechanism, the highly dynamic task of choosing the most beneficial processing unit is hidden from the programmer while causing only minor changes to the work and program flow. The migration mechanism transparently adapts to the current workload and system environment without the need for JIT compilation or binary translation. Evaluation shows that our approach successfully adapts to new circumstances and predicts the most beneficial processing unit (PU). Through fine-grained PU selection, our solution achieves a speedup of up to 2.27 for the average kernel execution time while introducing only marginal overhead in case its services are not required.
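The abstract describes the selection policy only in outline; the C sketch below is one plausible reading of cost-aware PU selection, with all names and the averaging policy being assumptions rather than the paper's implementation. Each PU exposes the same kernel behind a common function-pointer interface; the dispatcher times every call and routes the next one to the PU with the lowest running-average cost.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical cost-aware dispatcher (illustration, not the paper's code). */
typedef void (*kernel_fn)(const float *in, float *out, size_t n);

typedef struct {
    kernel_fn run;
    double    avg_ns;   /* running average execution time */
    unsigned  calls;
} pu_entry_t;

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Dispatch to the cheapest PU seen so far; update its cost history. */
void dispatch(pu_entry_t *pus, size_t npus,
              const float *in, float *out, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < npus; i++)
        if (pus[i].calls == 0 ||               /* try unknown PUs first */
            pus[i].avg_ns < pus[best].avg_ns)
            best = i;
    double t0 = now_ns();
    pus[best].run(in, out, n);
    double dt = now_ns() - t0;
    pus[best].avg_ns = (pus[best].avg_ns * pus[best].calls + dt)
                       / (pus[best].calls + 1);
    pus[best].calls++;
}
```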
Concurrency and Computation: Practice and Experience | 2012
Rainer Buchty; Vincent Heuveline; Wolfgang Karl; Jan-Philipp Weiss
In the last few years, the landscape of parallel computing has been subject to profound and highly dynamic changes. The paradigm shift towards multicore and manycore technologies, coupled with accelerators in a heterogeneous environment, offers great potential in computing power for scientific and industrial applications. However, to take full advantage of these new technologies, holistic approaches coupling expertise ranging from hardware architecture and software design to numerical algorithms are a pressing necessity. Parallel computing is no longer limited to supercomputers and is now much more diversified – with a multitude of technologies, architectures, and programming approaches leading to increased complexity for developers and engineers.
reconfigurable communication centric systems on chip | 2014
Thomas Schuster; Rolf Meyer; Rainer Buchty; Luca Fossati; Mladen Berekovic
SoCRocket is a design framework for rapid SoC development. Having emerged from an industrial case study for the European Space Agency (ESA), it enables the design, verification, and evaluation of multiprocessor platforms based on a collection of open and freely available building blocks, including the LEON processor and the RTEMS operating system. Moreover, it provides a modular and standard-compliant tool-set for the creation, configuration, simulation, and performance analysis of virtual platform prototypes, supporting mixed abstraction levels to balance simulation accuracy and speed. Based on state-of-the-art design specification languages such as SystemC/TLM2, all modules are available at three abstraction levels: loosely-timed (LT), approximately-timed (AT), and register-transfer level (RTL). Hence, mixed-abstraction simulations and timing variants can be created quickly depending on the use case. We apply SoCRocket to a proof-of-concept system, optimized in terms of number of cores and cache configurations for the lossless compression of multi- and hyperspectral satellite images, demonstrating the platform's capabilities for design-space exploration. The accuracy of the higher-abstraction models is within a 90% range compared to RTL, while the simulation speedup reaches 1500X for our benchmarks.
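The LT/AT trade-off can be illustrated generically. The C sketch below (not SoCRocket code; all names and latencies are invented) models the same bus read either loosely-timed, with one lumped delay, or approximately-timed, with per-phase delays, selectable per module in the spirit of the framework's abstraction levels.

```c
#include <stdint.h>
#include <stdio.h>

/* Generic illustration of accuracy-vs-speed modeling, not SoCRocket code. */
typedef uint64_t sim_time_ns;
typedef sim_time_ns (*bus_read_model)(uint32_t addr);

/* LT: a single lumped latency per access -- fast to simulate. */
static sim_time_ns bus_read_lt(uint32_t addr) {
    (void)addr;
    return 10;                       /* whole transaction: 10 ns */
}

/* AT: request, data, and response phases timed separately -- slower to
 * simulate but closer to the RTL behavior. */
static sim_time_ns bus_read_at(uint32_t addr) {
    (void)addr;
    sim_time_ns t = 0;
    t += 2;                          /* arbitration / request phase */
    t += 6;                          /* data phase */
    t += 2;                          /* response phase */
    return t;
}

int main(void) {
    bus_read_model model = bus_read_lt;   /* pick abstraction per use case */
    printf("LT latency: %llu ns\n", (unsigned long long)model(0x40000000u));
    model = bus_read_at;
    printf("AT latency: %llu ns\n", (unsigned long long)model(0x40000000u));
    return 0;
}
```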
high performance embedded architectures and compilers | 2012
Mario Kicherer; Fabian Nowak; Rainer Buchty; Wolfgang Karl
Nowadays, many possible configurations of heterogeneous systems exist, posing several new challenges to application development: different types of processing units usually require individual programming models with dedicated runtime systems and accompanying libraries. If these are absent on an end-user system, e.g. because the respective hardware is not present, an application linked against them will break. This handicaps the portability of applications developed on one system and executed on other, differently configured heterogeneous systems. Moreover, the individual benefit of different processing units is normally not known in advance. In this work, we propose a technique to effectively decouple applications from their accelerator-specific parts, i.e., their code. These parts are linked only on demand, and thereby an application can be made portable across systems with different accelerators. As there are usually multiple hardware-specific implementations of a certain task, e.g., a CPU and a GPU version, a method is required to determine which are usable at all and which one is most suitable for execution on the current system. With our approach, application and hardware programmers can express the requirements and abilities of the application and the hardware-specific implementations in a simplified manner. At runtime, the requirements and abilities are compared with regard to the present hardware in order to determine the usable implementations of a task. If multiple implementations are usable, an online-learning, history-based selector is employed to determine the most efficient one. We show that our approach dynamically chooses the fastest usable implementation on several systems while introducing only negligible overhead itself. Applied to an MPI application, our mechanism enables the exploitation of local accelerators on different heterogeneous hosts without prior knowledge or modification of the application.
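On POSIX systems, on-demand linking of this kind is commonly realized with dlopen/dlsym. The hedged C sketch below shows the general idea; the library and symbol names are made up for illustration, and this is not the paper's actual infrastructure.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Sketch of on-demand linking: the accelerator-specific implementation
 * lives in its own shared object, so the application still starts on hosts
 * where the GPU runtime (and thus the library) is missing.
 * Build with: cc demo.c -ldl */
typedef int (*task_fn)(const float *in, float *out, int n);

static task_fn load_task(const char *libname, const char *symbol) {
    void *handle = dlopen(libname, RTLD_NOW | RTLD_LOCAL);
    if (!handle)
        return NULL;                  /* implementation not usable here */
    return (task_fn)dlsym(handle, symbol);
}

int main(void) {
    /* Prefer the GPU version, fall back to the plain CPU version.
     * Both library names are hypothetical. */
    task_fn task = load_task("libtask_gpu.so", "task_run");
    if (!task)
        task = load_task("libtask_cpu.so", "task_run");
    if (!task) {
        fprintf(stderr, "no usable implementation found\n");
        return 1;
    }
    float in[4] = {1, 2, 3, 4}, out[4];
    return task(in, out, 4);
}
```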
international symposium on multimedia | 2007
Jie Tao; Asadollah Shahbahrami; Ben H. H. Juurlink; Rainer Buchty; Wolfgang Karl; Stamatis Vassiliadis
The 2D DWT consists of two 1D DWTs, one in each direction: horizontal filtering processes the rows, followed by vertical filtering, which processes the columns. It is well known that a straightforward implementation of the vertical filtering shows quite different performance for different working-set sizes. The only reasonable explanation for this is the access behavior of the cache memory. As is known, vertical filtering causes mapping conflicts in the cache when the working-set size is a power of two. However, it is not clear how these conflicts form and whether cache problems exist for other data sizes. Such knowledge is the basis for efficient code optimization. In order to acquire this knowledge and to assess the optimization potential more accurately, we apply a cache visualization tool to examine the runtime cache activities of the vertical implementation. We find that besides mapping conflicts, vertical filtering also exhibits a large number of capacity misses. More specifically, the visualization tool allows us to determine the parameters relevant to the optimization strategies. This guarantees the feasibility of the optimization. Our initial experimental results on several different architectures show a gain of up to 215% in execution time compared to an already optimized baseline implementation.
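The access pattern behind the problem is easy to see in code. The C sketch below (illustrative, not the paper's optimized implementation, and using a simple 3-tap filter rather than a real wavelet kernel) shows why column-wise traversal of a row-major image with a power-of-two width maps successive accesses to the same few cache sets, and loop interchange as one common remedy.

```c
#include <stddef.h>

/* Column-wise vertical pass: successive accesses are `width` floats apart,
 * which is cache-hostile when `width` is a power of two (conflict misses). */
void vertical_naive(const float *img, float *out,
                    size_t width, size_t height) {
    for (size_t x = 0; x < width; x++)
        for (size_t y = 1; y + 1 < height; y++)   /* 3-tap example filter */
            out[y * width + x] = 0.25f * img[(y - 1) * width + x]
                               + 0.50f * img[y * width + x]
                               + 0.25f * img[(y + 1) * width + x];
}

/* One common remedy: interchange the loops so the inner loop walks along a
 * row, turning the stride-`width` stream into unit-stride accesses. */
void vertical_interchanged(const float *img, float *out,
                           size_t width, size_t height) {
    for (size_t y = 1; y + 1 < height; y++)
        for (size_t x = 0; x < width; x++)
            out[y * width + x] = 0.25f * img[(y - 1) * width + x]
                               + 0.50f * img[y * width + x]
                               + 0.25f * img[(y + 1) * width + x];
}
```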
automation, robotics and control systems | 2008
Rainer Buchty; Oliver Mattes; Wolfgang Karl
A major problem in parallel computing is maintaining memory consistency and coherency and ensuring ownership and access rights. These problems mainly arise from the fact that memory in parallel and distributed systems is still managed locally, e.g. using a combination of shared-bus- and directory-based approaches. As a result, such setups do not scale well with system size and are especially unsuitable for systems where such centralized management instances cannot or must not be employed. As a potential solution to this problem we present SaM, the Self-aware Memory architecture. By using self-awareness, our approach provides a novel memory architecture concept targeting multi-master systems, with special focus on autonomic, self-managing systems. Unlike previous attempts, the approach delivers a holistic, yet scalable and cost-friendly solution to several memory-related problems, including maintaining coherency, consistency, and access rights.
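To illustrate the shift away from centralized management, the C sketch below shows a memory module performing its own local admission check on self-describing request messages. All names and fields are assumptions made for illustration; this is not the SaM protocol itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch only: each memory module decides locally whether the
 * issuing master holds the required access rights, with no central
 * directory involved. */
typedef enum { REQ_READ, REQ_WRITE } req_type_t;

typedef struct {
    uint32_t   master_id;   /* issuing processing unit */
    req_type_t type;
    uint64_t   addr;
} mem_request_t;

typedef struct {
    uint32_t owner_id;      /* master currently owning this region */
    bool     shared_read;   /* other masters may read */
} region_rights_t;

/* Local admission check performed by the memory module itself. */
bool module_admit(const region_rights_t *r, const mem_request_t *q) {
    if (q->master_id == r->owner_id)
        return true;                       /* owner may read and write */
    return q->type == REQ_READ && r->shared_read;
}
```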
automation, robotics and control systems | 2009
Rainer Buchty; David Kramer; Mario Kicherer; Wolfgang Karl
When targeting hardware accelerators and reconfigurable processing units, the question of programmability arises, i.e. how different implementations of individual, configuration-specific functions are provided. Conventionally, this is resolved either at compile time, with a specific hardware environment being targeted, by initialization routines at program start, or by decision trees at run-time. Such techniques are, however, hardly applicable to dynamically changing architectures. Furthermore, these approaches show conceptual drawbacks, such as requiring access to source code, requiring upfront knowledge of future system configurations, and overloading the code with reconfiguration-related control routines. We therefore present a low-overhead technique enabling on-demand resolving of individual functions. This technique can be applied in two different manners; we discuss the benefits of the individual implementations and show how both approaches can be used to establish code compatibility between different heterogeneous, reconfigurable, and parallel architectures. Furthermore, we show that both approaches expose only insignificant overhead.
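One way such on-demand resolving is commonly realized, in the spirit of lazy PLT binding, is sketched below in C. This is a hedged illustration of the general idea, not the paper's technique: the function pointer initially targets a resolver stub that picks the implementation matching the current hardware on first use, then patches the pointer so every later call is direct and pays no further overhead.

```c
#include <stdio.h>

/* Lazy on-demand function resolving (illustrative sketch). */
static int kernel_cpu(int x);
static int kernel_accel(int x);
static int kernel_resolve(int x);

/* The public entry point: starts out pointing at the resolver. */
static int (*kernel)(int) = kernel_resolve;

static int accelerator_present(void) { return 0; /* hardware probe stub */ }

static int kernel_cpu(int x)   { return x + 1; }
static int kernel_accel(int x) { return x + 1; /* accelerator variant */ }

/* First call lands here: pick an implementation, patch the pointer,
 * forward the call. All later calls bypass the resolver entirely. */
static int kernel_resolve(int x) {
    kernel = accelerator_present() ? kernel_accel : kernel_cpu;
    return kernel(x);
}

int main(void) {
    printf("%d\n", kernel(41));  /* goes through the resolver once */
    printf("%d\n", kernel(41));  /* direct call, no resolver overhead */
    return 0;
}
```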