Markus Weinhardt
Karlsruhe Institute of Technology
Publication
Featured research published by Markus Weinhardt.
The Journal of Supercomputing | 2003
V. Baumgarte; G. Ehlers; Frank May; A. Nückel; Martin Vorbach; Markus Weinhardt
The eXtreme Processing Platform (XPP™) is a new runtime-reconfigurable data processing architecture. It is based on a hierarchical array of coarse-grain, adaptive computing elements, and a packet-oriented communication network. The strength of the XPP™ technology originates from the combination of array processing with unique, powerful run-time reconfiguration mechanisms. Parts of the array can be configured rapidly in parallel while neighboring computing elements are processing data. Reconfiguration is triggered externally or even by special event signals originating within the array, enabling self-reconfiguring designs. The XPP™ architecture is designed to support different types of parallelism: pipelining, instruction level, data flow, and task level parallelism. Therefore this technology is well suited for applications in multimedia, telecommunications, simulation, signal processing (DSP), graphics, and similar stream-based application domains. The anticipated peak performance of the first commercial device running at 150 MHz is estimated to be 57.6 GigaOps/sec, with a peak I/O bandwidth of several GByte/sec. Simulated applications achieve up to 43.5 GigaOps/sec (32-bit fixed point).
field programmable custom computing machines | 1999
Markus Weinhardt; Wayne Luk
This paper presents pipeline vectorization, a method for synthesizing hardware pipelines in reconfigurable systems based on software vectorizing compilers. The method improves efficiency and ease of development of reconfigurable designs, particularly for users with little electronics design experience. We propose several loop transformations to customize pipelines to meet hardware resource constraints, while maximising available parallelism. For run-time reconfigurable systems, we apply hardware specialization to increase circuit utilization. Our approach is especially effective for highly repetitive computations in DSP and multimedia applications. Case studies using FPGA-based platforms are presented to demonstrate the benefits of our approach and to evaluate trade-offs between alternative implementations. The loop tiling transformation, for instance, has been found to improve performance by 30 to 40 times above a PC-based software implementation, depending on whether run-time reconfiguration is used.
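As a rough illustration of the loop tiling transformation mentioned in this abstract, the sketch below tiles the inner loop of a small 3-tap row filter so that each pass only touches a fixed-width strip of columns. The filter, the function names, and the `tile` parameter are assumptions made for this example; they are not taken from the paper, and the code only shows that tiling preserves the result, not how the hardware pipeline is generated.

```python
def blur_rows(image, width):
    """Untiled reference: 3-tap horizontal sum for every interior pixel."""
    out = []
    for row in image:
        out.append([row[x - 1] + row[x] + row[x + 1] for x in range(1, width - 1)])
    return out


def blur_rows_tiled(image, width, tile):
    """Same computation, with the column loop split into strips of `tile` columns.

    Each strip is small enough that (hypothetically) its row buffer would fit
    the on-chip resources of the target device.
    """
    out = [[0] * (width - 2) for _ in image]
    for x0 in range(1, width - 1, tile):        # outer loop over column strips
        x1 = min(x0 + tile, width - 1)
        for y, row in enumerate(image):         # one pipeline pass per strip
            for x in range(x0, x1):
                out[y][x - 1] = row[x - 1] + row[x] + row[x + 1]
    return out


image = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
reference = blur_rows(image, 5)
tiled = blur_rows_tiled(image, 5, tile=2)
```

Both variants produce the same output; only the iteration order changes, which is what lets a compiler size the pipeline to the available hardware.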
ACM Computing Surveys | 2010
João M. P. Cardoso; Pedro C. Diniz; Markus Weinhardt
Reconfigurable computing platforms offer the promise of substantially accelerating computations through the concurrent nature of hardware structures and the ability of these architectures for hardware customization. Effectively programming such reconfigurable architectures, however, is an extremely cumbersome and error-prone process, as it requires programmers to assume the role of hardware designers while mastering hardware description languages, thus limiting the acceptance and dissemination of this promising technology. To address this problem, researchers have developed numerous approaches at both the programming languages as well as the compilation levels, to offer high-level programming abstractions that would allow programmers to easily map applications to reconfigurable architectures. This survey describes the major research efforts on compilation techniques for reconfigurable computing architectures. The survey focuses on efforts that map computations written in imperative programming languages to reconfigurable architectures and identifies the main compilation and synthesis techniques used in this mapping.
field programmable logic and applications | 2002
João M. P. Cardoso; Markus Weinhardt
The eXtreme Processing Platform (XPP) is a unique reconfigurable computing (RC) architecture supported by a complete set of design tools. This paper presents the XPP Vectorizing C Compiler XPP-VC, the first high-level compiler for this architecture. It uses new mapping techniques, combined with efficient vectorization. A temporal partitioning phase guarantees the compilation of programs with unlimited complexity, provided that only the supported C subset is used. A new loop partitioning scheme permits mapping large loops of any kind. It is not constrained by loop dependences or nesting levels. To our knowledge, the compilation performance is unmatched by any other compiler for RC. Preliminary evaluations show compilation times of only a few seconds from C code to configuration binaries and performance speedups over standard microprocessor implementations. The overall technology represents a significant step toward RC architectures which are faster and simpler to program.
design, automation, and test in europe | 2003
João M. P. Cardoso; Markus Weinhardt
The emergence of run-time reconfigurable architectures makes feasible the configure-execute paradigm. Compilation of behavioral descriptions (e.g., in C or Java), apart from mapping the computational structures onto the available resources on the device, must split the program into temporal sections if it needs more resources than are physically available. In addition, since the execution of the computational structures in a configuration needs at least two stages (i.e., configuring and computing), it is important to split the program such that the reconfiguration overheads are minimized, taking advantage of the overlapping of the execution stages on different configurations. This paper presents mapping techniques to cope with those features. The techniques are being researched in the context of a C compiler for the eXtreme Processing Platform (XPP). Temporal partitioning is applied to furnish a set of configurations that reduces the reconfiguration overhead and thus may lead to performance gains. We also show that when applications include a sequence of loops, the use of several configurations may be more beneficial than the mapping of the entire application onto a single configuration. Preliminary results for a number of benchmarks strongly confirm the approach.
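The two ideas in this abstract, splitting a program into configurations that fit the device and overlapping configuration with execution, can be sketched with a toy model. The greedy splitting rule, the cost numbers, and the names `temporal_partition` and `total_time` are assumptions for illustration only; the paper's actual partitioning algorithm is not reproduced here.

```python
def temporal_partition(costs, capacity):
    """Greedily group consecutive loops into configurations that fit `capacity`."""
    partitions, current, used = [], [], 0
    for index, cost in enumerate(costs):
        if cost > capacity:
            raise ValueError(f"loop {index} alone exceeds the device capacity")
        if current and used + cost > capacity:
            partitions.append(current)          # close the current configuration
            current, used = [], 0
        current.append(index)
        used += cost
    if current:
        partitions.append(current)
    return partitions


def total_time(config_times, exec_times, overlap):
    """Total run time of a sequence of configurations.

    Without overlap every configuration is loaded, then executed.
    With overlap, configuration i+1 is loaded while configuration i executes.
    """
    if not overlap:
        return sum(config_times) + sum(exec_times)
    elapsed = config_times[0]                   # the first load cannot be hidden
    for i, exec_time in enumerate(exec_times):
        next_load = config_times[i + 1] if i + 1 < len(config_times) else 0
        elapsed += max(exec_time, next_load)
    return elapsed


parts = temporal_partition([3, 4, 2, 5], capacity=6)
sequential = total_time([2, 2, 2], [5, 1, 4], overlap=False)
overlapped = total_time([2, 2, 2], [5, 1, 4], overlap=True)
```

In this toy model the overlapped schedule hides a reconfiguration whenever the preceding configuration runs at least as long as the next one takes to load.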
field-programmable technology | 2004
Markus Weinhardt; Martin Vorbach; V. Baumgarte; Frank May
This work presents function folding, a design principle to improve the silicon efficiency of reconfigurable arithmetic (coarse-grain) arrays. Though highly parallel implementations of DSP algorithms have been demonstrated on these arrays, the overall silicon efficiency of current devices is limited by both the large numbers of ALUs required in the array and by the only moderate speeds which are achieved. The operating frequencies are mainly limited by the requirements of nonlocal routing connections. We present a novel approach to overcome these limitations: In function folding, a small number of distinct operators belonging to the same configuration are folded onto the same ALU, i.e. executed sequentially on one processing element. The ALU is controlled by a program repetitively executing the same instruction sequence. Data only required locally is stored in a local register file. This sequential approach uses the individual ALU resources more efficiently, while all processing elements of the array work in parallel as in current devices. Additionally, the ALUs and local registers can be clocked with a higher frequency than the (nonlocal) routing connections. Overall, a higher computational density than in current devices results.
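A minimal software model of the folding idea: several operators of one configuration run sequentially on a single ALU, which repeats the same short instruction sequence for every data item and keeps intermediate values in a local register file. The instruction format, the register names, and the example expression are hypothetical and chosen only for this sketch.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}


def folded_alu(program, stream):
    """Run the same instruction sequence once per input item on one modelled ALU.

    program: list of (op, dest, src_a, src_b) tuples, executed in order.
    stream:  iterable of dicts holding the named inputs for one item.
    """
    results = []
    for inputs in stream:
        regs = dict(inputs)                     # local register file for this item
        for op, dest, src_a, src_b in program:
            regs[dest] = OPS[op](regs[src_a], regs[src_b])
        results.append(regs["out"])
    return results


# out = (a + b) * c - a, folded onto one ALU as three sequential instructions
program = [
    ("add", "t0", "a", "b"),
    ("mul", "t1", "t0", "c"),
    ("sub", "out", "t1", "a"),
]
stream = [{"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 1, "c": 4}]
outputs = folded_alu(program, stream)
```

A fully spatial mapping would spend three ALUs on this expression; the folded version spends one ALU and three cycles per item, which is the trade-off the abstract describes.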
field programmable logic and applications | 1999
Markus Weinhardt; Wayne Luk
This paper describes memory access optimization in the context of pipeline vectorization, a method for synthesizing hardware pipelines in reconfigurable systems from software program loops. Since many algorithms for reconfigurable coprocessors are I/O bound, the throughput of the coprocessor is determined by the external memory accesses. Thus access optimizations directly improve the system’s performance. Two kinds of optimizations have been studied. First, we consider methods for reducing the number of accesses based on saving frequently-used data in on-chip storage. In particular, recent FPGAs provide on-chip RAM which can be used for this purpose. We present RAM inference, a technique which automatically extracts small on-chip RAMs to reduce external memory accesses. Second, we aim to minimize the time spent on external accesses by scheduling as many accesses in parallel as possible. This optimization only applies to architectures with multiple memory banks. We present a technique which allocates program arrays to memory banks, thereby minimizing the overall access time.
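The second optimization can be illustrated with a small model in which accesses to different banks proceed in parallel and accesses to the same bank are serialized, so the access time per loop iteration is the load of the busiest bank. The exhaustive search below is an assumption made for clarity on tiny inputs; it is not the allocation technique from the paper, and the array names and access counts are invented.

```python
from itertools import product


def allocate_banks(accesses, num_banks):
    """Return (cycles_per_iteration, array -> bank) minimizing the busiest bank.

    accesses: dict mapping array name to its accesses per loop iteration.
    """
    names = sorted(accesses)
    best = None
    for assignment in product(range(num_banks), repeat=len(names)):
        load = [0] * num_banks
        for name, bank in zip(names, assignment):
            load[bank] += accesses[name]        # same-bank accesses are serialized
        cycles = max(load)                      # banks work in parallel
        if best is None or cycles < best[0]:
            best = (cycles, dict(zip(names, assignment)))
    return best


cycles, mapping = allocate_banks({"a": 2, "b": 1, "c": 1}, num_banks=2)
```

With a single bank the same loop would need four access cycles per iteration; two banks cut that to two by placing the heavily used array on its own bank.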
field programmable logic and applications | 1995
Markus Weinhardt
This paper presents a new partitioning method for software oriented hardware/software codesign. It is applied to the use of field-programmable accelerator boards. In the underlying model the dedicated hardware has no direct access to the host memory, and communication is slow. Therefore detailed data-flow information is necessary to minimize the communication overhead between host and accelerator board. The partitioning problem is formulated as an integer (linear) program which simultaneously determines which code regions should be implemented in dedicated hardware and which data has to be communicated, so that well-known optimization algorithms can be applied.
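To make the formulation concrete, the sketch below treats each code region as a 0/1 decision (software or hardware) and charges a transfer cost whenever data crosses the host/accelerator boundary. Brute-force enumeration stands in for an integer linear programming solver, and the cost model, region names, and numbers are assumptions for illustration rather than the paper's exact constraints.

```python
from itertools import product


def partition(regions, edges, area_limit):
    """Pick the regions to implement in hardware so that total time is minimal.

    regions: name -> (software_time, hardware_time, hardware_area)
    edges:   list of (producer, consumer, transfer_time) data dependences
    """
    names = list(regions)
    best_cost, best_hw = None, None
    for bits in product((0, 1), repeat=len(names)):
        in_hw = dict(zip(names, bits))
        area = sum(regions[n][2] for n in names if in_hw[n])
        if area > area_limit:                   # resource constraint
            continue
        cost = sum(regions[n][1] if in_hw[n] else regions[n][0] for n in names)
        # data is communicated only when producer and consumer are on different sides
        cost += sum(t for p, c, t in edges if in_hw[p] != in_hw[c])
        if best_cost is None or cost < best_cost:
            best_cost, best_hw = cost, {n for n in names if in_hw[n]}
    return best_cost, best_hw


regions = {"r1": (10, 2, 4), "r2": (8, 1, 4), "r3": (3, 3, 1)}
edges = [("r1", "r2", 5), ("r2", "r3", 1)]
large_board = partition(regions, edges, area_limit=8)
small_board = partition(regions, edges, area_limit=4)
```

On the larger board, moving `r1` and `r2` together avoids the expensive transfer between them; on the smaller board only one of them fits, and the transfer cost is paid.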
Archive | 2005
João M. P. Cardoso; Markus Weinhardt
The eXtreme Processing Platform (XPP) is a coarse-grained dynamically reconfigurable architecture. Its advanced reconfiguration features make feasible the configure-execute paradigm, the natural paradigm of dynamically reconfigurable computing. This chapter presents a compiler for programming the XPP using a subset of the C language. The compiler, apart from mapping the computational structures onto the available resources on the device, splits the program into temporal sections when it needs more resources than are physically available. In addition, since the execution of the computational structures in a configuration needs at least two stages (e.g., configuring and computing), the chapter presents a scheme to split the program such that the reconfiguration overheads are minimized, taking advantage of the overlapping of the execution stages on different configurations.
field programmable logic and applications | 1996
Markus Weinhardt
Several programming methodologies based on high-level languages have been proposed for FPGA-based Custom Computing Machines (FCCMs). But most of these methods either use different languages for hardware and software specification, require the programmer to partition the system manually, or yield an unsatisfying speedup due to the limitations of a sequential input language. Furthermore, the existing systems are often limited to one FCCM architecture or to a specific application domain. This paper presents an integrated high-level language based programming approach for FCCMs which allows automatic partitioning and exploits hardware parallelism by synthesising pipelined circuits from parallel FOR-loops. Experiments with a modular prototype implementation designed for portability show the feasibility of this approach.