Jens Huthmann
Technische Universität Darmstadt
Publications
Featured research published by Jens Huthmann.
Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC) | 2013
Jens Huthmann; Björn Liebig; Julian Oppermann; Andreas Koch
The Nymble compiler system accepts C code, annotated by the user with partitioning directives, and translates the indicated parts into hardware accelerators for execution on FPGA-based reconfigurable computers. The interface logic between the remaining software parts and the accelerators is created automatically, taking into account details such as cache flushes and copying of FPGA-local memories to the shared main memory. The system also supports calls from hardware back into software, both for infrequent operations that do not merit hardware area and for operating system / library services such as memory management and I/O.
Field-Programmable Logic and Applications (FPL) | 2011
Benjamin Thielmann; Jens Huthmann; Andreas Koch
We propose a universal method to automatically generate both data paths and the appropriate application-specific speculation-support logic from high-level C-language descriptions. Our approach aims to be lightweight by extending efficient statically scheduled microarchitectures with a limited dynamic token model to predict, commit, and replay speculation events. As a first source of speculation, we evaluate the use of data-value speculation to speed up memory reads when targeting a reconfigurable adaptive computer.
Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC) | 2011
Benjamin Thielmann; Jens Huthmann; Andreas Koch
The PreCoRe approach allows the automatic generation of application-specific microarchitectures from C, thus supporting complex speculative execution on reconfigurable computers. In this work, we present the PreCoRe capability of using data-value speculation to reduce the latency of memory reads, as well as the lightweight extension of static datapath controllers to the dynamic replay of misspeculated operations. The experimental evaluation considers the performance / area impact of the approach and also discusses the individual effects of combining different speculation mechanisms.
Field-Programmable Logic and Applications (FPL) | 2014
Jens Huthmann; Julian Oppermann; Andreas Koch
We describe extending the hardware/software co-compiler Nymble to automatically generate multi-threaded (SIMT) hardware accelerators. In contrast to prior work that simply duplicated complete compute units for each thread, Nymble-MT reuses the actual computation elements and adds only the required data storage and context-switching logic. On the CHStone benchmark suite, with a sample configuration of four threads, the prototype achieves up to four times the throughput with a chip area only 5% larger than that of a single-threaded accelerator.
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS) | 2010
Jens Huthmann; Peter Müller; Florian Stock; Dietmar Hildenbrand; Andreas Koch
Geometric Algebra (GA), a generalization of quaternions, is a very powerful form for intuitively expressing and manipulating complex geometric relationships common to engineering problems. The actual evaluation of GA expressions, though, is extremely compute intensive due to the high dimensionality of the data being processed. On standard desktop CPUs, GA evaluations take considerably longer than conventional mathematical formulations. GPUs do offer sufficient throughput to make the use of concise GA formulations practical, but require power far exceeding the budgets of most embedded applications. While the suitability of low-power reconfigurable accelerators for evaluating specific GA computations has already been demonstrated, these often required a significant manual design effort. We present a proof-of-concept compile flow combining symbolic and hardware optimization techniques to automatically generate, without user intervention, accelerators from abstract GA descriptions that are suitable for high-performance embedded computing.
IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW) | 2013
Björn Liebig; Jens Huthmann; Andreas Koch
Multiply-add operations form a crucial part of many digital signal processing and control engineering applications. Since their performance is critical for the application-level speed-up, it is worthwhile to explore a wide spectrum of implementation alternatives, trading increased area/energy usage for faster units on the critical path of the computation. This paper examines existing solutions and proposes two new architectures for floating-point fused multiply-adds, and also considers the impact of different in-fabric features of recent FPGA architectures. The units rely on different degrees of carry-save arithmetic to improve performance by up to 2.5x over the closest state-of-the-art competitor. They are evaluated at the application level by modifying an existing high-level synthesis system to automatically insert the new units for computations on the critical path of three different convex solvers.
Reconfigurable Computing and FPGAs (ReConFig) | 2011
Benjamin Thielmann; Thorsten Wink; Jens Huthmann; Andreas Koch
Increasing the degree of speculative execution in application-specific microarchitectures, which can be generated for reconfigurable computers from high-level code using techniques such as PreCoRe, also increases the pressure on the memory system. The RAP approach introduced here describes and evaluates application-specific microarchitectural techniques to reduce the impact of aggressively speculated memory accesses. It covers a lightweight resolution mechanism for dynamic RAW memory dependencies, avoiding execution replays due to misspeculated reads, and a prioritization scheme for arbitrating the use of shared resources based on the degree of speculativeness of the individual access.
Field-Programmable Technology (FPT) | 2015
Jens Huthmann; Andreas Koch
Recent high-level synthesis tools offer the capability to generate multi-threaded microarchitectures to hide memory access latencies. In many HLS flows, this is achieved by simply creating multiple processing-element instances (one per thread). More advanced compilers, however, can synthesize hardware in a spatial form of the barrel-processor or simultaneous multithreading (SMT) approach, where only state storage is replicated per thread, while the actual hardware operators in a single datapath are reused between threads. The spatial nature of the microarchitecture applies not only to the hardware operators, but also to the thread-scheduling facility, which is itself spatially distributed across the entire datapath in separate hardware stages. Since each of these thread-scheduling stages, which also allow a reordering of threads, adds hardware overhead, it is worthwhile to examine how their number can be reduced while maintaining the performance of the entire datapath. We report on a number of thinning options and examine their impact on system performance. For kernels from the MachSuite HLS benchmark collection, we achieved area savings of up to 50% of LUTs and 50% of registers while maintaining full performance for the compiled hardware accelerators.
ACM Transactions on Reconfigurable Technology and Systems | 2012
Benjamin Thielmann; Jens Huthmann; Andreas Koch
Load value speculation has long been proposed as a method to hide the latency of memory accesses, but it has seen very limited use in actual processors, often due to the high overhead of re-executing misspeculated computations. We present PreCoRe, a framework capable of generating application-specific microarchitectures supporting load value speculation on reconfigurable computers. The article examines the lightweight speculation and replay mechanisms, the architecture of the actual data value prediction units, and the impact on the nonspeculative parts of the memory system. In experiments, using PreCoRe has achieved speedups of up to 2.48x over nonspeculative implementations.
Archive | 2013
Benjamin Thielmann; Jens Huthmann; Thorsten Wink; Andreas Koch
The rate of improvement in the single-thread performance of conventional central processing units (CPUs) has decreased significantly over the last decade. This is mainly due to the difficulties in obtaining higher clock frequencies. As a consequence, the focus of development has shifted to multi-threaded execution models and multi-core CPU designs instead. Unfortunately, there are still many important algorithms and applications that cannot easily be rewritten to take advantage of this new computing paradigm. Thus, the performance gap between parallelizable algorithms and those depending on single-thread performance has widened significantly. Application-specific hardware accelerators with optimized pipelines are able to provide improved single-thread performance but have only limited flexibility and require high development effort compared to programming software-programmable processors (SPPs).