Tobias Kenter
University of Paderborn
Publication
Featured research published by Tobias Kenter.
design, automation, and test in europe | 2016
Achim Lösch; Tobias Beisel; Tobias Kenter; Christian Plessl; Marco Platzner
The use of heterogeneous computing resources, such as Graphics Processing Units or other specialized coprocessors, has become widespread in recent years because of their performance and energy efficiency advantages. Approaches for managing and scheduling tasks on heterogeneous resources are still subject to research. Although queuing systems have recently been extended to support accelerator resources, a general solution that manages heterogeneous resources at the operating system level to exploit a global view of the system state is still missing. In this paper we present a user space scheduler that enables task scheduling and migration on heterogeneous processing resources in Linux. Using run queues for available resources, we perform scheduling decisions based on the system state and on task characterization from earlier measurements. With a programming pattern that supports the integration of checkpoints into applications, we preempt tasks and migrate them between three very different compute resources. Considering static and dynamic workload scenarios, we show that this approach can gain up to 17% performance, on average 7%, by effectively avoiding idle resources. We demonstrate that a work-conserving strategy without migration is not a suitable alternative.
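The checkpoint-based programming pattern mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' scheduler: the function names `run_slice` and `run_with_migration`, the resource labels, and the fixed migration plan are all hypothetical. The only point shown is that a task which keeps its progress in a resource-independent state can be stopped at a checkpoint and resumed elsewhere.

```python
def run_slice(data, state, max_steps):
    """Advance a toy summation task by at most max_steps work units.

    state is (next_index, partial_sum). It holds everything needed to
    resume and does not depend on the resource that produced it.
    """
    index, total = state
    steps = 0
    while index < len(data) and steps < max_steps:
        total += data[index]
        index += 1
        steps += 1  # a checkpoint is reached after every work unit
    return (index, total), index >= len(data)


def run_with_migration(data, plan):
    """Run the task on a sequence of (resource, max_steps) slices.

    plan is a hypothetical, fixed schedule; a real scheduler would
    decide at runtime when to preempt and where to migrate.
    """
    state = (0, 0)
    done = False
    trace = []
    for resource, max_steps in plan:
        state, done = run_slice(data, state, max_steps)
        trace.append((resource, state[0]))  # progress after this slice
        if done:
            break
    return state[1], done, trace


data = list(range(10))
total, done, trace = run_with_migration(
    data, [("cpu", 3), ("gpu", 4), ("fpga", 100)]
)
```

In this toy run the task starts on one resource, is preempted twice at checkpoints, and finishes on a third resource with the same result as an uninterrupted run.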
reconfigurable computing and fpgas | 2014
Gavin Vaz; Heinrich Riebler; Tobias Kenter; Christian Plessl
Reconfigurable architectures provide an opportunity to accelerate a wide range of applications, frequently by exploiting data-parallelism, where the same operations are homogeneously executed on a (large) set of data. However, when the sequential code is executed on a host CPU and only data-parallel loops are executed on an FPGA coprocessor, a sufficiently large number of loop iterations (trip count) is required, such that the control- and data-transfer overheads to the coprocessor can be amortized. The trip count of large data-parallel loops is frequently not known at compile time, but only at runtime just before entering a loop. Therefore, we propose to generate code both for the CPU and the coprocessor, and to defer the decision where to execute the appropriate code to the runtime of the application, when the trip count of the loop can be determined. We demonstrate how an LLVM compiler based toolflow can automatically insert appropriate decision blocks into the application code. Analyzing popular benchmark suites, we show that this kind of runtime decision is often applicable. The practical feasibility of our approach is demonstrated by a toolflow that automatically identifies loops suitable for vectorization and generates code for the FPGA coprocessor of a Convey HC-1. The toolflow adds decisions based on a comparison of the runtime-computed trip counts to thresholds for specific loops and also includes support to move just the required data to the coprocessor. We evaluate the integrated toolflow with characteristic loops executed on different input data sizes.
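The runtime decision described in this abstract, comparing a trip count that is only known at runtime against a per-loop threshold, can be sketched as follows. This is an illustration under a simple linear cost model that is assumed here and not taken from the paper: offloading costs a fixed overhead plus a per-iteration time on the coprocessor. The names `break_even_trip_count` and `choose_target` are hypothetical.

```python
import math


def break_even_trip_count(overhead, cpu_time_per_iter, cop_time_per_iter):
    """Smallest trip count n for which offloading is strictly faster.

    Assumed cost model (not from the paper):
        cpu_cost(n) = n * cpu_time_per_iter
        cop_cost(n) = overhead + n * cop_time_per_iter
    Returns None if the coprocessor is not faster per iteration.
    """
    gain = cpu_time_per_iter - cop_time_per_iter
    if gain <= 0:
        return None
    return math.floor(overhead / gain) + 1


def choose_target(trip_count, threshold):
    """Decision block executed just before entering the loop."""
    if threshold is not None and trip_count >= threshold:
        return "coprocessor"
    return "cpu"


# Hypothetical numbers in arbitrary time units.
threshold = break_even_trip_count(
    overhead=100.0, cpu_time_per_iter=1.0, cop_time_per_iter=0.5
)
small_loop = choose_target(trip_count=50, threshold=threshold)
large_loop = choose_target(trip_count=5000, threshold=threshold)
```

With these made-up numbers the short loop stays on the CPU and the long loop is offloaded. In practice a threshold would be determined per loop, for example from measurements, as the abstract indicates.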
field-programmable technology | 2013
Heinrich Riebler; Tobias Kenter; Christoph Sorge; Christian Plessl
Cold-boot attacks exploit the fact that DRAM contents are not immediately lost when a PC is powered off. Instead, the contents decay rather slowly, in particular if the DRAM chips are cooled to low temperatures. This effect opens an attack vector on cryptographic applications that keep decrypted keys in DRAM. An attacker with access to the target computer can reboot it or remove the RAM modules and quickly copy the RAM contents to non-volatile memory. By exploiting the known cryptographic structure of the cipher and layout of the key data in memory, in our application an AES key schedule with redundancy, the resulting memory image can be searched for sections that could correspond to decayed cryptographic keys; then, the attacker can attempt to reconstruct the original key. However, the runtime of these algorithms grows rapidly with increasing memory image size, error rate and complexity of the bit error model, which limits the practicability of the approach. In this work, we study how the algorithm for key search can be accelerated with custom computing machines. We present an FPGA-based architecture on a Maxeler dataflow computing system that outperforms a software implementation by up to 205x, which significantly improves the practicability of cold-boot attacks against AES.
applied reconfigurable computing | 2014
Tobias Kenter; Gavin Vaz; Christian Plessl
In order to leverage the use of reconfigurable architectures in general-purpose computing, quick and automated methods to find suitable accelerator designs are required. We tackle this challenge in both regards. In order to avoid long synthesis times, we target a vector coprocessor, implemented on the FPGAs of a Convey HC-1. Previous studies showed that existing tools were not able to accelerate a real-world application with low effort. We present a toolflow to automatically identify suitable loops for vectorization, generate a corresponding hardware/software bipartition, and generate coprocessor code. Where applicable, we leverage outer-loop vectorization. We evaluate our tools with a set of characteristic loops, systematically analyzing different dependency and data layout properties.
field-programmable custom computing machines | 2013
Heinrich Riebler; Tobias Kenter; Christian Plessl; Christoph Sorge
In this paper, we study how AES key schedules can be reconstructed from decayed memory. This is a crucial and time-consuming operation when trying to break encryption systems with cold-boot attacks. In software, the reconstruction of the AES master key can be performed using a recursive, branch-and-bound tree-search algorithm that exploits redundancies in the key schedule for constraining the search space. In this work, we investigate how this branch-and-bound algorithm can be accelerated with FPGAs. We translate the recursive search procedure to a state machine with an explicit stack for each recursion level and create optimized datapaths to accelerate in particular the processing of the most frequently accessed tree levels. We support two different decay models, of which especially the more realistic non-idealized asymmetric decay model causes very high runtimes in software. Our implementation on a Maxeler dataflow computing system outperforms a software implementation for this model by up to 27x, which makes cold-boot attacks against AES practical even for high error rates.
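The step from a recursive branch-and-bound search to a loop with an explicit stack, as described in the AES key-reconstruction abstract above, can be sketched in Python. This is a toy and not the authors' design. Instead of an AES key schedule it reconstructs a short bit vector, the decay rule (a stored 1 may decay to 0, never the reverse) is a simplified idealized model assumed here, and an even-parity check stands in for the redundancy of a real key schedule. All function names are hypothetical.

```python
def is_consistent(partial, observed):
    """Pruning test for a partial candidate (toy stand-in).

    Assumed idealized decay: a stored 1 may read as 0, but not the
    reverse, so an observed 1 requires a 1 in the candidate. Even
    parity of the full vector stands in for key schedule redundancy.
    """
    level = len(partial) - 1
    if observed[level] == 1 and partial[level] == 0:
        return False
    if len(partial) == len(observed):
        return sum(partial) % 2 == 0
    return True


def search_recursive(observed, partial=None, out=None):
    """Reference version: plain recursive depth-first search."""
    partial = [] if partial is None else partial
    out = [] if out is None else out
    for bit in (0, 1):
        partial.append(bit)
        if is_consistent(partial, observed):
            if len(partial) == len(observed):
                out.append(tuple(partial))
            else:
                search_recursive(observed, partial, out)
        partial.pop()
    return out


def search_explicit_stack(observed):
    """Same search as a loop with one stack entry per tree level."""
    solutions = []
    partial = []             # choices for the levels above the current one
    stack = [iter((0, 1))]   # remaining candidates at each open level
    while stack:
        try:
            bit = next(stack[-1])
        except StopIteration:
            stack.pop()      # level exhausted: return to the parent
            if partial:
                partial.pop()
            continue
        partial.append(bit)
        if not is_consistent(partial, observed):
            partial.pop()    # prune this branch
        elif len(partial) == len(observed):
            solutions.append(tuple(partial))
            partial.pop()
        else:
            stack.append(iter((0, 1)))  # descend one level
    return solutions


observed = (1, 0, 0, 1)
found = search_explicit_stack(observed)
```

Both versions visit candidates in the same order and return the same result. The loop form makes the per-level state explicit, which is the property a hardware state machine needs.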
field programmable gate arrays | 2011
Tobias Kenter; Christian Plessl; Marco Platzner; Michael Kauschke
In this paper we present a fast and fully automated approach for studying the design space when interfacing reconfigurable accelerators with a CPU. The challenge is that a reasonable evaluation of architecture parameters requires a hardware/software partitioning that makes best use of each given architecture configuration. Therefore, we developed a framework based on the LLVM infrastructure that performs this partitioning with high-level estimation of the runtime on the target architecture utilizing profiling information and code analysis. By making use of program characteristics also during the partitioning process, we improve previous results for various benchmarks and especially for growing interface latencies between CPU and accelerator.
Computers & Electrical Engineering | 2016
Gavin Vaz; Heinrich Riebler; Tobias Kenter; Christian Plessl
A broad spectrum of applications can be accelerated by offloading computation intensive parts to reconfigurable hardware. However, to achieve speedups, the number of loop iterations (trip count) needs to be sufficiently large to amortize offloading overheads. Trip counts are frequently not known at compile time, but only at runtime just before entering a loop. Therefore, we propose to generate code for both the CPU and the coprocessor, and defer the offloading decision to the application runtime. We demonstrate how a toolflow, based on the LLVM compiler framework, can automatically embed dynamic offloading decisions into the application code. We perform in-depth static and dynamic analysis of popular benchmarks, which confirm the general potential of such an approach. We also propose to optimize the offloading process by decoupling the runtime decision from the loop execution (decision slack). The feasibility of our approach is demonstrated by a toolflow that automatically identifies suitable data-parallel loops and generates code for the FPGA coprocessor of a Convey HC-1. We evaluate the integrated toolflow with representative loops executed for different input data sizes.
field programmable logic and applications | 2017
Tobias Kenter; Jens Förstner; Christian Plessl
Compared to classical HDL designs, generating FPGA designs with high-level synthesis from an OpenCL specification promises easier exploration of different design alternatives and, through ready-to-use infrastructure and common abstractions for host and memory interfaces, easier portability between different FPGA families. In this work, we evaluate the extent of this promise. To this end, we present a parameterized FDTD implementation for photonic microcavity simulations. Our design can trade off different forms of parallelism and works for two independent OpenCL-based FPGA design flows. Hence, we can target FPGAs from different vendors and different FPGA families. We describe how we used pre-processor macros to achieve this flexibility and to work around different shortcomings of the current tools. Choosing the right design configurations, we are able to present two extremely competitive solutions for very different FPGA targets, reaching up to 172 GFLOPS sustained performance. With the portability and flexibility demonstrated, code developers not only avoid vendor lock-in, but can even make best use of real trade-offs between different architectures.
International Journal of Reconfigurable Computing | 2015
Tobias Kenter; Henning Schmitz; Christian Plessl
FPGAs are known to permit huge gains in performance and efficiency for suitable applications but still require reduced design efforts and shorter development cycles for wider adoption. In this work, we compare the resulting performance of two design concepts that in different ways promise such increased productivity. As a common starting point, we employ a kernel-centric design approach, where computational hotspots in an application are identified and individually accelerated on FPGA. By means of a complex stereo matching application, we evaluate two fundamentally different design philosophies and approaches for implementing the required kernels on FPGAs. In the first implementation approach, we designed individually specialized data flow kernels in a spatial programming language for a Maxeler FPGA platform; in the alternative design approach, we target a vector coprocessor with large vector lengths, which is implemented as a form of programmable overlay on the application FPGAs of a Convey HC-1. We assess both approaches in terms of overall system performance, raw kernel performance, and performance relative to invested resources. After compensating for the effects of the underlying hardware platforms, the specialized dataflow kernels on the Maxeler platform are around 3× faster than kernels executing on the Convey vector coprocessor. In our concrete scenario, due to trade-offs between reconfiguration overheads and exposed parallelism, the advantage of specialized dataflow kernels is reduced to around 2.5×.
reconfigurable computing and fpgas | 2014
Tobias Kenter; Henning Schmitz; Christian Plessl
Stereo-matching algorithms have recently received a lot of attention from the FPGA acceleration community. Presented solutions range from simple, very resource efficient systems with modest matching quality for small embedded systems to sophisticated algorithms with several processing steps, implemented on big FPGAs. In order to achieve high throughput, most implementations strongly focus on pipelining and data reuse between different computation steps. This approach leads to high efficiency, but limits the supported computation patterns, and due to the high integration of the implementation, adaptations to the algorithm are difficult. In this work, we present a stereo-matching implementation that starts by offloading individual kernels from the CPU to the FPGA. Between subsequent compute steps on the FPGA, data is stored off-chip in on-board memory of the FPGA accelerator card. This enables us to accelerate the AD-census algorithm with cross-based aggregation and scanline optimization for the first time without algorithmic changes and for up to full HD image dimensions. Analyzing throughput and bandwidth requirements, we outline some trade-offs that are involved with this approach, compared to tighter integration of more kernel loops into one design.
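As a rough illustration of the kind of kernel such an implementation offloads, the following Python sketch computes a per-pixel matching cost in the style of AD-census: an absolute intensity difference combined with the Hamming distance between census bit patterns, each passed through a robust function. It is a simplified sketch and not the implementation from the paper. It uses grayscale values, a 3x3 census window, and placeholder values for `lam_ad` and `lam_census`; aggregation and scanline optimization are omitted, and all function names are hypothetical.

```python
import math


def census_bits(img, y, x, radius=1):
    """Census pattern: 1 where a neighbor is darker than the center."""
    center = img[y][x]
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            bits.append(1 if img[y + dy][x + dx] < center else 0)
    return bits


def hamming(a, b):
    """Number of positions in which two bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


def robust(cost, lam):
    """Map a non-negative cost into [0, 1) so both terms are comparable."""
    return 1.0 - math.exp(-cost / lam)


def ad_census_cost(left, right, y, x, d, lam_ad=10.0, lam_census=30.0):
    """Cost of matching left pixel (y, x) to right pixel (y, x - d).

    lam_ad and lam_census are placeholder values chosen for this sketch.
    """
    ad = abs(left[y][x] - right[y][x - d])
    census = hamming(census_bits(left, y, x), census_bits(right, y, x - d))
    return robust(ad, lam_ad) + robust(census, lam_census)


# Tiny synthetic pair: the left image is the right image shifted by one pixel.
right = [
    [3, 9, 1, 7, 12, 5, 14, 2],
    [8, 4, 11, 0, 6, 13, 10, 15],
    [16, 2, 5, 9, 3, 12, 1, 7],
]
left = [[row[0]] + row[:-1] for row in right]

costs = [ad_census_cost(left, right, y=1, x=4, d=d) for d in range(3)]
best_disparity = min(range(3), key=lambda d: costs[d])
```

For the synthetic pair the cost is lowest at the disparity that matches the shift used to build the left image.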