Pascal Schleuniger
Technical University of Denmark
Publications
Featured research published by Pascal Schleuniger.
Design, Automation, and Test in Europe | 2011
Martin Schoeberl; Pascal Schleuniger; Wolfgang Puffitsch; Florian Brandner; Christian W. Probst; Sven Karlsson; Tommy Thorn
Current processors are optimized for average-case performance, often leading to a high worst-case execution time (WCET). Many architectural features that increase average-case performance are hard to model for WCET analysis. In this paper we present Patmos, a processor optimized for low WCET bounds rather than high average-case performance. Patmos is a dual-issue, statically scheduled RISC processor. The instruction cache is organized as a method cache and the data cache as a split cache in order to simplify cache WCET analysis. To fill the dual-issue pipeline with enough useful instructions, Patmos relies on a customized compiler. The compiler also plays a central role in optimizing the application for the WCET rather than for average-case performance.
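As an illustrative aside, the method cache mentioned above caches whole functions as units, so cache misses can only occur at calls and returns, which simplifies WCET analysis. A minimal sketch of the idea, assuming a FIFO replacement policy and word-granular sizes (both assumptions for illustration, not details from the paper):

```python
from collections import OrderedDict

# Hypothetical sketch of a method cache: whole functions are cached as
# units, so hits/misses occur only at call and return points.
class MethodCache:
    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.blocks = OrderedDict()   # method name -> size in words
        self.used = 0

    def access(self, method, size):
        """Simulate loading a method on a call; returns True on a hit."""
        if method in self.blocks:
            return True
        # Evict whole methods (FIFO order) until the new one fits.
        while self.used + size > self.capacity and self.blocks:
            _, evicted_size = self.blocks.popitem(last=False)
            self.used -= evicted_size
        self.blocks[method] = size
        self.used += size
        return False
```

Because eviction happens only at these well-defined points, a WCET analyzer can bound the miss cost per call site instead of per instruction fetch.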
Automation, Robotics and Control Systems | 2013
Pascal Schleuniger; Anders Kusk; Jørgen Dall; Sven Karlsson
Synthetic aperture radar (SAR) is a high-resolution imaging radar. The direct back-projection algorithm allows for precise SAR output image reconstruction and can compensate for deviations in the flight track of airborne radars. Graphics processing units (GPUs) are often used for data processing, as the back-projection algorithm is computationally expensive and highly parallel. However, GPUs may not be an appropriate solution for applications with strictly constrained space and power requirements. In this paper, we describe how we map a SAR direct back-projection application to a multi-core system on an FPGA. The fabric, consisting of 64 processor cores and a 2D mesh interconnect, utilizes 60% of the hardware resources of a Xilinx Virtex-7 device with 550 thousand logic cells and consumes about 10 watts. We apply software pipelining to hide memory latency and reduce the hardware footprint by 14%. We show that the system provides real-time processing of a SAR application that maps a 3000 m wide area with a resolution of 2x2 meters.
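The core of direct back-projection is simple to state: for every ground pixel, sum the echo samples taken at the round-trip delay from each platform position. A minimal sketch, where the function name, sampling rate, and data layout are illustrative assumptions rather than details from the paper:

```python
import numpy as np

# Illustrative direct back-projection sketch (parameters are assumptions):
# echoes[a, s] is sample s of the echo recorded at platform position a.
def backproject(echoes, platform_pos, pixels, c=3e8, fs=1e8):
    """For each pixel, accumulate echo samples at the round-trip delay."""
    image = np.zeros(len(pixels), dtype=complex)
    for p_idx, pixel in enumerate(pixels):
        for a_idx, pos in enumerate(platform_pos):
            r = np.linalg.norm(pixel - pos)        # slant range to pixel
            sample = int(round(2 * r / c * fs))    # round-trip delay in samples
            if sample < echoes.shape[1]:
                image[p_idx] += echoes[a_idx, sample]
    return np.abs(image)
```

The two nested loops are independent across pixels, which is exactly the parallelism the 64-core fabric exploits; the per-pixel memory accesses are what the software pipelining hides.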
Applied Reconfigurable Computing | 2010
Michael Reibel Boesen; Pascal Schleuniger; Jan Madsen
The increasing demand for robust and reliable systems is pushing the development of adaptable and reconfigurable hardware platforms. However, the robustness and reliability come at an overhead in terms of longer execution times and larger platform areas. In this paper we explore the performance overhead of executing applications on a specific platform, eDNA, a bio-inspired reconfigurable hardware platform with self-healing capabilities. eDNA is a scalable and coarse-grained architecture consisting of an array of interconnected cells, aimed at defining a new type of FPGA. We study the performance of an emulation of the platform implemented as a PicoBlaze-based multi-core architecture with up to 49 cores realised on a Xilinx Virtex-II Pro FPGA. We show a performance overhead compared to a single-processor solution in the range of 2x-18x, depending on the size and complexity of the application. Although this is very large, most of it can be attributed to the limitations of the PicoBlaze-based prototype implementation. More importantly, the overhead after self-healing, where up to 30-75% of faulty cells are replaced by spare cells on the platform, is less than 20%.
Automation, Robotics and Control Systems | 2012
Pascal Schleuniger; Sally A. McKee; Sven Karlsson
As FPGAs become more competitive, synthesizable processor cores become an attractive choice for embedded computing. However, currently popular commercial processor cores do not fully exploit modern FPGA architectures. In this paper, we propose general design principles to increase instruction throughput on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. Through micro-benchmarks, we demonstrate that these principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration.
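The predication principle above replaces a branch with straight-line code guarded by a predicate, so a deep (superpipelined) pipeline never has to flush on a misprediction. A toy illustration in Python rather than Tinuso assembly (the function names are ours, purely for illustration):

```python
# Branchy form: a hard-to-predict branch can stall a deep pipeline.
def branchy_abs(x):
    if x < 0:
        return -x
    return x

# Predicated form: both candidate results are computed and a predicate
# selects between them arithmetically, with no control-flow branch.
def predicated_abs(x):
    p = int(x < 0)                   # predicate: 1 if condition holds
    return p * (-x) + (1 - p) * x    # predicate selects the result
```

In hardware, the predicated version trades a little extra datapath work for a branch-free instruction stream, which is the trade-off the Tinuso design exploits.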
Conference on Design and Architectures for Signal and Image Processing | 2014
Andreas Erik Hindborg; Pascal Schleuniger; Nicklas Bo Jensen; Sven Karlsson
Field-programmable gate arrays (FPGAs) are attractive implementation platforms for low-volume signal and image processing applications. The structure of FPGAs allows for an efficient implementation of parallel algorithms. Sequential algorithms, on the other hand, often perform better on a microprocessor. It is therefore convenient for many applications to employ a synthesizable microprocessor to execute sequential tasks and custom hardware structures to accelerate parallel sections of an algorithm. In this paper, we discuss the hardware realization of Tinuso-I, a small synthesizable processor core that can be integrated into many signal and data processing platforms on FPGAs. We also show how the processor is given access to operating system services. For a set of SPLASH-2 and SPEC CPU2006 benchmarks, we show a speedup of up to 64% over a similar Xilinx MicroBlaze implementation while using 27% to 35% fewer hardware resources.
Asilomar Conference on Signals, Systems and Computers | 2014
Andreas Erik Hindborg; Pascal Schleuniger; Nicklas Bo Jensen; Maxwell Walter; Laust Brock-Nannestad; Lars Frydendal Bonnichsen; Christian W. Probst; Sven Karlsson
High-performance computing systems make increasing use of hardware accelerators to improve performance and power properties. For large high-performance FPGAs to be successfully integrated in such computing systems, methods to raise the abstraction level of FPGA programming are required. In this paper we propose a tool flow that automatically generates highly optimized hardware multicore systems based on parameters. Profiling feedback is used to adjust these parameters to improve performance and lower power consumption. For an image processing application, we show that our tools are able to identify optimal performance-energy trade-off points for a multicore-based FPGA accelerator.
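Selecting "optimal trade-off points" between runtime and energy amounts to keeping the Pareto-optimal configurations: those not beaten on both metrics by any other candidate. A minimal sketch of that selection step, assuming profiled (runtime, energy) pairs per configuration (the representation is our assumption, not the paper's tool interface):

```python
# Illustrative sketch: keep the Pareto front of profiled configurations.
# Each point is a (runtime, energy) pair; lower is better on both axes.
def pareto_front(points):
    """Return the points not dominated by any other point."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

A profiling-driven tool flow would iterate: generate a configuration, measure it, add the point, and steer the next parameter choice toward the current front.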
Applied Reconfigurable Computing | 2014
Pascal Schleuniger; Sven Karlsson
Active microwave imaging techniques such as radar and tomography are used in a wide range of medical, industrial, scientific, and military applications. Microwave imaging devices emit radio waves and process their reflections to reconstruct an image. However, data processing remains a challenge as image reconstruction algorithms are computationally expensive and many applications come with strictly constrained mechanical or power requirements. We developed Tinuso, a multicore architecture optimized for performance when implemented on an FPGA. Tinuso’s architecture is well suited to run highly parallel image reconstruction applications at a low power budget. In this paper, we describe the design and the implementation of Tinuso’s communication structures, which include a generic 2D mesh on-chip interconnect and a network interface to the processor pipeline. We optimize the network for a latency of one cycle per network hop and attain a high clock frequency by pipelining the feedback loop to manage contention. We implement a multicore configuration with 48 cores and achieve a clock frequency as high as 300 MHz with a peak switching data rate of 9.6 Gbits/s per link on state-of-the-art FPGAs.
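With one cycle per hop as described above, a packet's zero-load latency in the 2D mesh is simply the Manhattan distance between source and destination under dimension-ordered routing. A small sketch of that routing and latency model (the helper names and XY order are illustrative assumptions):

```python
# Illustrative dimension-ordered (XY) routing in a 2D mesh: route fully
# in X first, then in Y, one hop per cycle.
def xy_route(src, dst):
    """Return the hop-by-hop path of (x, y) nodes from src to dst."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def latency_cycles(src, dst):
    """Zero-load latency: one cycle per hop (Manhattan distance)."""
    return len(xy_route(src, dst)) - 1
```

Pipelining the contention feedback loop, as the abstract describes, keeps this one-cycle-per-hop figure while still reaching a high clock frequency.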
International Parallel and Distributed Processing Symposium | 2015
Nicklas Bo Jensen; Pascal Schleuniger; Andreas Erik Hindborg; Maxwell Walter; Sven Karlsson
Field-programmable gate arrays (FPGAs) have become an attractive implementation technology for a broad range of computing systems. We recently proposed a processor architecture, Tinuso, which achieves high performance by moving complexity from hardware to the compiler tool chain, which must then handle that increased complexity. However, it is not clear whether current production compilers can successfully meet the strict constraints on instruction order and generate efficient object code. In this paper, we present our experiences developing a compiler backend using the GNU Compiler Collection (GCC). For a set of C benchmarks, we show that a Tinuso implementation with our GCC backend reaches a relative speedup of up to 1.73 over a similar Xilinx MicroBlaze configuration while using 30% fewer hardware resources. While our experiences are generally positive, we expose some limitations in GCC that need to be addressed to achieve the full performance potential of Tinuso.
Archive | 2014
Julien Mottin; François Pacull; Ronan Keryell; Pascal Schleuniger
In SMECY, we believe that an efficient tool chain can only be defined once the type of parallelism required by an application domain and the hardware architecture are fixed. Furthermore, we believe that once such a set of tools is available, it is possible, with reasonable effort, to change the hardware architecture or the type of parallelism exploited.
Swedish Workshop on Multi-core Computing | 2010
Pascal Schleuniger; Sven Karlsson