Manuel Mohr

Karlsruhe Institute of Technology

Publication

Featured research published by Manuel Mohr.

compilers architecture and synthesis for embedded systems | 2013

Hardware acceleration for programs in SSA form

Manuel Mohr; Artjom Grudnitsky; Tobias Modschiedler; Lars Bauer; Sebastian Hack; Jörg Henkel

Register allocation is one of the most time-consuming parts of the compilation process. Depending on the quality of the register allocation, a large amount of shuffle code to move values between registers is generated. In this paper, we propose a processor architecture extension to provide register file permutations by which the shuffle code can be implemented more efficiently. We present compiler support to utilize this extension, an evaluation regarding performance and compilation time using the SPEC CINT2000 benchmark, as well as an analysis of area and frequency overhead of our architecture implementation. We find that using our extension, the number of executed instructions is reduced by up to 5.1 % while the compilation time is unaffected.

international conference on software engineering | 2011

Permission-based programming languages (NIER track)

Jonathan Aldrich; Ronald Garcia; Mark Hahnenberg; Manuel Mohr; Karl Naden; Darpan Saini; Sven Stork; Joshua Sunshine; Éric Tanter; Roger Wolff

Linear permissions have been proposed as a lightweight way to specify how an object may be aliased, and whether those aliases allow mutation. Prior work has demonstrated the value of permissions for addressing many software engineering concerns, including information hiding, protocol checking, concurrency, security, and memory management. We propose the concept of a permission-based programming language - a language whose object model, type system, and runtime are all co-designed with permissions in mind. This approach supports an object model in which the structure of an object can change over time, a type system that tracks changing structure in addition to addressing the other concerns above, and a runtime system that can dynamically check permission assertions and leverage permissions to parallelize code. We sketch the design of the permission-based programming language Plaid, and argue that the approach may provide significant software engineering benefits.

conference on object-oriented programming systems, languages, and applications | 2011

Plaid: a permission-based programming language

Jonathan Aldrich; Robert L. Bocchino; Ronald Garcia; Mark Hahnenberg; Manuel Mohr; Karl Naden; Darpan Saini; Sven Stork; Joshua Sunshine; Éric Tanter; Roger Wolff

Access permissions (permissions for short) are a lightweight way to specify how an object may be aliased and whether aliases allow mutation. Prior work has demonstrated the value of permissions for addressing many software engineering concerns, including information hiding, protocol checking, concurrency, security, and memory management. We propose a permission-based programming language: that is, a language whose object model, type system, and runtime are all co-designed with permissions in mind. The key elements of such a language are (1) an object model in which the structure of an object can change over time; (2) a type system that tracks changing structure in addition to addressing concerns such as those listed above; and (3) a runtime system that dynamically checks permission assertions and leverages permissions to parallelize code. We sketch the design of the permission-based programming language Plaid and argue that the approach promises significant software engineering benefits.

Proceedings of the ACM SIGPLAN Workshop on X10 | 2015

Cutting out the middleman: OS-level support for X10 activities

Manuel Mohr; Sebastian Buchwald; Andreas Zwinkau; Christoph Erhardt; Benjamin Oechslein; Jens Schedel; Daniel Lohmann

In the X10 language, computations are modeled as lightweight threads called activities. Since most operating systems only offer relatively heavyweight kernel-level threads, the X10 runtime system implements a user-space scheduler to map activities to operating-system threads in a many-to-one fashion. This approach can lead to suboptimal scheduling decisions or synchronization overhead. In this paper, we present an alternative X10 runtime system that targets OctoPOS, an operating system designed from the ground up for highly parallel workloads on PGAS architectures. OctoPOS offers an unconventional execution model based on i-lets, lightweight self-contained units of computation with (mostly) run-to-completion semantics that can be dispatched very efficiently. We are able to do a 1-to-1 mapping of X10 activities to i-lets, which results in a slim runtime system, avoiding the need for user-level scheduling and its costs. We perform microbenchmarks on a prototype many-core hardware architecture and show that our system needs fewer than 2000 clock cycles to spawn local and remote activities.

Archive | 2018

Aspects of Code Generation and Data Transfer Techniques for Modern Parallel Architectures

Manuel Mohr

In processor architecture, the focus of new developments has shifted from ever higher clock frequencies to ever more cores on a chip. A high core count makes it possible to offer cores of differing performance, and even dedicated cores with special instruction sets. Developing for such heterogeneous platforms is challenging and requires appropriate support from development tools such as compilers. Besides their heterogeneous core structure, a second dimension makes development for such architectures demanding: their memory structure. Maintaining global cache coherence makes it harder to reach high core counts. Hardware-based cache coherence protocols either scale poorly or are complicated and cause problems with execution time and energy efficiency. A radical solution to this problem is to abandon global cache coherence. However, it is difficult to map existing programming models efficiently onto such a hardware architecture with weak guarantees. The first part of this dissertation deals with data transfer techniques for non-cache-coherent shared-memory architectures. These architectures offer a shared physical address space but do not implement hardware-based coherence between all caches of the system. Logically partitioning the shared memory enables safe programming of such a platform. In general, this creates the need to copy data between memory partitions. We study compilation for invasive architectures, a family of non-cache-coherent many-core architectures. We consider the efficient implementation of data transfers of both simple and complex data structures on invasive architectures.
In particular, we propose a novel technique for copying complex pointer-based data structures that works without serialization. To this end, we generalize the object-cloning approach with compiler-directed automatic software-based coherence so that it also works in the context of non-coherent caches. We present implementations of several data transfer techniques within an existing compiler and its runtime system. We carry out a detailed evaluation of these implementations on an FPGA-based prototype of an invasive architecture. Finally, we propose adding hardware support for range-based cache operations and describe and evaluate possible implementations and their costs. The second part of this dissertation deals with accelerating shuffle code, which arises during register allocation, by using permutation instructions. The task of register allocation during compilation is to map program variables to machine registers. During register allocation, the compiler generates shuffle code, consisting of copy and swap instructions, to transfer values between registers. Depending on the quality of the register allocation and the number of available registers, a large amount of shuffle code may be generated. We propose accelerating the execution of shuffle code by means of novel permutation instructions that arbitrarily permute the contents of a few registers in one clock cycle. To demonstrate the feasibility of this idea, we first extend an existing RISC instruction format with permutation instructions. We then describe how the proposed permutation instructions can be implemented in an existing RISC architecture.
We then develop two code generation approaches that exploit the permutation instructions to accelerate shuffle code: a fast heuristic and an optimal approach based on dynamic programming. We prove quality and correctness properties of both approaches and show the optimality of the second. Next, we implement both code generation approaches in a compiler and thoroughly examine and compare their code quality using standardized benchmarks. First, we measure the exact number of dynamically executed instructions, which we then validate by measuring program run times on an FPGA-based prototype implementation of the RISC architecture extended with permutation instructions. Finally, we argue that permutation instructions can be implemented with little effort on modern out-of-order processor architectures that already support register renaming.

european conference on parallel processing | 2017

Shallow Water Waves on a Deep Technology Stack: Accelerating a Finite Volume Tsunami Model Using Reconfigurable Hardware in Invasive Computing

Alexander Pöppl; Marvin Damschen; Florian Schmaus; Andreas Fried; Manuel Mohr; Matthias Blankertz; Lars Bauer; Jörg Henkel; Michael Bader

Reconfigurable architectures are commonly used in the embedded systems domain to speed up compute-intensive tasks. They combine a reconfigurable fabric with a general-purpose microprocessor to accelerate compute-intensive tasks on the fabric while the general-purpose CPU is used for the rest of the workload. Through the use of invasive computing, we aim to show the feasibility of this technology for HPC scenarios. We demonstrate this by accelerating a proxy application for the simulation of shallow water waves using the i-Core, a reconfigurable processor that is part of the invasive computing multiprocessor system-on-chip. Using a floating-point custom instruction, the entire computation of numerical fluxes occurring in the application’s finite volume scheme is performed by hardware accelerators.

design, automation, and test in europe | 2017

Pegasus: Efficient data transfers for PGAS languages on non-cache-coherent many-cores

Manuel Mohr; Carsten Tradowsky

To improve scalability, some many-core architectures abandon global cache coherence, but still provide a shared address space. Partitioning the shared memory and communicating via messages is a safe way of programming such machines. However, accessing pointered data structures from a foreign memory partition is expensive due to the required serialization. In this paper, we propose a novel data transfer technique that avoids serialization overhead for pointered data structures by managing cache coherence in software at object granularity. We show that for PGAS programming languages, the compiler and runtime system can completely handle the necessary cache management, thus requiring no changes to application code. Moreover, we explain how cache operations working on address ranges complement our data transfer technique. We propose a novel non-blocking implementation of range-based cache operations by offloading them to an enhanced cache controller. We evaluate our approach on a non-cache-coherent many-core architecture using a distributed-kernel benchmark suite and demonstrate a reduction of communication time of up to 39.8%.

workshop on algorithms and data structures | 2015

Optimal Shuffle Code with Permutation Instructions

Sebastian Buchwald; Manuel Mohr; Ignaz Rutter

During compilation of a program, register allocation is the task of mapping program variables to machine registers. During register allocation, the compiler may introduce shuffle code, consisting of copy and swap operations, that transfers data between the registers. Three common sources of shuffle code are conflicting register mappings at joins in the control flow of the program, e.g., due to if-statements or loops; the calling convention for procedures, which often dictates that input arguments or results must be placed in certain registers; and machine instructions that only allow a subset of registers to occur as operands.
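
To make the notion of shuffle code concrete, here is a minimal sketch of the textbook way to turn a parallel register transfer into a sequence of copy and swap operations. It is not the algorithm from the paper, which targets permutation instructions. The function names `sequentialize` and `execute`, the register names, and the tuple encoding of operations are assumptions made for illustration only.

```python
def sequentialize(parallel_copy):
    """Turn a parallel copy {dst: src} into a list of sequential operations.

    Returns tuples ("copy", dst, src) and ("swap", a, b). Illustrative
    sketch only; not the code generation approach of the paper.
    """
    # Transfers of a register onto itself need no code.
    pending = {d: s for d, s in parallel_copy.items() if d != s}
    ops = []

    # Phase 1: a destination whose old value no pending transfer still
    # reads can be overwritten right away with a plain copy.
    while True:
        still_read = set(pending.values())
        free = [d for d in pending if d not in still_read]
        if not free:
            break
        for d in free:
            ops.append(("copy", d, pending.pop(d)))

    # Phase 2: the remaining transfers form disjoint cycles.
    # A cycle of length k is resolved with k - 1 swaps.
    while pending:
        start = next(iter(pending))
        d = start
        while pending[d] != start:
            s = pending.pop(d)
            ops.append(("swap", d, s))
            d = s
        pending.pop(d)  # the last register already holds its value
    return ops


def execute(ops, regs):
    """Apply the operations to a copy of the register file and return it."""
    regs = dict(regs)
    for kind, a, b in ops:
        if kind == "copy":
            regs[a] = regs[b]
        else:
            regs[a], regs[b] = regs[b], regs[a]
    return regs


if __name__ == "__main__":
    # r0 and r1 exchange values, r2 receives the old value of r0.
    transfer = {"r0": "r1", "r1": "r0", "r2": "r0", "r3": "r3"}
    print(sequentialize(transfer))
```

For the transfer in the usage example, the sketch emits one copy (`r2` from `r0`) followed by one swap (`r0` with `r1`). Copies are emitted before swaps so that values leaving a cycle are read before the cycle's registers are permuted.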

ACM Transactions on Programming Languages and Systems | 2014