Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Anselm Busse is active.

Publication


Featured research published by Anselm Busse.


International Conference on Systems | 2011

The pitfalls of deploying solid-state drive RAIDs

Nikolaus Jeremic; Gero Mühl; Anselm Busse; Jan Richling

Solid-State Drives (SSDs) are about to radically change the way we look at storage systems. Without moving mechanical parts, they have the potential to supplement or even replace hard disks in performance-critical applications in the near future. Storage systems applied in such settings are usually built using RAIDs consisting of a number of individual drives for both performance and reliability reasons. Most existing work on SSDs, however, deals with the architecture at the system level, the flash translation layer (FTL), and their influence on the overall performance of a single SSD device. Therefore, it is currently largely unclear whether RAIDs of SSDs exhibit different performance and reliability characteristics than those comprising hard disks, and to which issues we have to pay special attention to ensure optimal operation in terms of performance and reliability. In this paper, we present a detailed analysis of SSD RAID configuration issues and derive several pitfalls for deploying SSDs in common RAID level configurations that can lead to severe performance degradation. After presenting potential solutions for each of these pitfalls, we concentrate on the particular challenge that SSDs can suffer from bad random write performance. We identify over-provisioning as a potential solution to this problem and validate its effectiveness in common RAID level configurations by experiments whose results are compared to those of an analytical model that approximately predicts the random write performance of SSD RAIDs based on the characteristics of a single SSD. Our results show that over-provisioning is indeed an effective method that can increase random write performance in SSD RAIDs by more than an order of magnitude, eliminating the potential Achilles heel of SSD-based storage systems.
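The connection between over-provisioning and random write performance can be illustrated with a toy model. Note that this is not the paper's analytical model: the write-amplification formula below is a rough, commonly cited greedy-garbage-collection approximation, and all function names and numbers are illustrative assumptions.

```python
def write_amplification(spare_factor):
    """Rough write-amplification estimate for uniform random writes under
    greedy garbage collection (illustrative approximation only)."""
    if not 0.0 < spare_factor < 1.0:
        raise ValueError("spare_factor must be in (0, 1)")
    # More spare (over-provisioned) capacity -> less data copied per erase.
    return max(1.0, (1.0 - spare_factor) / (2.0 * spare_factor))

def raid_random_write_iops(n_drives, single_drive_iops, spare_factor):
    """Approximate steady-state random write IOPS of an n-drive RAID-0 set
    of identical SSDs, scaled from one drive's raw flash write IOPS."""
    return n_drives * single_drive_iops / write_amplification(spare_factor)
```

Raising the spare factor from a typical 7% to 28%, for instance, cuts the modeled write amplification several-fold and lifts sustained random write throughput accordingly, which mirrors the qualitative effect the paper measures.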


International Conference on Parallel Architectures and Compilation Techniques | 2014

kMAF: automatic kernel-level management of thread and data affinity

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations such as memory access traces or binary analysis, require changes to the hardware, or work only with specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).
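The core bookkeeping idea, sampling page faults to learn both which NUMA node touches each page and which threads share pages, can be sketched in user space. This is a hypothetical illustration only: kMAF itself is a kernel mechanism and its data structures and policies differ.

```python
from collections import defaultdict

class AffinityTracker:
    """Toy kMAF-style bookkeeping: observe page faults to estimate per-page
    NUMA affinity and thread sharing. (Illustrative sketch, not kMAF's code.)"""

    def __init__(self):
        self.page_node_hits = defaultdict(lambda: defaultdict(int))
        self.page_threads = defaultdict(set)
        self.sharing = defaultdict(int)  # (t1, t2) -> shared-page fault count

    def record_fault(self, thread, node, page):
        """One sampled page fault: thread running on `node` touched `page`."""
        self.page_node_hits[page][node] += 1
        for other in self.page_threads[page]:
            if other != thread:
                self.sharing[tuple(sorted((thread, other)))] += 1
        self.page_threads[page].add(thread)

    def preferred_node(self, page):
        """NUMA node that faulted on this page most often -> migration target."""
        hits = self.page_node_hits[page]
        return max(hits, key=hits.get) if hits else None
```

A data-mapping policy would periodically migrate each page to its `preferred_node`, while the sharing counts feed a thread-affinity decision (co-scheduling heavy sharers near each other).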


Parallel Computing | 2015

Communication-aware process and thread mapping using online communication detection

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

Highlights: We perform online detection of inter-process and inter-thread communication. The detected communication pattern is used to migrate processes and threads. The mechanism is implemented at the operating system level, with no changes to applications or runtime libraries. We reduce execution time and energy consumption. Evaluations on shared memory machines and a cluster show substantial improvements.

The rising complexity of memory hierarchies and interconnections in parallel shared memory architectures leads to differences in communication performance. These differences can be exploited to perform a communication-aware mapping of parallel applications to the hardware topology, improving their performance and energy efficiency. To perform the mapping, it is necessary to determine the communication behavior of the processes and threads of the application. Previous methods rely on static communication traces to detect communication, require hardware changes, or support only a subset of parallelization models. We propose CDSM, Communication Detection in Shared Memory, a mechanism that detects communication from page faults and uses this information to perform the mapping. CDSM works at the operating system level during the execution of the parallel application and supports all parallelization models that use shared memory for communication. It does not require modifications to the applications, previous knowledge about their behavior, or changes to the hardware and runtime libraries. Experiments with the MPI, MPI+OpenMP, and OpenMP implementations of the NAS parallel benchmarks, the HPCC benchmark, and the PARSEC benchmark suite on a shared memory machine show that CDSM has a high detection accuracy with negligible overhead. Execution time and processor energy consumption were reduced by up to 35.9% and 18.9%, respectively (10.2% and 7.3%, on average). Experiments on a cluster system, where CDSM optimizes the communication within each node, showed an average execution time reduction of 10.4%.
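The two steps of the approach, deriving a communication pattern from faults on shared pages and then mapping heavy communicators onto nearby cores, can be sketched as follows. This is a simplified user-space illustration; CDSM's actual detection and mapping algorithms live in the kernel and differ in detail.

```python
from collections import defaultdict
from itertools import combinations

def communication_matrix(fault_log):
    """Estimate thread-to-thread communication from (thread, page) fault
    samples: threads faulting on the same page are assumed to communicate
    through it. (Illustrative sketch of the detection idea.)"""
    readers = defaultdict(set)
    for thread, page in fault_log:
        readers[page].add(thread)
    comm = defaultdict(int)
    for threads in readers.values():
        for a, b in combinations(sorted(threads), 2):
            comm[(a, b)] += 1
    return comm

def map_pairs_to_cores(comm, core_pairs):
    """Greedy communication-aware mapping: place the heaviest-communicating
    thread pairs on cores that share a cache level (core_pairs)."""
    mapping, used = {}, set()
    for (a, b), _ in sorted(comm.items(), key=lambda kv: -kv[1]):
        for c1, c2 in core_pairs:
            if c1 not in used and c2 not in used \
                    and a not in mapping and b not in mapping:
                mapping[a], mapping[b] = c1, c2
                used.update((c1, c2))
                break
    return mapping
```

The greedy pairing here is a stand-in for a real topology-aware mapping algorithm; threads that never share a page simply stay where the default scheduler put them.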


ACM Symposium on Applied Computing | 2013

Analyzing resource interdependencies in multi-core architectures to improve scheduling decisions

Anselm Busse; Jan Hendrik Schönherr; Matthias Diener; Gero Mühl; Jan Richling

Since the advent of multi-core processors, different multi-core system architectures, and in particular processor architectures, have emerged, exhibiting individual advantages and disadvantages. One of the main distinguishing factors among these architectures is their varying degree and type of resource sharing among individual cores. On the one hand, resource sharing is necessary for the cores to communicate; on the other hand, resource sharing is often used for economic reasons. Depending on the degree and type of resource sharing, the impact on performance depends on the applied workload and can vary to a large extent. In this paper, we investigate the impact of different kinds of resource interdependencies found in current processors on the performance of scheduling strategies using a set of benchmarks. Our results show that the architecture has a major impact on the performance of a process placement strategy. However, they also point out that simple strategies taking only a few basic architectural characteristics into account fall short. Thus, new holistic scheduling strategies are needed that take more characteristics into account.


IEEE Transactions on Parallel and Distributed Systems | 2016

Kernel-Based Thread and Data Mapping for Improved Memory Affinity

Matthias Diener; Eduardo Henrique Molina da Cruz; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

Reducing the cost of memory accesses, both in terms of performance and energy consumption, is a major challenge in shared-memory architectures. Modern systems have deep and complex memory hierarchies with multiple cache levels and memory controllers, leading to Non-Uniform Memory Access (NUMA) behavior. In such systems, there are two ways to improve the memory affinity: First, by mapping threads that share data to cores with a shared cache, cache usage and communication performance are optimized. Second, by mapping memory pages to memory controllers that perform the most accesses to them and are not overloaded, the average cost of accesses is reduced. We call these two techniques thread mapping and data mapping, respectively. Thread and data mapping should be performed in an integrated way to achieve a compounding effect that results in higher improvements overall. Previous work in this area requires expensive tracing operations to perform the mapping, or requires changes to the hardware or to the parallel application. In this paper, we propose kMAF, a mechanism that performs integrated thread and data mapping in the kernel. kMAF uses the page faults of parallel applications to characterize their memory access behavior and performs the mapping during the execution of the application based on the detected behavior. In an evaluation with a large set of parallel benchmarks executing on three NUMA architectures, kMAF achieved substantial performance and energy efficiency improvements, close to an oracle-based mechanism and significantly higher than previous proposals.


ACM Symposium on Applied Computing | 2012

Operating system support for dynamic over-provisioning of solid state drives

Nikolaus Jeremic; Gero Mühl; Anselm Busse; Jan Richling

Employing solid state drives (SSDs) can lift the performance of persistent storage systems to a new level. However, in order to ensure a continuously high write throughput, especially for small random writes, it is crucial to always maintain a substantial amount of free flash capacity. This can be achieved by additional over-provisioning and/or the TRIM command, which notifies an SSD of storage space that is no longer required. Since additional over-provisioning is disadvantageous, as SSDs already have a higher cost per byte compared to hard disk drives, using TRIM seems favorable. However, most intermediate software layers (e.g., file system encryption, software RAID drivers, or a logical volume manager), but also hardware RAID controllers, currently do not pass TRIM commands to the underlying devices, making additional over-provisioning appear to be the only feasible solution at present. In this paper, we tackle this problem with dynamic over-provisioning, allowing the OS to use additional storage capacity for a limited amount of time to supply more storage in peak demand situations. In doing so, we accept temporarily degraded performance in order to supply further storage capacity. After a peak situation, the utilized SSDs are notified via the TRIM command about allocated storage capacity that is no longer needed. Our experimental results show that dynamic over-provisioning works and can be quite effective for both single SSDs and storage systems with multiple SSDs, such as SSD RAIDs.
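The borrow-then-TRIM lifecycle described above can be sketched as a small state machine. The class and its interface are illustrative assumptions, not the paper's implementation; on a real Linux block device the trim callback would issue a discard (e.g., via the BLKDISCARD ioctl) rather than append to a list.

```python
class DynamicOverProvisioner:
    """Sketch of dynamic over-provisioning: a volume keeps `spare` capacity
    unused for garbage collection, lends part of it out during peak demand,
    and TRIMs (releases) it again afterwards. (Hypothetical interface.)"""

    def __init__(self, capacity, spare, trim_fn):
        self.capacity, self.spare, self.lent = capacity, spare, 0
        self.trim_fn = trim_fn  # would issue a discard on a real device

    def borrow(self, amount):
        """Temporarily expose spare capacity as usable storage (peak begins).
        While lent out, less spare remains and write performance degrades."""
        if amount > self.spare - self.lent:
            raise ValueError("not enough spare capacity")
        self.lent += amount
        return amount

    def release(self, amount):
        """Peak is over: reclaim the capacity and notify the SSD via TRIM
        so the FTL knows these blocks are free again."""
        amount = min(amount, self.lent)
        self.lent -= amount
        self.trim_fn(amount)
        return amount

    def effective_spare(self):
        return self.spare - self.lent
```

The key point the sketch captures is that the performance penalty is bounded in time: once `release` runs, the TRIM notification restores the full spare pool and with it the original write performance.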


ACM Symposium on Applied Computing | 2016

Load-aware scheduling for heterogeneous multi-core systems

Mohannad Nabelsee; Anselm Busse; Helge Parzyjegla; Gero Mühl

Heterogeneous multi-core systems are becoming more and more common today. To be used to their full potential, the operating system has to be adapted to the new system environment. This is especially true for the scheduler, as it is crucial to overall system performance. In this paper, we present a scheduling approach for heterogeneous systems with two different kinds of cores: one kind that is very power efficient but offers only limited computing power, and another that delivers very high performance but consumes considerably more power. We consider such heterogeneity for a centralized scheduler architecture. In our approach, we introduce a new load metric in order to classify whether or not tasks are suited to be executed on a high-performance core. Based on this metric, we present a task state model for scheduling tasks according to their performance classification. We implemented the scheduling approach by extending the Brain Fuck Scheduler (BFS) and evaluated it on an eight-core heterogeneous architecture with four low-performance and four high-performance cores. The evaluation covers system responsiveness and high-load behaviour compared to the vanilla BFS and the decentralized Completely Fair Scheduler (CFS). Even though our approach takes the heterogeneity into account, the results show that it scales better than the vanilla BFS while nearly maintaining its superior responsiveness.
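A load-metric-based big/little classification of the kind described could look roughly like this. The metric, decay factor, and threshold are illustrative assumptions, not the paper's definitions:

```python
def update_load(prev_load, ran_this_period, decay=0.75):
    """Exponentially decayed per-task load estimate in [0, 1]:
    1.0 means the task was runnable in every recent period.
    (Illustrative metric; the paper defines its own load measure.)"""
    return decay * prev_load + (1.0 - decay) * (1.0 if ran_this_period else 0.0)

def classify(load, threshold=0.5):
    """Tasks above the threshold become candidates for a high-performance
    (big) core; light tasks stay on the power-efficient (little) cores."""
    return "big" if load > threshold else "little"
```

A task state model would then move tasks between "little", "candidate", and "big" states as this classification changes over time, with some hysteresis to avoid ping-ponging between core types.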


Symposium on Computer Architecture and High Performance Computing | 2015

CoBaS: Introducing a Component Based Scheduling Framework

Anselm Busse; Reinhardt Karnapke; Hans-Ulrich Heiß

Many-core systems and heterogeneous systems are getting more and more common and may soon enter the mainstream market. To harness their capabilities to their full potential, the runtime system's scheduling policies have to be adapted and, in many cases, tailored to the specific system. The runtime system can be either an operating system or the management infrastructure of an infrastructure-as-a-service (IaaS) platform. Developing, implementing, and testing such scheduling policies is, in general, a challenging task. In this work, we present CoBaS, a component-based scheduling framework for multi- and many-core runtime systems. The main purpose of CoBaS is to simplify the implementation of scheduling policies and to increase code reuse, saving development time. CoBaS uses a novel approach to reach that goal: it allows the breakdown of a policy implementation into several reusable components. Through composition, fast prototyping, testing, and evaluation of new scheduling policies is possible without reimplementing every functional part. CoBaS uses an event-based approach to distribute information about system states and state changes between the runtime system and the components, as well as among the components themselves. Furthermore, it has a facility to hand over ordered task sets between components. We have adapted both the Linux and FreeBSD kernels to use CoBaS by completely removing the native scheduler. The integration of CoBaS into those kernels shows the feasibility of our approach.
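The component-composition idea, breaking a scheduling policy into reusable parts that hand ordered task sets to one another, might be sketched like this. The interface is hypothetical and much simpler than CoBaS's real component API:

```python
class Component:
    """Minimal scheduling component in the spirit of CoBaS: each component
    transforms or reorders the task set handed to it. (Hypothetical API.)"""
    def order(self, tasks):
        return tasks

class RunnableFilter(Component):
    """Component that keeps only runnable tasks."""
    def order(self, tasks):
        return [t for t in tasks if t["runnable"]]

class PriorityOrder(Component):
    """Component that orders tasks by static priority (lower = first)."""
    def order(self, tasks):
        return sorted(tasks, key=lambda t: t["prio"])

def compose(components, tasks):
    """Hand the ordered task set from one component to the next; the head
    of the final list would be the task picked to run."""
    for c in components:
        tasks = c.order(tasks)
    return tasks
```

Swapping `PriorityOrder` for, say, a round-robin or deadline component changes the policy without touching the filter component, which is the kind of reuse the framework aims for.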


Networking, Architecture and Storage | 2012

Dataset Management-Aware Software Architecture for Storage Systems Based on SSDs

Nikolaus Jeremic; Gero Mühl; Anselm Busse; Jan Richling

Solid-state drives (SSDs) based on flash memory offer the opportunity to build high-performance storage systems with low energy consumption and high reliability. Crucial concerns of current SSDs are their write performance, especially for small random requests, and the limited lifespan of their flash memory. Both can be mitigated by providing an SSD with information about the stored data. This may include notifications about the deallocation of storage capacity or the prevalent access type. Knowing the size, latency requirements, and type of upcoming requests can help to improve the dataset management (DSM) of an SSD, allowing further performance improvements and extension of the memory lifetime. Often such information has to be passed through intermediate layers (e.g., RAIDs) placed between the information sources (e.g., the file system) and the information sinks (i.e., the SSDs). Problems arise because such layers often do not appropriately propagate DSM commands, making the intended optimizations infeasible. In this paper, we propose a file system-independent, layered I/O software architecture that enables the handling of DSM commands throughout all of the layers. Moreover, it allows addressing cross-cutting concerns related to the propagation of DSM commands. Its applicability is demonstrated by an exemplary instance based on stacked software RAID layers. The results of an experimental evaluation based on Linux Software RAID clearly show the benefits of the proposed architecture.
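Propagating a DSM command such as TRIM through a striping layer essentially means remapping the logical range onto the member devices so the command can be forwarded downward. A minimal sketch for plain RAID-0 striping (the layout and units are illustrative, not the paper's architecture):

```python
def split_discard(offset, length, chunk_size, n_devices):
    """Map a logical discard (DSM/TRIM) range onto the members of a RAID-0
    stripe set. Offsets and lengths are in sectors; chunks are distributed
    round-robin across devices. Returns {device: [(dev_offset, length)]}."""
    per_device = {}
    pos, end = offset, offset + length
    while pos < end:
        chunk = pos // chunk_size        # global chunk index
        dev = chunk % n_devices          # round-robin striping
        stripe = chunk // n_devices      # row of chunks on that device
        in_chunk = pos % chunk_size
        take = min(chunk_size - in_chunk, end - pos)
        dev_off = stripe * chunk_size + in_chunk
        per_device.setdefault(dev, []).append((dev_off, take))
        pos += take
    return per_device
```

A stacked architecture would apply this remapping at every layer (e.g., RAID on top of RAID), so that the TRIM finally reaching each SSD refers to the correct physical range on that device.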


ACM International Conference on Systems and Storage | 2017

Simulation-based tracing and profiling for system software development

Anselm Busse; Reinhardt Karnapke; Helge Parzyjegla

Tracing and profiling low-level kernel functions (e.g., as found in the process scheduler) is a challenging, yet necessary, task in both research and production in order to acquire detailed insights and achieve peak performance. Several kernel functions are known to be untraceable because of architectural limitations, whereas tracing other functions causes side effects and skews profiling results. In this paper, we present a novel, simulation-based approach to analyzing the behavior and performance of kernel functions. Kernel code is executed on a simulated hardware platform, avoiding the bias caused by collecting the tracing data within the system under observation. From the flat call trace generated by the simulator, we reconstruct the entire call graph and enrich it with detailed profiling statistics. Specifying regions of interest enables developers to systematically explore the system behavior and identify performance bottlenecks. As a case study, we analyze the process scheduler of the Linux kernel. We are interested in quantifying the synchronization overhead caused by a growing number of CPU cores in a custom, semi-partitioned scheduler design. Conventional tracing methods were not able to obtain measurements with the required accuracy and granularity.
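Reconstructing a call graph with profiling statistics from a flat trace can be sketched with a simple stack replay. The trace record format ("enter"/"exit", function, timestamp) is an assumption for illustration; a real simulator trace carries more detail:

```python
from collections import defaultdict

def profile_trace(events):
    """Replay a flat call trace of ("enter"/"exit", function, timestamp)
    records to recover call-graph edges and per-function statistics.
    Returns ({function: (calls, inclusive_time)}, {(caller, callee): count}).
    (Sketch only; assumes a balanced single-CPU trace.)"""
    stack, stats = [], {}
    edges = defaultdict(int)
    for kind, func, ts in events:
        if kind == "enter":
            stack.append((func, ts))
        else:
            f, start = stack.pop()
            assert f == func, "unbalanced trace"
            if stack:  # whoever is still on the stack is the caller
                edges[(stack[-1][0], func)] += 1
            calls, total = stats.get(func, (0, 0))
            stats[func] = (calls + 1, total + (ts - start))
    return stats, edges
```

Restricting the replay to a region of interest (e.g., only events between two scheduler invocations) yields the focused per-region profiles the paper uses to isolate synchronization overhead.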

Collaboration


An overview of Anselm Busse's collaborations.

Top Co-Authors

Helge Parzyjegla

Technical University of Berlin

Jan Richling

Technical University of Berlin

Matthias Diener

Universidade Federal do Rio Grande do Sul

Philippe Olivier Alexandre Navaux

Universidade Federal do Rio Grande do Sul

Reinhardt Karnapke

Technical University of Berlin

Eduardo Henrique Molina da Cruz

Universidade Federal do Rio Grande do Sul

Hans-Ulrich Heiß

Technical University of Berlin
