Publication
Featured research published by Stephan Diestelhorst.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017
Matthew J. Walker; Stephan Diestelhorst; Andreas Hansson; Anup Das; Sheng Yang; Bashir M. Al-Hashimi
Modern mobile and embedded devices are required to be increasingly energy-efficient while running more sophisticated tasks, causing the CPU design to become more complex and employ more energy-saving techniques. This has created a greater need for fast and accurate power estimation frameworks for both run-time CPU energy management and design-space exploration. We present a statistically rigorous and novel methodology for building accurate run-time power models using performance monitoring counters (PMCs) for mobile and embedded devices, and demonstrate how our models make more efficient use of limited training data and better adapt to unseen scenarios by uniquely considering stability. Our robust model formulation reduces multicollinearity, allows separation of static and dynamic power, and allows a 100× reduction in experiment time while sacrificing only 0.6% accuracy. We present a statistically detailed evaluation of our model, highlighting and addressing the problem of heteroscedasticity in power modeling. We present software implementing our methodology and build power models for ARM Cortex-A7 and Cortex-A15 CPUs, with 3.8% and 2.8% average error, respectively. We model the behavior of the nonideal CPU voltage regulator under dynamic CPU activity to improve modeling accuracy by up to 5.5% in situations where the voltage cannot be measured. To address the lack of research utilizing PMC data from real mobile devices, we also present our data acquisition method and experimental platform software. We support this paper with online resources including software tools, documentation, raw data and further results.
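As an illustration of the kind of PMC-based regression modelling the abstract describes, the sketch below fits a linear run-time power model with ordinary least squares and separates a constant (static) term from activity-dependent terms. The event names, sample values, and coefficients are hypothetical and are not taken from the paper.

```python
# Minimal sketch of a PMC-based run-time power model (hypothetical data).
# Power is modelled as a static term plus a weighted sum of PMC event rates,
# fitted with ordinary least squares.
import numpy as np

# Hypothetical training samples: each row is one workload/DVFS observation.
# Columns: cycles/s, instructions/s, L2 misses/s.
pmc_rates = np.array([
    [1.0e9, 0.8e9, 2.0e6],
    [1.4e9, 1.1e9, 3.5e6],
    [2.0e9, 1.6e9, 6.0e6],
    [2.0e9, 0.9e9, 9.0e6],
    [1.2e9, 1.0e9, 1.0e6],
    [1.8e9, 1.5e9, 4.0e6],
])
measured_power_w = np.array([0.62, 0.85, 1.30, 1.18, 0.70, 1.12])  # hypothetical

# Design matrix with an intercept column so the fit separates a constant
# (static) component from the activity-dependent (dynamic) components.
X = np.column_stack([np.ones(len(pmc_rates)), pmc_rates])
coeffs, *_ = np.linalg.lstsq(X, measured_power_w, rcond=None)

static_power, event_weights = coeffs[0], coeffs[1:]
predicted = X @ coeffs
mean_rel_err = np.mean(np.abs(predicted - measured_power_w) / measured_power_w)
print(f"static ~ {static_power:.3f} W, weights = {event_weights}")
print(f"mean relative error on training data: {100 * mean_rel_err:.1f}%")
```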
International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) | 2016
Radhika Jagtap; Stephan Diestelhorst; Andreas Hansson; Matthias Jung; Norbert Wehn
Simulation tools are indispensable to computer architects. Detailed execution-driven CPU models offer high accuracy, but at the cost of simulation speed. Trace-driven simulation is widely adopted to alleviate this problem, especially for studies focusing on memory-system exploration. Ideally, trace-driven core models will mimic out-of-order processors executing full-system workloads to enable computer architects to evaluate modern systems. Additionally, to be useful to the broader community the tracing and replay models should be publicly available. However, existing trace-driven approaches are limited in their applicability and availability. We propose elastic traces in which we accurately capture data and load/store order dependencies by instrumenting a detailed out-of-order processor model. In contrast to existing work, we do not rely on offline analysis of timestamps, and instead use accurate dependency information tracked inside the processor pipeline. We thereby account for the effects of speculation and branch misprediction resulting in a more accurate trace playback. We provide a trace player that honours the dependencies and thus adapts its execution time to memory-system changes, as would the actual CPU. Compared to the detailed CPU, our trace player achieves a speed-up of 6–8 times. When modifying the memory-system parameters, the average error in absolute execution time is 7% for SPEC 2006 benchmarks on a bare metal system and 17% for HPC benchmarks on Linux. Relative performance is predicted with less than 3% error, achieving fast and accurate system performance exploration. We make this functionality available to the broader community via a widely-used open source full-system simulator.
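To make the idea of dependency-honouring replay concrete, here is a toy Python sketch in which each trace record may issue only once the records it depends on have completed, so the replayed execution time stretches or shrinks with memory latency. The record format and latencies are invented for illustration and are far simpler than the elastic traces described above.

```python
# Toy replay of an "elastic" trace: each record lists the records it depends
# on, and its issue time adapts to when those dependencies complete.
# Record fields and latencies are hypothetical, not the actual trace format.
from dataclasses import dataclass

@dataclass
class Record:
    rid: int
    kind: str      # "load", "store" or "compute"
    deps: list     # record ids this record must wait for
    latency: int   # cycles once issued (stands in for memory timing)

trace = [
    Record(0, "load",    [],      80),
    Record(1, "compute", [0],      2),
    Record(2, "load",    [],      80),
    Record(3, "compute", [1, 2],   2),
    Record(4, "store",   [3],     40),
]

completion = {}
finish_time = 0
for rec in trace:
    # A record may issue only after all of its dependencies have completed,
    # so changing memory latency changes the replayed execution time.
    ready = max((completion[d] for d in rec.deps), default=0)
    completion[rec.rid] = ready + rec.latency
    finish_time = max(finish_time, completion[rec.rid])

print(f"replayed execution time: {finish_time} cycles")
```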
International Symposium on Performance Analysis of Systems and Software (ISPASS) | 2017
Mohammad Alian; Umur Darbaz; Gabor Dozsa; Stephan Diestelhorst; Daehoon Kim; Nam Sung Kim
When analyzing a distributed computer system, we often observe that the complex interplay among processor, node, and network sub-systems can profoundly affect the performance and power efficiency of the distributed computer system. Therefore, to effectively cross-optimize hardware and software components of a distributed computer system, we need a full-system simulation infrastructure that can precisely capture the complex interplay. Responding to the aforementioned need, we present dist-gem5, a flexible, detailed, and open-source full-system simulation infrastructure that can model and simulate a distributed computer system using multiple simulation hosts. Then we validate dist-gem5 against a physical cluster and show that the latency and bandwidth of the simulated network sub-system are within 18% of the physical one. Compared with the single threaded and parallel versions of gem5, dist-gem5 speeds up the simulation of a 63-node computer cluster by 83.1x and 12.8x, respectively.
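The sketch below illustrates, in a very reduced form, the kind of quantum-based barrier synchronisation a multi-host simulation needs so that simulated time never drifts between nodes. Using Python threads in place of real simulation hosts and a fixed quantum length are assumptions made purely for illustration; this is not dist-gem5's implementation.

```python
# Minimal sketch of quantum-based synchronisation between simulated nodes.
# A real distributed simulation runs one process per simulation host; here
# threads stand in for hosts, and a Barrier stops any node from simulating
# past a quantum boundary before the others catch up.
import threading

QUANTUM_TICKS = 1000   # hypothetical synchronisation quantum
TOTAL_TICKS = 5000
NUM_NODES = 3

barrier = threading.Barrier(NUM_NODES)

def simulate_node(node_id: int) -> None:
    now = 0
    while now < TOTAL_TICKS:
        # Advance local simulated time by one quantum (placeholder for the
        # actual event-driven simulation of this node).
        now += QUANTUM_TICKS
        # Wait for every other node at the quantum boundary, so cross-node
        # messages are never delivered "in the past".
        barrier.wait()
    print(f"node {node_id} reached {now} ticks")

threads = [threading.Thread(target=simulate_node, args=(i,)) for i in range(NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```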
Power and Timing Modeling, Optimization and Simulation (PATMOS) | 2016
Matthew J. Walker; Stephan Diestelhorst; Andreas Hansson; Domenico Balsamo; Bashir M. Al-Hashimi
Accurate and stable CPU power models are fundamental in modern systems-on-chip (SoCs) for two main reasons: 1) they enable significant online energy savings by providing a run-time manager with reliable power consumption data for controlling CPU energy-saving techniques; 2) they can be used as accurate and trusted reference models for system design and exploration. We begin by showing the limitations in typical performance monitoring counter (PMC) based power modelling approaches and illustrate how an improved model formulation results in a more stable model that efficiently captures relationships between the input variables and the power consumption. Using this as a solid foundation, we present a methodology for adding thermal-awareness and analytically decomposing the power into its constituent parts. We develop and validate our methodology using data recorded from a quad-core ARM Cortex-A15 mobile CPU and we achieve an average prediction error of 3.7% across 39 diverse workloads, 8 Dynamic Voltage-Frequency Scaling (DVFS) levels and with a CPU temperature ranging from 31 °C to 91 °C. Moreover, we measure the effect of switching cores offline and decompose the existing power model to estimate the static power of each CPU and L2 cache, the dynamic power due to constant background (BG) switching, and the dynamic power caused by the activity of each CPU individually. Finally, we provide our model equations and software tools for implementation in a run-time manager or for use with an architectural simulator, such as gem5.
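A common way to add thermal awareness to such a model is to let the static term grow with temperature while the dynamic term scales with activity, V² and f. The sketch below uses that generic CMOS-style decomposition with made-up coefficients; it is not the paper's fitted equation.

```python
# Sketch of a thermally-aware composite power model (coefficients invented).
# Static power rises with temperature (leakage); dynamic power scales with
# activity, V^2 and f, mirroring the usual CMOS decomposition.
import math

def static_power_w(voltage_v: float, temp_c: float) -> float:
    # Hypothetical leakage model: exponential in temperature.
    k0, k1 = 0.10, 0.025
    return k0 * voltage_v * math.exp(k1 * (temp_c - 25.0))

def dynamic_power_w(voltage_v: float, freq_hz: float, activity: float) -> float:
    # "activity" stands in for a weighted sum of PMC event rates in [0, 1].
    c_eff = 1.1e-9  # hypothetical effective switched capacitance
    return c_eff * activity * voltage_v ** 2 * freq_hz

for temp in (31.0, 60.0, 91.0):
    total = static_power_w(1.05, temp) + dynamic_power_w(1.05, 1.6e9, 0.6)
    print(f"T={temp:>4.0f} C -> estimated CPU power {total:.2f} W")
```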
International Symposium on Performance Analysis of Systems and Software (ISPASS) | 2016
Radhika Jagtap; Stephan Diestelhorst; Andreas Hansson
As computer systems become increasingly complex, the need for fast and accurate simulation tools increases. Accurate but slow processor core models are often substituted with simple trace players to achieve faster memory-system simulation. However, existing trace-driven simulation techniques are limited in their applicability and availability. In this work, we capture elastic traces containing out-of-order core dependencies and effects of speculative execution, which overcome limitations of existing work. Additionally, we make our capture and replay modelling available in the gem5 simulator. Our trace-driven CPU achieves a speed-up of 6-8x compared to the reference core and predicts the performance with less than 1% error on average when the memory-system is changed.
ACM Transactions on Architecture and Code Optimization | 2018
Karthik Sangaiah; Michael Lui; Radhika Jagtap; Stephan Diestelhorst; Siddharth Nilakantan; Ankit More; Baris Taskin; Mark Hempstead
Trace-driven simulation of chip multiprocessor (CMP) systems offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, portability, and scalability. However, trace-based simulation approaches have difficulty capturing and accurately replaying multithreaded traces due to the inherent nondeterminism in the execution of multithreaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multithreaded simulation methodology. By recording synchronization events relevant to modern threading libraries (e.g., Pthreads and OpenMP) and dependencies in the traces, independent of the host architecture, the methodology is able to accurately model the nondeterminism of multithreaded programs for different hardware platforms and threading paradigms. By capturing high-level instruction categories, the SynchroTrace average-CPI trace-replay timing model offers fast and accurate simulation of many-core in-order CMPs. We perform two case studies to validate the SynchroTrace simulation flow against the gem5 full-system simulator: (1) a constraint-based design space exploration with traditional CMP benchmarks and (2) a thread-scalability study with HPC-representative applications. The results from these case studies show that (1) our trace-based approach with trace filtering achieves a peak speedup of 18.7× over gem5 full-system simulation, with an average speedup of 9.6×, (2) SynchroTrace maintains the thread-scaling accuracy of gem5 and can efficiently scale up to 64 threads, and (3) SynchroTrace can trace on one platform and model any platform in the early stages of design.
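To illustrate the idea of platform-independent multithreaded traces, the toy sketch below replays per-thread traces of abstract compute events (converted to cycles via a per-category CPI) while honouring recorded barrier events, so thread interleaving is modelled rather than re-executed. The event format and CPI values are invented and far coarser than SynchroTrace's.

```python
# Toy replay of multithreaded traces with synchronisation events.
# Per-thread traces hold compute events (instruction counts, turned into
# cycles via a hypothetical per-category CPI) and barrier events; the replay
# honours the barriers instead of re-executing the program.

CPI = {"int": 1.0, "mem": 4.0}   # hypothetical cycles-per-instruction by class

# Per-thread traces: ("compute", category, instruction_count) or ("barrier",)
traces = {
    0: [("compute", "int", 1000), ("barrier",), ("compute", "mem", 300)],
    1: [("compute", "mem", 500),  ("barrier",), ("compute", "int", 200)],
}

clocks = {tid: 0.0 for tid in traces}
positions = {tid: 0 for tid in traces}

def replay_until_barrier(tid: int) -> bool:
    """Advance one thread's clock until it hits a barrier or runs out of events."""
    while positions[tid] < len(traces[tid]):
        event = traces[tid][positions[tid]]
        positions[tid] += 1
        if event[0] == "barrier":
            return True
        _, category, count = event
        clocks[tid] += count * CPI[category]
    return False

active = set(traces)
while active:
    hit_barrier = {tid for tid in list(active) if replay_until_barrier(tid)}
    active = hit_barrier
    if hit_barrier:
        # All threads that reached the barrier resume at the latest arrival time.
        release = max(clocks[tid] for tid in hit_barrier)
        for tid in hit_barrier:
            clocks[tid] = release

print("per-thread completion times (cycles):", clocks)
```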
Power and Timing Modeling, Optimization and Simulation (PATMOS) | 2017
Basireddy Karunakar Reddy; Matthew J. Walker; Domenico Balsamo; Stephan Diestelhorst; Bashir M. Al-Hashimi
Power modelling is important for modern CPUs to inform power management approaches and allow design space exploration. Power simulators, combined with a full-system architectural simulator such as gem5, enable power-performance trade-offs to be investigated early in the design of a system with different configurations (e.g., number of cores, cache size). However, the accuracy of existing power simulators, such as McPAT, is known to be low due to abstraction and specification errors, and this can lead to incorrect research conclusions. In this paper, we present an accurate power model, built from measured data, integrated into gem5 for estimating the power consumption of a simulated quad-core ARM Cortex-A15. A power modelling methodology based on Performance Monitoring Counters (PMCs) is used to build and evaluate the integrated model in gem5. We first validate this methodology on real hardware with 60 workloads at nine Dynamic Voltage and Frequency Scaling (DVFS) levels and four core mappings (2,160 samples), showing an average error between estimated and measured power of less than 6%. Correlation between gem5 activity statistics and hardware PMCs is investigated to build a gem5 model representing a quad-core ARM Cortex-A15. Experimental validation with 15 workloads at four DVFS levels on real hardware and gem5 has been conducted to understand how the difference between the gem5 simulated activity statistics and the hardware PMCs affects the estimated power consumption.
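As a rough illustration of how simulator activity statistics can drive a measured power model, the sketch below maps a handful of gem5-style statistics to event rates and applies linear coefficients. The statistic names, counts, and coefficients are placeholders, not the model published in the paper.

```python
# Sketch of feeding simulator activity statistics into a PMC-style linear
# power model. Statistic names and coefficients below are placeholders.

# Hypothetical activity statistics dumped for one simulated interval
# (event counts) together with the simulated time in seconds.
sim_stats = {
    "sim_seconds": 0.010,
    "committedInsts": 12_000_000,
    "numCycles": 18_000_000,
    "l2_overall_misses": 40_000,
}

# Hypothetical fitted model: intercept (static) plus a weight per event rate.
MODEL = {
    "intercept_w": 0.35,
    "weights_w_per_event_per_s": {
        "numCycles": 2.6e-10,
        "committedInsts": 1.4e-10,
        "l2_overall_misses": 3.0e-8,
    },
}

def estimate_power_w(stats: dict) -> float:
    seconds = stats["sim_seconds"]
    power = MODEL["intercept_w"]
    for stat, weight in MODEL["weights_w_per_event_per_s"].items():
        power += weight * stats[stat] / seconds   # convert counts to rates
    return power

print(f"estimated average power: {estimate_power_w(sim_stats):.2f} W")
```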
ACM Transactions on Embedded Computing Systems | 2017
Ilias Vougioukas; Andreas Sandberg; Stephan Diestelhorst; Bashir M. Al-Hashimi
Heterogeneous multi-processors are designed to bridge the gap between performance and energy efficiency in modern embedded systems. This is achieved by pairing Out-of-Order (OoO) cores, yielding performance through aggressive speculation and latency masking, with In-Order (InO) cores that preserve energy through a simpler design. By leveraging migrations between them, workloads can therefore select the best setting for any given energy/delay envelope. However, migrations introduce execution overheads that can hurt performance if they happen too frequently. Finding the optimal migration frequency is critical to maximize energy savings while maintaining acceptable performance. We develop a simulation methodology that can 1) isolate the hardware effects of migrations from the software, 2) directly compare the performance of different core types, 3) quantify the performance degradation, and 4) calculate the cost of migrations for each case. To showcase our methodology, we run the MiBench benchmark suite and show that migrations can happen as fast as every 100k instructions with little performance loss. We also show that, contrary to numerous recent studies, hypothetical designs do not need to share all of their internal components to be able to migrate at that frequency. Instead, we propose a feasible system that shares level 2 caches and a translation lookaside buffer and that matches performance and efficiency. Our results show that in up to 10% of phases a migration to the OoO core delivers performance benefits without any additional energy cost compared to running on the InO core, and in up to 6% of phases a migration to the InO core can save energy without affecting performance. When considering a policy that focuses on improving the energy-delay product, results show that on average 66% of the phases can be migrated to deliver equal or better system operation without having to aggressively share the entire memory system or to resort to migration periods finer than 100k instructions.
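A minimal version of the energy-delay-product style policy the abstract alludes to can be sketched as follows: for each phase, pick the core whose predicted energy × delay is lowest, charging a fixed penalty when the choice requires a migration. All per-phase predictions and penalty values here are invented for illustration.

```python
# Toy phase-by-phase core-selection policy based on energy-delay product (EDP).
# Per-phase delay/power predictions and the migration penalties are invented.

MIGRATION_PENALTY_S = 1e-5   # hypothetical time cost of one migration
MIGRATION_PENALTY_J = 5e-5   # hypothetical energy cost of one migration

# For each ~100k-instruction phase: predicted (delay_s, power_w) on each core.
phases = [
    {"ooo": (1.0e-4, 1.8), "ino": (2.5e-4, 0.5)},
    {"ooo": (0.8e-4, 2.0), "ino": (1.0e-4, 0.6)},   # memory-bound: InO is close
    {"ooo": (1.1e-4, 1.7), "ino": (3.0e-4, 0.5)},
]

current = "ooo"
schedule = []
for phase in phases:
    best_core, best_edp = None, float("inf")
    for core, (delay, power) in phase.items():
        energy = delay * power
        if core != current:          # charge the cost of the switch itself
            delay += MIGRATION_PENALTY_S
            energy += MIGRATION_PENALTY_J
        edp = energy * delay
        if edp < best_edp:
            best_core, best_edp = core, edp
    schedule.append(best_core)
    current = best_core

print("per-phase core choice:", schedule)
```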
Archive | 2016
Matthew J. Walker; Stephan Diestelhorst; Andreas Hansson; Domenico Balsamo; Bashir M. Al-Hashimi
This dataset supports the paper entitled “Thermally-Aware Composite Run-Time CPU Power Models” accepted for PATMOS, 2016.
Archive | 2016
Andreas Hansson; Ashley John Crawford; Michael Andrew Campbell; Stephan Diestelhorst