David Eklov
Uppsala University
Publication
Featured research published by David Eklov.
international symposium on performance analysis of systems and software | 2010
David Eklov; Erik Hagersten
Efficient execution on modern architectures requires good data locality, which can be measured by the powerful stack distance abstraction. Based on this abstraction, the miss rate for LRU caches of any size can be predicted. However, measuring stack distance requires the number of unique memory objects to be counted between successive accesses to the same data object, which requires complex and inefficient data collection.
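The stack distance definition above can be illustrated with a small sketch. This is a naive, exact computation written only to show the abstraction; it is not the measurement method of the paper, whose abstract notes that exact collection is expensive. The function names, the toy trace, and the assumption of a fully associative LRU cache measured in whole objects are all hypothetical choices made for this example.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the stack distance of every access in the trace.

    The stack distance is the number of unique objects touched between
    two successive accesses to the same object. First-time accesses
    have no previous access, so they are reported as None.
    """
    lru = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for obj in trace:
        if obj in lru:
            keys = list(lru.keys())
            # Objects to the right of obj were touched since its last access.
            distances.append(len(keys) - 1 - keys.index(obj))
            lru.move_to_end(obj)
        else:
            distances.append(None)
            lru[obj] = True
    return distances


def lru_miss_ratio(distances, cache_size):
    """Predict the miss ratio of a fully associative LRU cache.

    An access misses if it is a first-time access or if its stack
    distance is at least cache_size (the object was already evicted).
    """
    misses = sum(1 for d in distances if d is None or d >= cache_size)
    return misses / len(distances)


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "b", "d", "a"]  # hypothetical trace
    dists = stack_distances(trace)
    for size in (1, 2, 3, 4):
        print(size, lru_miss_ratio(dists, size))
```

Because the distances are computed once and then reused for every cache size, the same profile predicts the miss ratio of any LRU cache, which is the property the abstract highlights. The list scan on every access makes this sketch quadratic in the worst case, which hints at why exact collection is considered inefficient.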
ieee international conference on high performance computing data and analytics | 2010
Andreas Sandberg; David Eklov; Erik Hagersten
Contention for shared cache resources has been recognized as a major bottleneck for multicores, especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage categories. We discuss how applications from different categories affect each other's performance indirectly through cache sharing and devise a scheme to optimize such sharing. We also propose a low-overhead method to automatically find the best per-instruction cache management policy. We demonstrate how the indirect cache-sharing effects of mixed workloads can be tamed by automatically altering some instructions to better manage cache resources. Practical experiments demonstrate that our software-only method can improve application performance by up to 35% on x86 multicore hardware.
ieee international symposium on workload characterization | 2011
Andreas Sembrant; David Eklov; Erik Hagersten
Many programs exhibit execution phases with time-varying behavior. Phase detection has been used extensively to find short, representative simulation points that quickly yield representative simulation results for long-running applications. Several hardware-assisted phase detection schemes have also been proposed to guide various forms of optimization and hardware configuration.
symposium on code generation and optimization | 2013
David Black-Schaffer; Nikos Nikoleris; Erik Hagersten; David Eklov
On multicore processors, co-executing applications compete for shared resources, such as cache capacity and memory bandwidth. This leads to suboptimal resource allocation and can cause substantial performance loss, which makes it important to effectively manage these shared resources. This, however, requires insights into how the applications are impacted by such resource sharing. While there are several methods to analyze the performance impact of cache contention, less attention has been paid to general, quantitative methods for analyzing the impact of contention for memory bandwidth. To this end, we introduce the Bandwidth Bandit, a general, quantitative profiling method for analyzing the performance impact of contention for memory bandwidth on multicore machines. The profiling data captured by the Bandwidth Bandit is presented in a bandwidth graph. This graph accurately captures the measured application's performance as a function of its available memory bandwidth, and enables us to determine how much the application suffers when its available bandwidth is reduced. To demonstrate the value of this data, we present a case study in which we use the bandwidth graph to analyze the performance impact of memory contention when co-running multiple instances of a single-threaded application.
international conference on parallel architectures and compilation techniques | 2010
David Eklov; David Black-Schaffer; Erik Hagersten
Chip multiprocessor (CMP) architectures sharing on-chip resources, such as last-level caches, have recently become a mainstream computing platform. The performance of such systems can vary greatly depending on how co-scheduled applications compete for these shared resources. This work presents StatCC, a simple and efficient model for estimating the contention for shared cache resources between co-scheduled applications on chip multiprocessor architectures.
international symposium on performance analysis of systems and software | 2014
Nikos Nikoleris; David Eklov; Erik Hagersten
Simulators are widely used in computer architecture research. While detailed cycle-accurate simulations provide useful insights, studies using modern workloads typically require days or weeks. Evaluating many design points only exacerbates the simulation overhead. Recent works propose methods with good accuracy that reduce the simulation overhead either by sampling the execution (e.g., SMARTS and SimPoint) or by using fast analytical models of the simulated designs (e.g., Interval Simulation). While these techniques significantly reduce the simulation overhead, modeling processor components with large state, such as the last-level cache, requires costly simulation to warm them up. Statistical simulation methods, such as SMARTS, report that the warm-up overhead accounts for 99% of the simulation overhead, while only 1% of the time is spent simulating the target design. This paper proposes WarmSim, a method that eliminates the need to warm up the cache. WarmSim builds on top of a statistical cache modeling technique and extends it to accurately model not only the miss ratio but also the outcome of every cache request. WarmSim uses as input an application's memory reuse information, which is hardware independent. Therefore, different cache configurations can be simulated using the same input data. We demonstrate that this approach can be used to estimate the CPI of the SPEC CPU2006 benchmarks with an average error of 1.77%, reducing the overhead compared to a simulation with a 10M instruction warm-up by a factor of 50.
international symposium on performance analysis of systems and software | 2014
David Eklov; Nikos Nikoleris; Erik Hagersten
A key goodness metric of multi-threaded programs is how their execution times scale when increasing the number of threads. However, there are several bottlenecks that can limit the scalability of a multi-threaded program, e.g., contention for shared cache capacity and off-chip memory bandwidth, and synchronization overheads. In order to improve the scalability of a multi-threaded program, it is vital to be able to quantify how the program is impacted by these scalability bottlenecks. We present a software profiling method for obtaining speedup stacks. A speedup stack reports how much each scalability bottleneck limits the scalability of a multi-threaded program. It thereby quantifies how much the program's scalability can be improved by eliminating a given bottleneck. A software developer can use this information to determine which optimizations are most likely to improve scalability, while a computer architect can use it to analyze the resource demands of emerging workloads. The proposed method profiles the program on real commodity multi-cores (i.e., no simulations required) using existing performance counters. Consequently, the obtained speedup stacks accurately account for all idiosyncrasies of the machine on which the program is profiled. While the main contribution of this paper is the profiling method to obtain speedup stacks, we present several examples of how speedup stacks can be used to analyze the resource requirements of multi-threaded programs. Furthermore, we discuss how their scalability can be improved by both software developers and computer architects.
Archive | 2010
Erik Hagersten; David Eklov; David Black-Schaffer
Obtaining good application performance requires tuning for effective use of the cache hierarchy. However, most tools to analyze cache usage either generate architecture-specific results (e.g., hardware performance counters) or incur prohibitively high overheads for real-world workloads (e.g., trace-based simulations). This chapter reviews several recently introduced techniques that address these issues to efficiently model cache systems and coherent memory hierarchies in an architecturally independent manner. The techniques utilize only sparse, architecturally independent runtime information that can be collected with an overhead of 10–30%. This information is then processed by statistical models to quickly predict cache behavior across a range of architectures. With these approaches, accurate modeling is possible from data sampled at rates as low as 1 in 10^6 memory accesses.
international conference on parallel processing | 2011
David Eklov; Nikos Nikoleris; David Black-Schaffer; Erik Hagersten
high performance embedded architectures and compilers | 2011
David Eklov; David Black-Schaffer; Erik Hagersten