Stijn Eyerman
Ghent University
Publications
Featured research published by Stijn Eyerman.
International Symposium on Microarchitecture | 2008
Stijn Eyerman; Lieven Eeckhout
Assessing the performance of multiprogram workloads running on multithreaded hardware is difficult because it involves a balance between single-program performance and overall system performance. This article argues for developing multiprogram performance metrics in a top-down fashion starting from system-level objectives. The authors propose two performance metrics: average normalized turnaround time, a user-oriented metric, and system throughput, a system-oriented metric.
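The two metrics reduce to simple ratios once per-program execution times are known. A minimal sketch, assuming each program's runtime has been measured both alone and within the multiprogram mix (variable names are illustrative, not from the paper):

```python
# t_single[i] = runtime of program i when run alone
# t_multi[i]  = runtime of program i in the multiprogram mix

def antt(t_single, t_multi):
    """Average normalized turnaround time (user-oriented; lower is better).

    Each program's turnaround time is normalized to its single-program
    runtime, then averaged across the n co-running programs.
    """
    n = len(t_single)
    return sum(tm / ts for ts, tm in zip(t_single, t_multi)) / n

def stp(t_single, t_multi):
    """System throughput (system-oriented; higher is better).

    Sums the per-program progress rates relative to isolated execution;
    equals n when multiprogramming causes no slowdown at all.
    """
    return sum(ts / tm for ts, tm in zip(t_single, t_multi))

# Example: two programs, each slowed down 2x by co-scheduling.
print(antt([1.0, 1.0], [2.0, 2.0]))  # 2.0 (average 2x slowdown)
print(stp([1.0, 1.0], [2.0, 2.0]))   # 1.0 (one program's worth of progress)
```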
ACM Transactions on Computer Systems | 2009
Stijn Eyerman; Lieven Eeckhout; Tejas Karkhanis; James E. Smith
A mechanistic model for out-of-order superscalar processors is developed and then applied to the study of microarchitecture resource scaling. The model divides execution time into intervals separated by disruptive miss events such as branch mispredictions and cache misses. Each type of miss event results in characterizable performance behavior for the execution time interval. By considering an interval's type and length (measured in instructions), execution time can be predicted for the interval. Overall execution time is then determined by aggregating the execution time over all intervals. The mechanistic model provides several advantages over prior modeling approaches, and, when estimating performance, it differs from detailed simulation of a 4-wide out-of-order processor by an average of 7%. The mechanistic model is applied to the general problem of resource scaling in out-of-order superscalar processors. First, we use the model to determine size relationships among microarchitecture structures in a balanced processor design. Second, we use the mechanistic model to study scaling of both pipeline depth and width in balanced processor designs. We corroborate previous results in this area and provide new results. For example, we show that at optimal design points, the pipeline depth times the square root of the processor width is nearly constant. Finally, we consider the behavior of unbalanced, overprovisioned processor designs based on insight gained from the mechanistic model. We show that in certain situations an overprovisioned processor may lead to improved overall performance. Designs where a processor's dispatch width is wider than its issue width are of particular interest.
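The first-order structure of the model lends itself to a compact sketch. The following is an illustration, not the paper's full model: it assumes per-event counts and penalties are given, and charges the base dispatch time plus a characteristic penalty per miss event:

```python
# Minimal interval-style cycle estimate (names and numbers are illustrative).

def interval_cycles(n_instructions, dispatch_width, miss_events):
    """Estimate execution cycles as base time plus miss-event penalties.

    Base time assumes the processor sustains its dispatch width between
    miss events; each miss event then adds a characteristic penalty
    (e.g., pipeline refill after a branch misprediction, memory latency
    for an isolated long-latency load).
    """
    base = n_instructions / dispatch_width
    penalty = sum(count * cost for count, cost in miss_events.values())
    return base + penalty

# Example: 1M instructions on a 4-wide core with two kinds of miss events,
# given as {event: (count, penalty_in_cycles)}.
cycles = interval_cycles(1_000_000, 4, {
    "branch_mispredict": (5_000, 15),   # front-end refill + resolution penalty
    "llc_load_miss":     (2_000, 200),  # off-chip memory latency
})
print(cycles)  # 725000.0
```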
International Symposium on Computer Architecture | 2010
Stijn Eyerman; Lieven Eeckhout
This paper presents a fundamental law for parallel performance: it shows that parallel performance is not only limited by sequential code (as suggested by Amdahl's law) but is also fundamentally limited by synchronization through critical sections. Extending Amdahl's software model to include critical sections, we derive the surprising result that the impact of critical sections on parallel performance can be modeled as a completely sequential part and a completely parallel part. The sequential part is determined by the probability for entering a critical section and the contention probability (i.e., multiple threads wanting to enter the same critical section). This fundamental result reveals at least three important insights for multicore design. (i) Asymmetric multicore processors deliver smaller performance benefits relative to symmetric processors than suggested by Amdahl's law, and in some cases even worse performance. (ii) Amdahl's law suggests many tiny cores for optimum performance in asymmetric processors; however, we find that fewer but larger small cores can yield substantially better performance. (iii) Executing critical sections on the big core can yield substantial speedups; however, performance is sensitive to the accuracy of the critical section contention predictor.
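A sketch of the speedup formula this result suggests, under illustrative parameter names (f_seq for the classic sequential fraction, p_cs for the probability of being inside a critical section, p_ctn for the contention probability); the exact formulation in the paper may differ:

```python
# The key result above: p_cs * p_ctn of the parallel work behaves as if
# fully sequential. Extended Amdahl's-law speedup on n symmetric cores.

def speedup(n_cores, f_seq, p_cs, p_ctn):
    f_cs_seq = p_cs * p_ctn                        # serialized critical-section work
    f_serial = f_seq + (1.0 - f_seq) * f_cs_seq    # total effectively sequential part
    f_par = (1.0 - f_seq) * (1.0 - f_cs_seq)       # remaining, fully parallel part
    return 1.0 / (f_serial + f_par / n_cores)

# Example: even with only 1% sequential code, frequent contended critical
# sections cap the speedup far below the classic Amdahl limit.
print(speedup(64, 0.01, 0.2, 0.5))  # ~8.1, vs. ~39 from Amdahl's law alone
```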
High-Performance Computer Architecture | 2010
Davy Genbrugge; Stijn Eyerman; Lieven Eeckhout
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs.
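The control flow of interval simulation can be sketched as follows; the component interface is hypothetical, but the structure shows how detailed simulation of the memory hierarchy and branch predictor drives an analytical core timing model:

```python
# Sketch of interval simulation: only core timing is analytical; caches,
# coherence, interconnect, and branch prediction are simulated in detail.
# (The `components` interface is an assumption for illustration.)

def simulate_core(instruction_stream, dispatch_width, components):
    """Advance time analytically between miss events instead of cycle by cycle."""
    cycles = 0.0
    interval_insns = 0
    for insn in instruction_stream:
        interval_insns += 1
        for component in components:  # e.g., caches, TLBs, branch predictor
            penalty = component.access(insn)  # 0 on a hit / correct prediction
            if penalty:
                # Close the current interval: base time at full dispatch
                # width, plus the penalty of the miss event that ended it.
                cycles += interval_insns / dispatch_width + penalty
                interval_insns = 0
    return cycles + interval_insns / dispatch_width
```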
ACM Transactions on Architecture and Code Optimization | 2014
Trevor E. Carlson; Wim Heirman; Stijn Eyerman; Ibrahim Hur; Lieven Eeckhout
Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.
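Why a one-IPC model can mislead is easy to see with a toy comparison: it ignores both the superscalar base rate and the overlap of outstanding misses (MLP), so its estimates diverge from an interval-style estimate as the miss rate changes. All numbers below are invented for illustration:

```python
# Toy contrast between a naive one-IPC core model and an interval-style
# model on the same miss profile (all parameters are illustrative).

def one_ipc_cycles(n_insns, misses, latency):
    # Naive: one instruction per cycle, every miss stalls for the full latency.
    return n_insns + misses * latency

def mlp_aware_cycles(n_insns, width, misses, latency, mlp):
    # Interval-style: superscalar base rate, and overlapping misses (MLP)
    # share the memory latency.
    return n_insns / width + (misses / mlp) * latency

n, w, lat = 1_000_000, 4, 200
for misses in (1_000, 10_000):
    naive = one_ipc_cycles(n, misses, lat)
    detailed = mlp_aware_cycles(n, w, misses, lat, mlp=2.0)
    # The ratio between the two models shifts with the miss rate, so design
    # conclusions drawn from the naive model need not hold.
    print(misses, naive / detailed)
```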
ACM Transactions on Architecture and Code Optimization | 2011
Stijn Eyerman; Lieven Eeckhout
Limit studies on Dynamic Voltage and Frequency Scaling (DVFS) provide apparently contradictory conclusions. On the one hand, early limit studies report that DVFS is effective at large timescales (on the order of million(s) of cycles) with large scaling overheads (on the order of tens of microseconds), and they conclude that there is no need for small-overhead DVFS at small timescales. Recent work, on the other hand, motivated by the surge of on-chip voltage regulator research, explores the potential of fine-grained DVFS and reports substantial energy savings at timescales of hundreds of cycles (while assuming no scaling overhead). This article unifies these apparently contradictory conclusions through a DVFS limit study that simultaneously explores timescale and scaling speed. We find that coarse-grained DVFS is unaffected by timescale and scaling speed; fine-grained DVFS, however, may lead to substantial energy savings for memory-intensive workloads. Inspired by these insights, we subsequently propose a fine-grained microarchitecture-driven DVFS mechanism that scales down voltage and frequency upon individual off-chip memory accesses using on-chip regulators. Fine-grained DVFS reduces energy consumption by 12% on average and up to 23% over a collection of memory-intensive workloads for an aggressively clock-gated processor, while incurring an average 0.08% performance degradation (and at most 0.14%). We also demonstrate that the proposed fine-grained DVFS mechanism is orthogonal to existing coarse-grained DVFS policies, and further reduces energy by 6% on average and up to 11% for memory-intensive applications with limited performance impact (at most 0.7%).
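A minimal sketch of such a microarchitecture-driven policy, assuming an on-chip regulator fast enough to switch per off-chip access (the interface and operating points below are illustrative, not the paper's):

```python
# Illustrative fine-grained DVFS policy: scale down while the core waits on
# off-chip memory, scale back up when useful work can resume.

HIGH_VF = (1.0, 3.0e9)  # nominal voltage (V) and frequency (Hz) -- assumed values
LOW_VF = (0.7, 1.5e9)   # low-power operating point used during memory stalls

class FineGrainedDVFS:
    def __init__(self, regulator):
        self.regulator = regulator  # hypothetical on-chip regulator interface
        self.outstanding = 0        # off-chip accesses currently in flight

    def on_offchip_miss(self):
        # Stall time is largely insensitive to clock frequency, so dropping
        # V/f during the miss saves energy with little performance loss.
        self.outstanding += 1
        if self.outstanding == 1:
            self.regulator.set(*LOW_VF)

    def on_miss_return(self):
        self.outstanding -= 1
        if self.outstanding == 0:
            self.regulator.set(*HIGH_VF)
```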
High-Performance Computer Architecture | 2007
Stijn Eyerman; Lieven Eeckhout
A thread executing on a simultaneous multithreading (SMT) processor that experiences a long-latency load will eventually stall while holding execution resources. Existing long-latency load aware SMT fetch policies limit the amount of resources allocated by a stalled thread by identifying long-latency loads and preventing the given thread from fetching more instructions - and in some implementations, instructions beyond the long-latency load may even be flushed, which frees allocated resources. This paper proposes an SMT fetch policy that takes into account the available memory-level parallelism (MLP) in a thread. The key idea proposed in this paper is that in case of an isolated long-latency load, i.e., there is no MLP, the thread should be prevented from allocating additional resources. However, in case multiple independent long-latency loads overlap, i.e., there is MLP, the thread should allocate as many resources as needed in order to fully expose the available MLP. The proposed MLP-aware fetch policy achieves better performance for MLP-intensive threads on an SMT processor and achieves a better overall balance between performance and fairness than previously proposed fetch policies.
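The decision logic can be sketched as follows; the predictor interface is hypothetical, but it captures the key idea of stalling fetch only for isolated misses:

```python
# Illustrative MLP-aware fetch gate for one SMT thread. All methods on
# `thread` are assumed interfaces, not the paper's actual hardware design.

def fetch_allowed(thread):
    if not thread.has_pending_long_latency_load():
        return True  # no long-latency miss outstanding: fetch normally
    predicted_mlp = thread.predict_mlp()  # expected overlapping long-latency loads
    if predicted_mlp <= 1:
        # Isolated miss (no MLP): stop fetching so the stalled thread does
        # not hoard resources needed by co-running threads.
        return False
    # MLP present: keep allocating resources until enough instructions are
    # in flight to issue all the overlapping loads, then stall.
    return thread.instructions_in_flight() < predicted_mlp * thread.avg_load_distance()
```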
IEEE Transactions on Computers | 2010
Stijn Eyerman; Lieven Eeckhout
Dynamic voltage and frequency scaling (DVFS) is a well-known and effective technique for reducing power consumption in modern microprocessors. An important concern, though, is to estimate its profitability in terms of performance and energy. Current DVFS profitability estimation approaches, however, lack accuracy or incur runtime performance and/or energy overhead. This paper proposes a counter architecture for online DVFS profitability estimation on superscalar out-of-order processors. The counter architecture teases apart the fraction of the execution time that is susceptible to clock frequency versus the fraction that is insusceptible to clock frequency. By doing so, the counter architecture can accurately estimate the performance and energy consumption at different V/f operating points from a single program execution. The DVFS counter architecture estimates performance, energy consumption, and energy-delay-squared product (ED2P) within 0.2, 0.5, and 0.8 percent on average, respectively, over a 4x frequency range. Further, the counter architecture incurs a small hardware cost and is an enabler for online DVFS scheduling both at the intracore as well as at the intercore level in a multicore processor.
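The kind of estimation these counters enable can be sketched as follows, assuming the hardware has decomposed a run at the base frequency into clock-scalable cycles and frequency-insensitive time (constants and names are illustrative):

```python
# Sketch: predict performance and energy at a new V/f point from one run.
#   c_scaling = cycles that scale with the clock (compute-bound work)
#   t_fixed   = time insensitive to the clock (e.g., off-chip misses), in s

def predict_time(c_scaling, t_fixed, f_new):
    # Clock-dependent work stretches/shrinks with frequency;
    # memory-bound time stays constant.
    return c_scaling / f_new + t_fixed

def predict_energy(c_scaling, t_fixed, f_new, v_new, cap=1.0, p_static=0.5):
    # Dynamic energy ~ C * V^2 per switched cycle; static power accrues
    # over the whole predicted execution time. Constants are illustrative.
    dynamic = cap * v_new**2 * c_scaling
    static = p_static * predict_time(c_scaling, t_fixed, f_new)
    return dynamic + static

# Example: a single run at 2 GHz yields the two components; time and energy
# at 1 GHz follow without re-running the program.
print(predict_time(1.0e9, 0.2, 2.0e9))  # 0.7 s at the measured 2 GHz point
print(predict_time(1.0e9, 0.2, 1.0e9))  # 1.2 s predicted at 1 GHz
```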
Architectural Support for Programming Languages and Operating Systems | 2009
Stijn Eyerman; Lieven Eeckhout
This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times for each of the threads had they been executed alone, while they are running simultaneously on the SMT processor. This is done by accounting each cycle to either a base, miss-event, or waiting cycle component during multi-threaded execution. Single-threaded alone execution time is then estimated as the sum of the base and miss-event components; the waiting cycle component represents the cycle count lost due to SMT execution. The cycle accounting architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded performance with average prediction errors of around 7.2% for two-program workloads and 11.7% for four-program workloads. The cycle accounting architecture has several important applications to system software and its interaction with SMT hardware. For one, the estimated single-threaded alone execution time gives system software an accurate picture of the processor cycles actually consumed per thread. Using the alone execution time instead of the total execution time (timeslice) may make system software scheduling policies more effective. Second, a new class of thread-progress aware SMT fetch policies based on per-thread progress indicators enables system software level priorities to be enforced at the hardware level.
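In outline, the accounting yields per-thread estimates like the following sketch, which assumes the hardware has already classified every cycle (the data layout is illustrative):

```python
# counters[tid] = per-thread cycle components classified by the hardware:
#   "base"    = cycles of useful progress
#   "miss"    = cycles lost to the thread's own miss events
#   "waiting" = cycles lost to interference from co-running SMT threads

def alone_time_estimate(counters, tid):
    """Estimated single-threaded execution time: base plus miss-event cycles.
    Waiting cycles are attributed to SMT interference and excluded."""
    c = counters[tid]
    return c["base"] + c["miss"]

def smt_slowdown(counters, tid):
    """How much longer the thread ran under SMT than it would have alone."""
    c = counters[tid]
    total = c["base"] + c["miss"] + c["waiting"]
    return total / (c["base"] + c["miss"])

counters = {0: {"base": 800_000, "miss": 150_000, "waiting": 250_000}}
print(alone_time_estimate(counters, 0))  # 950000 cycles if run alone
print(smt_slowdown(counters, 0))         # ~1.26x slowdown from co-runners
```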
Design, Automation, and Test in Europe | 2006
Stijn Eyerman; Lieven Eeckhout; K. De Bosschere
Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies, which subdivide the search space into independent components, are unlikely to be effective in terms of accuracy for designing out-of-order processors. In this paper we propose and evaluate various automated single- and multi-objective optimizations for exploring out-of-order processor designs. We conclude that the newly proposed genetic local search algorithm outperforms all other search algorithms in terms of accuracy. In addition, we propose two-phase simulation, in which the first phase explores the design space through statistical simulation; a region of interest is then simulated through detailed simulation in the second phase. We show that simulation time speedups of a factor of 2.2x to 7.3x can be obtained using two-phase simulation.
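The two-phase flow can be sketched as follows, with hypothetical simulator interfaces; phase 1 ranks the whole space cheaply, and phase 2 spends detailed-simulation time only on the region of interest:

```python
# Sketch of two-phase design-space exploration. `statistical_sim` and
# `detailed_sim` are assumed to be callables that score a design (lower is
# better, e.g., an energy-delay product); both are illustrative interfaces.

def two_phase_explore(design_space, statistical_sim, detailed_sim, region_size=20):
    # Phase 1: cheap statistical-simulation estimates for every candidate
    # out-of-order design, used only to rank them.
    ranked = sorted(design_space, key=statistical_sim)
    # Phase 2: slow but accurate detailed simulation, restricted to the
    # region of interest identified in phase 1.
    region = ranked[:region_size]
    return min(region, key=detailed_sim)
```

The speedup comes from amortization: the detailed simulator, which dominates evaluation time, runs on only a small fraction of the candidate designs.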