Publication


Featured research published by Trevor E. Carlson.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation

Trevor E. Carlson; Wim Heirman; Lieven Eeckhout

Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25% for a variety of multi-threaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
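
The central idea behind interval simulation can be illustrated with a small back-of-the-envelope model: the core is assumed to dispatch instructions at its full width until a miss event (a branch misprediction or a cache miss) interrupts the flow and adds a penalty interval. The sketch below is a minimal, illustrative Python version of that idea, not Sniper's actual model; all widths, counts, and penalties are assumed values.

```python
# Toy illustration of interval-style performance modeling (not Sniper's code).
# The core dispatches 'width' instructions per cycle until a miss event
# (branch mispredict, cache miss) interrupts the smooth flow and adds a
# penalty interval. All numbers below are illustrative assumptions.

def interval_cycles(instructions, width, miss_events):
    """Estimate execution cycles from instruction count and miss events.

    miss_events: list of (count, penalty_cycles) pairs, one per event type.
    """
    base = instructions / width                      # smooth dispatch, no stalls
    stalls = sum(count * penalty for count, penalty in miss_events)
    return base + stalls

if __name__ == "__main__":
    cycles = interval_cycles(
        instructions=1_000_000,
        width=4,                                     # 4-wide dispatch (assumed)
        miss_events=[
            (5_000, 15),                             # branch mispredicts, ~15-cycle penalty (assumed)
            (2_000, 200),                            # LLC misses, ~200-cycle memory latency (assumed)
        ],
    )
    print(f"estimated cycles: {cycles:.0f}, CPI: {cycles / 1_000_000:.2f}")
```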


ACM Transactions on Architecture and Code Optimization | 2014

An Evaluation of High-Level Mechanistic Core Models

Trevor E. Carlson; Wim Heirman; Stijn Eyerman; Ibrahim Hur; Lieven Eeckhout

Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these is important for memory subsystem studies, and that simple, naive models, like a one-IPC core model, can lead to misleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1% and a maximum of 18.8% for the IW-centric model with a 1.5× slowdown compared to interval simulation.
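
To see why a naive one-IPC model can mislead memory-subsystem studies, consider a toy comparison: a one-IPC model serializes the full memory latency behind every miss, whereas a mechanistic model lets overlapping misses share the penalty through memory-level parallelism. The sketch below is illustrative only (it is not the IW-centric model, and all parameters are assumed); it simply shows how the two abstractions weigh a cache change very differently.

```python
# Toy contrast between a naive one-IPC core model and a simple mechanistic
# model that credits memory-level parallelism (MLP). Not the IW-centric model
# itself; all parameters are illustrative assumptions.

def one_ipc_cycles(instructions, llc_misses, mem_latency):
    # Every instruction takes one cycle; every miss serializes the full latency.
    return instructions + llc_misses * mem_latency

def mechanistic_cycles(instructions, width, llc_misses, mem_latency, mlp):
    # Dispatch at 'width' per cycle; overlapping misses share the penalty.
    return instructions / width + (llc_misses / mlp) * mem_latency

if __name__ == "__main__":
    insn = 1_000_000
    for label, misses in [("small cache", 20_000), ("large cache", 5_000)]:
        naive = one_ipc_cycles(insn, misses, mem_latency=200)
        mech = mechanistic_cycles(insn, width=4, llc_misses=misses,
                                  mem_latency=200, mlp=4)
        print(f"{label:11s}  one-IPC: {naive:>9.0f} cycles   mechanistic: {mech:>9.0f} cycles")
```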


International Symposium on Performance Analysis of Systems and Software | 2013

Sampled simulation of multi-threaded applications

Trevor E. Carlson; Wim Heirman; Lieven Eeckhout

Sampling is a well-known workload reduction technique that allows one to speed up architectural simulation while accurately predicting performance. Previous sampling methods have been shown to accurately predict a single-threaded application's runtime based on its overall IPC. However, these previous approaches are unsuitable for general multi-threaded applications, for which IPC is not a good proxy for runtime. Additionally, we find that issues such as application periodicity and inter-thread synchronization play a significant role in determining how best to sample these applications. The proposed multi-threaded application sampling methodology is able to derive an effective sampling strategy for candidate applications using architecture-independent metrics. Using this methodology, large input sets that would otherwise be infeasible to simulate can now be used, allowing for more accurate conclusions than studies based on scaled-down input sets. Through the use of the proposed methodology, we can simulate less than 10% of the total application runtime in detail. On the SPEComp, NPB and PARSEC benchmarks, running on an 8-core simulated system, we achieve an average absolute error of 3.5%.
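
A common way to realize such a sampling strategy is to summarize each candidate region with an architecture-independent feature vector (for example, its instruction mix), group regions that look alike, and simulate only one region per group in detail. The following sketch shows that grouping step under those assumptions; it is not the paper's exact methodology, and the feature vectors are made-up placeholders.

```python
# Sketch of selecting representative regions for sampled simulation using
# architecture-independent per-region feature vectors (e.g., instruction mix).
# Regions that look alike are grouped; only one region per group is simulated
# in detail, and its result is extrapolated to the rest. Illustrative only.

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pick_representatives(features, threshold=0.1):
    """Greedy clustering: a region joins the first group whose representative
    is within 'threshold'; otherwise it starts a new group."""
    groups = []  # list of (representative_index, [member_indices])
    for i, f in enumerate(features):
        for rep, members in groups:
            if distance(f, features[rep]) < threshold:
                members.append(i)
                break
        else:
            groups.append((i, [i]))
    return groups

if __name__ == "__main__":
    # Feature vectors per region: e.g., fractions of loads, stores, branches (assumed).
    features = [(0.30, 0.10, 0.15), (0.31, 0.11, 0.14),
                (0.05, 0.02, 0.40), (0.29, 0.10, 0.16)]
    for rep, members in pick_representatives(features):
        print(f"simulate region {rep} in detail, extrapolate to regions {members}")
```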


Design, Automation, and Test in Europe | 2009

System-level power/performance evaluation of 3D stacked DRAMs for mobile applications

Marco Facchini; Trevor E. Carlson; Anselme Vignon; Martin Palkovic; Francky Catthoor; Wim Dehaene; Luca Benini; Paul Marchal

Convergence of communication, consumer applications and computing within mobile systems pushes memory requirements in terms of size, bandwidth, and power consumption. The existing solution for the memory bottleneck is to increase the amount of on-chip memory. However, this solution is becoming prohibitively expensive, allowing 3D stacked DRAM to become an interesting alternative for mobile applications. In this paper, we examine the power/performance benefits of three different 3D stacked DRAM scenarios. Our high-level memory and Through Silicon Via (TSV) models have been calibrated on state-of-the-art industrial processes. We model the integration of a logic die with TSVs on top of both an existing DRAM and a DRAM with redesigned transceivers for 3D. Finally, we take advantage of the interconnect density enabled by 3D technology to analyze an ultra-wide memory interface. Experimental results confirm that TSV-based 3D integration is a promising technology option for future mobile applications, and that its full potential can be unleashed by jointly optimizing memory architecture and interface logic.
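
At a first-order level, the interface energy trade-off can be framed as bits transferred per access times energy per bit of the chosen interconnect, which is where a TSV-based ultra-wide interface gains its advantage. The sketch below expresses only that framing; the pJ/bit figures are placeholder assumptions, not the calibrated model data used in the paper.

```python
# First-order sketch of memory interface energy: energy per access is roughly
# bits_per_access * energy_per_bit for the chosen interconnect. The pJ/bit
# values below are placeholders, not calibrated model data from the paper.

def access_energy_pj(bits_per_access, pj_per_bit):
    return bits_per_access * pj_per_bit

if __name__ == "__main__":
    bits = 512  # one 64-byte cache line per access (assumed)
    interfaces = {
        "off-chip mobile DRAM bus (assumed ~20 pJ/bit)": 20.0,
        "TSV-based wide interface (assumed ~1 pJ/bit)": 1.0,
    }
    for name, pj_bit in interfaces.items():
        print(f"{name}: {access_energy_pj(bits, pj_bit):.0f} pJ per 64B access")
```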


International Conference on Parallel Architectures and Compilation Techniques | 2012

Power-aware multi-core simulation for early design stage hardware/software co-optimization

Wim Heirman; Souradip Sarkar; Trevor E. Carlson; Ibrahim Hur; Lieven Eeckhout

Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the case for hardware/software co-design during early design stages of processors for scientific computing applications. Considering an important scientific kernel, namely stencil computation, we demonstrate that performance and energy-efficiency can be improved by a factor of 1.66× and 1.25×, respectively, by co-optimizing hardware and software. To enable hardware/software co-design in early stages of the design cycle, we propose a novel simulation infrastructure by combining high-abstraction performance simulation using Sniper with power modeling using McPAT and custom DRAM power models. Sniper/McPAT is fast (simulation speed is around 2 MIPS on an 8-core host machine) because it uses analytical modeling to abstract away core performance during multi-core simulation. We demonstrate Sniper/McPAT's accuracy through validation against real hardware; we report average performance and power prediction errors of 22.1% and 8.3%, respectively, for a set of SPEComp benchmarks.
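
Conceptually, coupling a performance simulator with a power model amounts to converting per-interval activity statistics into power and integrating energy over each interval's duration. The sketch below illustrates that coupling in isolation; it does not use Sniper's or McPAT's interfaces, and the static power and per-instruction energy coefficients are assumed values.

```python
# Sketch of coupling a performance estimate with a simple activity-based power
# model to obtain energy: per-interval activity counts are converted to power,
# then integrated over the interval's duration. Coefficients are illustrative
# assumptions, not McPAT output.

def interval_power_w(static_w, energy_per_insn_nj, insn_count, seconds):
    dynamic_w = (insn_count * energy_per_insn_nj * 1e-9) / seconds
    return static_w + dynamic_w

if __name__ == "__main__":
    total_energy_j = 0.0
    # (instructions executed, cycles) per simulated interval, assumed values
    intervals = [(2_000_000, 3_000_000), (500_000, 3_000_000)]
    freq_hz = 3e9
    for insns, cycles in intervals:
        seconds = cycles / freq_hz
        power = interval_power_w(static_w=2.0, energy_per_insn_nj=0.5,
                                 insn_count=insns, seconds=seconds)
        total_energy_j += power * seconds
        print(f"interval: {power:.2f} W over {seconds * 1e3:.2f} ms")
    print(f"total energy: {total_energy_j:.4f} J")
```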


IEEE International Symposium on Workload Characterization | 2011

Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads

Wim Heirman; Trevor E. Carlson; Shuai Che; Kevin Skadron; Lieven Eeckhout

This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: understanding how cycle components scale with increasing core counts and/or input data set sizes leads to insight with respect to scaling bottlenecks due to synchronization, load imbalance, poor memory performance, etc. We present several case studies illustrating the use of cycle stacks. As a subsequent step, we further extend the methodology to analyze sets of parallel workloads using statistical data analysis, and perform a workload characterization to understand behavioral differences across benchmark suites. We analyze the SPLASH-2, PARSEC and Rodinia benchmark suites and conclude that the three benchmark suites cover similar areas in the workload space. However, scaling behavior of these benchmarks towards larger input sets and/or higher core counts is highly dependent on the benchmark, the way in which the inputs have been scaled, and on the machine configuration.
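
A cycle stack attributes every cycle to a component such as compute, memory, synchronization, or load imbalance; comparing normalized stacks across core counts then shows which component grows its share and is therefore the scaling bottleneck. The sketch below illustrates that comparison with assumed cycle counts, not measured data.

```python
# Sketch of building and comparing normalized cycle stacks across core counts.
# Each stack attributes all cycles to a component; the component whose share
# grows with the core count points at the scaling bottleneck. Counts below are
# illustrative assumptions.

def normalize(stack):
    total = sum(stack.values())
    return {component: cycles / total for component, cycles in stack.items()}

if __name__ == "__main__":
    stacks = {
        1: {"compute": 80, "memory": 15, "sync": 3,  "imbalance": 2},
        8: {"compute": 80, "memory": 30, "sync": 25, "imbalance": 10},
    }
    for cores, stack in stacks.items():
        shares = normalize(stack)
        breakdown = ", ".join(f"{name} {share:.0%}" for name, share in shares.items())
        print(f"{cores} cores: {breakdown}")
```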


International Symposium on Performance Analysis of Systems and Software | 2014

BarrierPoint: Sampled simulation of multi-threaded applications

Trevor E. Carlson; Wim Heirman; Kenzo Van Craeynest; Lieven Eeckhout

Sampling is a well-known technique to speed up architectural simulation of long-running workloads while maintaining accurate performance predictions. A number of sampling techniques have recently been developed that extend well-known single-threaded techniques to allow sampled simulation of multi-threaded applications. Unfortunately, prior work is limited to non-synchronizing applications (e.g., server throughput workloads); requires the functional simulation of the entire application using a detailed cache hierarchy, which limits the overall simulation speedup potential; leads to different units of work across different processor architectures, which complicates performance analysis; or requires massive machine resources to achieve reasonable simulation speedups. In this work, we propose BarrierPoint, a sampling methodology to accelerate simulation by leveraging globally synchronizing barriers in multi-threaded applications. BarrierPoint collects microarchitecture-independent code and data signatures to determine the most representative inter-barrier regions, called barrierpoints. BarrierPoint estimates total application execution time (and other performance metrics of interest) through detailed simulation of these barrierpoints only, leading to substantial simulation speedups. Barrierpoints can be simulated in parallel, use fewer simulation resources, and define fixed units of work to be used in performance comparisons across processor architectures. Our evaluation of BarrierPoint using NPB and PARSEC benchmarks reports average simulation speedups of 24.7× (and up to 866.6×) with an average simulation error of 0.9% and 2.9% at most. On average, BarrierPoint reduces the number of simulation machine resources needed by 78×.
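
The extrapolation step can be pictured as follows: regions that share a signature are grouped, one representative per group is simulated in detail, and its per-instruction time is applied to the group's total instruction count. The sketch below captures only that bookkeeping; the signatures and per-instruction times are assumed placeholders, not output of the BarrierPoint tooling.

```python
# Sketch of the barrier-region extrapolation idea: inter-barrier regions with
# similar microarchitecture-independent signatures are grouped, one region per
# group is simulated in detail, and its per-instruction time is applied to the
# whole group. Signatures and timings below are illustrative assumptions.

from collections import defaultdict

def estimate_runtime(regions, detailed_time):
    """regions: list of (signature, instruction_count);
    detailed_time: {signature: seconds_per_instruction} from the detailed runs."""
    groups = defaultdict(int)
    for signature, insns in regions:
        groups[signature] += insns
    return sum(insns * detailed_time[sig] for sig, insns in groups.items())

if __name__ == "__main__":
    regions = [("A", 1_000_000), ("A", 1_100_000), ("B", 400_000), ("A", 900_000)]
    detailed_time = {"A": 0.4e-9, "B": 1.2e-9}  # seconds/instruction (assumed)
    print(f"estimated runtime: {estimate_runtime(regions, detailed_time) * 1e3:.2f} ms")
```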


High-Performance Computer Architecture | 2014

Undersubscribed threading on clustered cache architectures

Wim Heirman; Trevor E. Carlson; Kenzo Van Craeynest; Ibrahim Hur; Aamer Jaleel; Lieven Eeckhout

Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth, and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis, we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST), which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.
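
The undersubscription decision itself can be summarized as: activate only as many threads as the shared cache capacity and the off-chip bandwidth budget can support. The heuristic below is a simplified illustration in that spirit, not CRUST's actual dynamic mechanism, and its working-set and bandwidth inputs are assumed values.

```python
# Sketch of an undersubscription heuristic in the spirit of cache- and
# bandwidth-aware thread scheduling: activate only as many threads as fit the
# shared cache and the off-chip bandwidth budget. Inputs are illustrative
# assumptions, not CRUST's actual runtime measurements.

def pick_thread_count(max_threads, ws_per_thread_mb, cache_mb,
                      bw_per_thread_gbs, bw_budget_gbs):
    for threads in range(max_threads, 0, -1):
        fits_cache = threads * ws_per_thread_mb <= cache_mb
        fits_bw = threads * bw_per_thread_gbs <= bw_budget_gbs
        if fits_cache and fits_bw:
            return threads
    return 1

if __name__ == "__main__":
    n = pick_thread_count(max_threads=16, ws_per_thread_mb=6.0, cache_mb=32.0,
                          bw_per_thread_gbs=8.0, bw_budget_gbs=40.0)
    print(f"run with {n} of 16 threads active")
```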


2009 IEEE International Conference on 3D System Integration | 2009

Automated Pathfinding tool chain for 3D-stacked integrated circuits: Practical case study

Dragomir Milojevic; Trevor E. Carlson; Kris Croes; Riko Radojcic; Diana F. Ragett; Dirk Seynhaeve; Federico Angiolini; Geert Van der Plas; Pol Marchal

New technologies for manufacturing 3D stacked ICs offer numerous opportunities for the design of complex and efficient embedded systems. However, these technologies also introduce many design options at the system/chip level that are hard to grasp across the complete design cycle. Because of the sequential nature of current design practices, designers are often forced to introduce design margins to meet the required specifications, resulting in sub-optimal designs. In this paper we introduce a new design methodology and practical tool chain, called the PathFinding Flow, that helps designers trade off between different system-level design choices, physical design and/or technology options, and understand their impact on typical design parameters such as cost, performance, and power. The proposed methodology and tool chain are demonstrated on a practical case study involving a fairly complex Multi-Processor System-on-Chip that uses a Network-on-Chip as its communication medium. With this example we show how High-Level Synthesis can be used to move quickly from high-level to RTL models, which are necessary for accurate physical prototyping of both computation and communication. We also show how design iteration, through feedback based on information from physical prototyping, can improve design performance. Finally, we show how a design can quickly be moved from traditional 2D to 3D, and how the benefits of such a design choice can be measured.


International Symposium on Performance Analysis of Systems and Software | 2015

Micro-architecture independent analytical processor performance and power modeling

Sam Van den Steen; Sander De Pestel; Moncef Mechri; Stijn Eyerman; Trevor E. Carlson; David Black-Schaffer; Erik Hagersten; Lieven Eeckhout

Optimizing processors for specific application(s) can substantially improve energy-efficiency. With the end of Dennard scaling, and the corresponding reduction in energy-efficiency gains from technology scaling, such approaches may become increasingly important. However, designing application-specific processors requires fast design space exploration tools to optimize for the targeted application(s). Analytical models can be a good fit for such design space exploration, as they provide fast performance estimations and insight into the interaction between an application's characteristics and the micro-architecture of a processor. Unfortunately, current analytical models require some micro-architecture-dependent inputs, such as cache miss rates, branch miss rates and memory-level parallelism. This requires profiling the applications for each cache and branch predictor configuration, which is far more time-consuming than evaluating the actual performance models. In this work we present a micro-architecture independent profiler and associated analytical models that allow us to produce performance and power estimates across a large design space almost instantaneously. We show that using a micro-architecture independent profile leads to a speedup of 25× for our evaluated design space, compared to an analytical model that uses micro-architecture-dependent profiles. Over a large design space, the model has a 13% error for performance and a 7% error for power, compared to cycle-level simulation. The model is able to accurately determine the optimal processor configuration for different applications under power or performance constraints, and it can provide insight into performance through cycle stacks.
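
One way such a micro-architecture independent profile can drive a design sweep is through a reuse-distance histogram: the miss rate for any cache size follows from the fraction of accesses whose reuse distance exceeds that size, and the result feeds an interval-style CPI expression per candidate configuration. The sketch below shows that chain under a simple fully-associative LRU assumption; the histogram and model coefficients are illustrative, not the paper's profiler output.

```python
# Sketch of a micro-architecture independent sweep: a reuse-distance histogram
# (profiled once) predicts the miss rate for any cache size, which then feeds
# an interval-style CPI estimate per candidate configuration. The histogram and
# coefficients are illustrative assumptions.

def miss_rate(reuse_hist, cache_lines):
    """reuse_hist: {reuse_distance_in_lines: access_count}; a reference misses
    if its reuse distance exceeds the cache capacity (fully-assoc. LRU model)."""
    total = sum(reuse_hist.values())
    misses = sum(count for dist, count in reuse_hist.items() if dist >= cache_lines)
    return misses / total

def cpi(width, mem_accesses_per_insn, mr, mem_latency, mlp):
    return 1.0 / width + mem_accesses_per_insn * mr * mem_latency / mlp

if __name__ == "__main__":
    reuse_hist = {64: 500, 1_024: 300, 16_384: 150, 1_000_000: 50}  # assumed profile
    for kb in (32, 256, 2_048):
        lines = kb * 1024 // 64                      # 64-byte cache lines (assumed)
        mr = miss_rate(reuse_hist, lines)
        est = cpi(width=4, mem_accesses_per_insn=0.3, mr=mr, mem_latency=200, mlp=4)
        print(f"{kb:5d} KB cache: miss rate {mr:.2f}, estimated CPI {est:.2f}")
```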
