Andreas Sembrant | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Andreas Sembrant is active.

Explore More

Publication

Featured researches published by Andreas Sembrant.

ieee international symposium on workload characterization | 2011

Efficient software-based online phase classification

Andreas Sembrant; David Eklov; Erik Hagersten

Many programs exhibit execution phases with time-varying behavior. Phase detection has been used extensively to find short and representative simulation points, used to quickly get representative simulation results for long-running applications. Several proposals for hardware-assisted phase detection have also been proposed to guide various forms of optimizations and hardware configurations.

international symposium on microarchitecture | 2013

TLC: a tag-less cache for reducing dynamic first level cache energy

Andreas Sembrant; Erik Hagersten; David Black-Shaffer

First level caches are performance-critical and are therefore optimized for speed. To do so, modern processors reduce the miss ratio by using set-associative caches and optimize latency by reading all ways in parallel with the TLB and tag lookup. However, this wastes energy since only data from one way is actually used. To reduce energy, phased-caches and way-prediction techniques have been proposed wherein only data of the matching/predicted way is read. These optimizations increase latency and complexity, making them less attractive for first level caches. Instead of adding new functionality on top of a traditional cache, we propose a new cache design that adds way index information to the TLB. This allow us to: 1) eliminate extra data array reads (by reading the right way directly), 2) avoid tag comparisons (by eliminating the tag array), 3) filter out misses (by checking the TLB), and 4) amortize the TLB lookup energy (by integrating it with the way information). In addition, the new cache can directly replace existing caches without any modification to the processor core or software. This new Tag-Less Cache (TLC) reduces the dynamic energy for a 32 kB, 8-way cache by 78% compared to a VIPT cache without affecting performance.

modeling, analysis, and simulation on computer and telecommunication systems | 2012

Power-Sleuth: A Tool for Investigating Your Program's Power Behavior

Vasileios Spiliopoulos; Andreas Sembrant; Stefanos Kaxiras

Modern processors support aggressive power saving techniques to reduce energy consumption. However, traditional profiling techniques have mainly focused on performance, which does not accurately reflect the power behavior of applications. For example, the longest running function is not always the most energy-hungry function. Thus software developers cannot always take full advantage of these power-saving features. We present Power-Sleuth, a power/performance estimation tool which is able to provide a full description of an applications behavior for any frequency from a single profiling run. The tool combines three techniques: a power and a performance estimation model with a program phase detection technique to deliver accurate, per-phase, per-frequency analysis. Our evaluation (against real power measurements) shows that we can accurately predict power and performance across different frequencies with average errors of 3.5% and 3.9% respectively.

symposium on code generation and optimization | 2012

Phase guided profiling for fast cache modeling

Andreas Sembrant; David Black-Schaffer; Erik Hagersten

Statistical cache models are powerful tools for understanding application behavior as a function of cache allocation. However, previous techniques have modeled only the average application behavior, which hides the effect of program variations over time. Without detailed time-based information, transient behavior, such as exceeding bandwidth or cache capacity, may be missed. Yet these events, while short, often play a disproportionate role and are critical to understanding program behavior. In this work we extend earlier techniques to incorporate program phase information when collecting runtime profiling data. This allows us to model an applications cache miss ratio as a function of its cache allocation over time. To reduce overhead and improve accuracy we use online phase detection and phase-guided profiling. The phase-guided profiling reduces overhead by more intelligently selecting portions of the application to sample, while accuracy is improved by combining samples from different instances of the same phase. The result is a new technique that accurately models the time-varying behavior of an applications miss ratio as a function of its cache allocation on modern hardware. By leveraging phase-guided profiling, this work both improves on the accuracy of previous techniques and reduces the overhead.

ieee international symposium on workload characterization | 2012

Phase behavior in serial and parallel applications

Andreas Sembrant; David Black-Schaffer; Erik Hagersten

It is well known that most serial programs exhibit time varying behavior, for example, alternating between memory- and compute-bound phases. However, most research into program phase behavior has focused on the serial SPEC benchmark suite, with little investigations into large scale phase behavior in parallel applications. In this study we compare and examine the time-varying behavior of the SPEC2006 (serial) and the PARSEC 2.1 (parallel) benchmarks suites, and investigate the program phase behavior found in parallel applications with different parallelization models. To this end, we extend a general purpose runtime phase desection library to handle parallel applications. Our results reveal that serial applications have significantly more program phases (2.4x) with larger variation in CPI (1.5x) compared to parallel applications. While the number of phases are fewer in parallel applications, there still exists interesting phase behavior. In particular, we find that data-parallel applications have shorter phases with more threads. This makes phase-guided runtime optimizations (e.g., dynamic voltage frequency scaling) less attractive as the number of threads grows. Meaning it is much more difficult to exploit runtime optimizations in parallel applications.

ACM Sigarch Computer Architecture News | 2014

The Direct-to-Data (D2D) cache: navigating the cache hierarchy with a single lookup

Andreas Sembrant; Erick Hagersten; David Black-Schaffer

Modern processors optimize for cache energy and performance by employing multiple levels of caching that address bandwidth, low-latency and high-capacity. A request typically traverses the cache hierarchy, level by level, until the data is found, thereby wasting time and energy in each level. In this paper, we present the Direct-to-Data (D2D) cache that locates data across the entire cache hierarchy with a single lookup.

symposium on computer architecture and high performance computing | 2012

Low Overhead Instruction-Cache Modeling Using Instruction Reuse Profiles

Muneeb Khan; Andreas Sembrant; Erik Hagersten

Performance loss caused by L1 instruction cache misses varies between different architectures and cache sizes. For processors employing power-efficient in-order execution with small caches, performance can be significantly affected by instruction cache misses. The growing use of low-power multi-threaded CPUs (with shared L1 caches) in general purpose computing platforms requires new efficient techniques for analyzing application instruction cache usage. Such insight can be achieved using traditional simulation technologies modeling several cache sizes, but the overhead of simulators may be prohibitive for practical optimization usage. In this paper we present a statistical method to quickly model application instruction cache performance. Most importantly we propose a very low-overhead sampling mechanism to collect runtime data from the applications instruction stream. This data is fed to the statistical model which accurately estimates the instruction cache miss ratio for the sampled execution. Our sampling method is about 10x faster than previously suggested sampling approaches, with average runtime overhead as low as 25% over native execution. The architecturally-independent data collected is used to accurately model miss ratio for several cache sizes simultaneously, with average absolute error of 0.2%. Finally, we show how our tool can be used to identify program phases with large instruction cache footprint. Such phases can then be targeted to optimize for reduced code footprint.

international symposium on computer architecture | 2015

Cost-effective speculative scheduling in high performance processors

Arthur Perais; André Seznec; Pierre Michaud; Andreas Sembrant; Erik Hagersten

To maximize peiformance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g. loads are typically assumed to hit in the LI cache and dependent instructions are issued accordingly. This form of speculation - that we refer to as speculative scheduling - has been used for two decades in real processors, but has received little attention from the research community. In particular, as pipeline depth grows, and the distance between the Issue and the Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source( s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing peiformance and increasing energy consumption. In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose an easily implementable, low-cost solution to reduce the number of replays caused by Ll bank conflicts. Schedule shifting always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed because of a bank conflict. Its dependents are thus always issued with the corresponding delay. Second, we also improve on existing Ll hit/miss prediction schemes by taking into account instruction criticality. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.

high-performance computer architecture | 2017

A Split Cache Hierarchy for Enabling Data-Oriented Optimizations

Andreas Sembrant; Erik Hagersten; David Black-Schaffer

Todays caches tightly couple data with metadata (Address Tags) at the cache line granularity. The co-location of data and its identifying metadata means that they require multiple approaches to locate data (associative way searches and level-by-level searches), evict data (coherent writebacks buffers and associative level-by-level searches) and keep data coherent (directory indirections and associative level-by-level searches). This results in complex implementations with many corner cases, increased latency and energy, and limited flexibility for data optimizations. We propose splitting the metadata and data into two separate structures: a metadata hierarchy and a data hierarchy. Themetadata hierarchy tracks the location of the data in the data hierarchy. This allows us to easily apply many differentoptimizations to the data hierarchy, including smart data placement, dynamic coherence, and direct accesses. The new split cache hierarchy, Direct-to-Master (D2M), provides a unified mechanism for cache searching, eviction, and coherence, that eliminates level-by-level data movement and searches, associative cache address tags comparisons andabout 90% of the indirections through a central directory. Optimizations such as moving LLC slices to the near-side ofthe network and private/shared data classification can easily be built on top off D2M to further improve its efficiency. Thisapproach delivers a 54% improvement in cache hierarchy EDP vs. a mobile processor and 40% vs. a server processor, reducesnetwork traffic by an average of 70%, reduces the L1 miss latency by 30% and is especially effective for workloads with high cache pressure.

international symposium on computer architecture | 2014