Yiannakis Sazeides | Researchain

Publication

Featured research published by Yiannakis Sazeides.

international symposium on microarchitecture | 1997

The predictability of data values

Yiannakis Sazeides; James E. Smith

The predictability of data values is studied at a fundamental level. Two basic predictor models are defined: computational predictors perform an operation on previous values to yield predicted next values. Examples we study are stride value prediction (which adds a delta to a previous value) and last value prediction (which performs the trivial identity operation on the previous value). Context based predictors match recent value history (context) with previous value history and predict values based entirely on previously observed patterns. To understand the potential of value prediction we perform simulations with unbounded prediction tables that are immediately updated using correct data values. Simulations of integer SPEC95 benchmarks show that data values can be highly predictable. Best performance is obtained with context based predictors; overall prediction accuracies are between 56% and 91%. The context based predictor typically has an accuracy about 20% better than the computational predictors (last value and stride). Comparison of context based prediction and stride prediction shows that the higher accuracy of context based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors. Among different instruction types, predictability varies significantly. In general, load and shift instructions are more difficult to predict correctly, whereas add instructions are more predictable.
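The three predictor families named in this abstract can be sketched as follows. This is a minimal illustration, not the paper's simulator: the class names, the per-instruction `pc` key, the context order of 2, and the "most recent follower wins" table policy are all assumptions made for the example. As in the abstract, the tables are unbounded and updated immediately with the correct value.

```python
from collections import defaultdict


class LastValuePredictor:
    """Computational predictor: identity operation on the previous value."""

    def __init__(self):
        self.last = {}  # unbounded table: pc -> last value seen

    def predict(self, pc):
        return self.last.get(pc)

    def update(self, pc, value):
        self.last[pc] = value


class StridePredictor:
    """Computational predictor: previous value plus the last observed delta."""

    def __init__(self):
        self.last = {}
        self.stride = {}

    def predict(self, pc):
        if pc not in self.last:
            return None
        return self.last[pc] + self.stride.get(pc, 0)

    def update(self, pc, value):
        if pc in self.last:
            self.stride[pc] = value - self.last[pc]
        self.last[pc] = value


class ContextPredictor:
    """Context based predictor: maps the last `order` values to the value
    that most recently followed that context (hypothetical policy)."""

    def __init__(self, order=2):
        self.order = order
        self.history = defaultdict(tuple)  # pc -> recent values
        self.table = {}                    # (pc, context) -> next value

    def predict(self, pc):
        return self.table.get((pc, self.history[pc]))

    def update(self, pc, value):
        context = self.history[pc]
        self.table[(pc, context)] = value
        self.history[pc] = (context + (value,))[-self.order:]


def accuracy(predictor, trace):
    """Fraction of (pc, value) pairs predicted correctly, updating the
    predictor with the correct value immediately after each prediction."""
    correct = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            correct += 1
        predictor.update(pc, value)
    return correct / len(trace)


# Toy traces for a single static instruction (pc = 0); values are made up.
stride_trace = [(0, 4 * i) for i in range(10)]     # 0, 4, 8, ...
pattern_trace = [(0, v) for v in [1, 2, 3] * 4]    # 1, 2, 3, 1, 2, 3, ...
```

On the toy `stride_trace` the stride predictor wins, while on the repeating `pattern_trace` only the context predictor learns the pattern, which is the kind of per-instruction difference that motivates hybrid predictors.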

international symposium on computer architecture | 2002

Design tradeoffs for the Alpha EV8 conditional branch predictor

André Seznec; Stephen Felix; Venkata Krishnan; Yiannakis Sazeides

This paper presents the Alpha EV8 conditional branch predictor. The Alpha EV8 microprocessor project, canceled in June 2001 in a late phase of development, envisioned an aggressive 8-wide issue out-of-order superscalar microarchitecture featuring a very deep pipeline and simultaneous multithreading. Performance of such a processor is highly dependent on the accuracy of its branch predictor and consequently a very large silicon area was devoted to branch prediction on EV8. The Alpha EV8 branch predictor relies on global history and features a total of 352 Kbits. The focus of this paper is on the different trade-offs performed to overcome various implementation constraints for the EV8 branch predictor. One such instance is the pipelining of the predictor on two cycles to facilitate the prediction of up to 16 branches per cycle from any two dynamically successive, 8-instruction fetch blocks. This resulted in the use of three-fetch-block-old compressed branch history information for accessing the predictor. Implementation constraints also restricted the composition of the index functions for the predictor and forced the usage of only single-ported memory cells. Nevertheless, we show that the Alpha EV8 branch predictor achieves prediction accuracy in the same range as the state-of-the-art academic global history branch predictors that do not consider implementation constraints in great detail.

ACM Sigarch Computer Architecture News | 2005

Performance implications of single thread migration on a chip multi-core

Theofanis Constantinou; Yiannakis Sazeides; Pierre Michaud; Damien Fetis; André Seznec

High performance multi-core processors are becoming an industry reality. Although multi-cores are suited for multithreaded and multi-programmed workloads, many applications are still single-threaded, and multi-core performance with a single-thread workload is an important issue. Furthermore, recent studies suggest that performance, power and temperature considerations of future multi-cores may necessitate activity migration between cores. Motivated by the above, this paper investigates the performance implications of single thread migration on a multi-core. Specifically, the study considers the influence on the performance of a single thread of the following migration and multi-core parameters: frequency of migration, core warm-up modes, subset of resources that are warmed up, number of cores, and cache hierarchy organization. The results of this study can provide insight to architects on how to design performance-efficient power and thermal strategies for a multi-core chip. The experimental results, for the benchmarks and microarchitectures used in this study, show that the performance loss due to activity migration on a multi-core with private L1s and a shared L2 can be minimized if: (a) a migrating thread continues its execution on a core that was previously visited by the thread, and (b) cores remember their predictor state since their previous activation (all other core resources can be cold). The analogous conclusions for a multi-core with private L1s and L2s and a shared L3 are: remembering the predictor state, maintaining the tags of the various L2 caches coherent and allowing L2-L2 data transfers from inactive cores to the active core. The data also show that when the migration period is at least 160K cycles, the transfer of register state between two cores and the flushing of dirty private L1 data have a negligible performance overhead.

international symposium on microarchitecture | 1996

The performance potential of data dependence speculation and collapsing

Yiannakis Sazeides; Stamatis Vassiliadis; James E. Smith

Two hardware methods for remedying the effects of true data dependences are studied. The first method, dependence speculation, is used to eliminate address generation-load dependences. This is enabled by address prediction that permits load instructions to proceed speculatively without waiting for their address operands. The second technique, dependence collapsing, is used to eliminate data dependences by combining a dependence among multiple instructions into one instruction. The potential of these techniques for improving processor performance is demonstrated via trace-driven simulation. When both techniques are used with maximum issue widths of 4, 8, 16, and 32, the overall speedups in comparison to a base instruction level parallel machine are 1.20, 1.35, 1.51, and 1.66, respectively. In general, dependence collapsing contributes the majority of the improvement in performance. Under the dependence collapsing model, 29% to 47% of the total number of instructions in a trace may be collapsed. The distance separating the collapsed instructions is nearly always less than 8. Our experimentation also suggests that further performance improvements can be achieved by incorporating mechanisms that increase the address prediction rate.

international symposium on computer architecture | 1998

Modeling program predictability

Yiannakis Sazeides; James E. Smith

Basic properties of program predictability, for both values and control, are defined and studied. We take the view that program predictability originates at certain points during a program's execution, flows through subsequent instructions, and then ends at other points in the program. These key components of predictability (generation, propagation, and termination) are defined in terms of a model. The model is based on a graph derived from dynamic data dependences and a predictor. Using the SPEC95 benchmarks, we analyze the predictability phenomena both separately and in combination. Examples are provided to illustrate relationships between model-based characteristics and program constructs. It is shown that most predictability derives from program control structure and immediate values, not program input data. Furthermore, most predictability originates from a relatively small number of generate points. The analysis of obtained results suggests a number of ramifications regarding predictability and its use.

ACM Transactions on Architecture and Code Optimization | 2007

A study of thread migration in temperature-constrained multicores

Pierre Michaud; André Seznec; Damien Fetis; Yiannakis Sazeides; Theofanis Constantinou

Temperature has become an important constraint in high-performance processors, especially multicores. Thread migration will be essential to exploit the full potential of future thermally constrained multicores. We propose and study a thread migration method that maximizes performance under a temperature constraint, while minimizing the number of migrations and ensuring fairness between threads. We show that thread migration brings important performance gains and that it is most effective during the first tens of seconds following a decrease of the number of running threads.

international symposium on performance analysis of systems and software | 2001

How to compare the performance of two SMT microarchitectures

Yiannakis Sazeides; Toni Juan

In this paper we discuss methods and metrics for comparing the performance of two simultaneous multithreading microarchitectures. We identify conditions under which the instructions-per-cycle metric may be misleading for comparing two simultaneous multithreading microarchitectures for the same amount of work. Part of the problem is isolated to the definition of what counts as the same work. When simulating a mix of independent programs under the same initial conditions on two different simultaneous multithreading microarchitectures, there are two approaches to ensure the work of the two runs is the same: constant-work-per-thread or variable-work-per-thread. For both approaches the total number of instructions in the run is constant; however, for the first, the number of instructions from each thread is also constant, whereas for the second it is not. We claim that: (a) when simulating two microarchitectures with the constant-work-per-thread approach, the instructions-per-cycle metric is sufficient to compare them and establish the microarchitecture with the best performance, (b) when the variable-work-per-thread approach is used, the instructions-per-cycle metric may be inadequate for comparing performance. We attribute this to the inability of the instructions-per-cycle metric to account for differences in the load balance of the two runs. A new performance metric, SMT-speedup, is proposed that enables accurate comparison of the performance of two simultaneous multithreading microarchitectures for runs with different load balance. The new metric considers the load balance in terms of the size and performance of each thread. In light of the insight gained in this paper, we contend that a simultaneous multithreading microarchitecture may need to trade off throughput and load balance to achieve the best performance.

IEEE Micro | 2012

Optimizing Data-Center TCO with Scale-Out Processors

Boris Grot; Damien Hardy; Pejman Lotfi-Kamran; Babak Falsafi; Chrysostomos Nicopoulos; Yiannakis Sazeides

Performance and total cost of ownership (TCO) are key optimization metrics in large-scale data centers. According to these metrics, data centers designed with conventional server processors are inefficient. Recently introduced processors based on low-power cores can improve both throughput and energy efficiency compared to conventional server chips. However, a specialized Scale-Out Processor (SOP) architecture maximizes on-chip computing density to deliver the highest performance per TCO and performance per watt at the data-center level.

IEEE Computer Architecture Letters | 2005

The Danger of Interval-Based Power Efficiency Metrics: When Worst Is Best

Yiannakis Sazeides; Rakesh Kumar; Dean M. Tullsen; Theofanis Constantinou

This paper shows that if the execution of a program is divided into distinct intervals, it is possible for one processor or configuration to provide the best power efficiency over every interval, and yet have worse overall power efficiency over the entire execution than other configurations. This unintuitive behavior is a result of a seemingly intuitive use of power efficiency metrics, and can result in suboptimal design and execution decisions. This behavior may occur when using the energy-delay product and energy-delay² product metrics but not with the energy metric.
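A small worked example may make the effect concrete. The numbers below are invented for illustration and do not come from the paper; the function names are likewise hypothetical. Energy and delay add across intervals, but their product does not, so a configuration can win on energy-delay product in every interval and still lose over the whole run.

```python
def edp(energy, delay):
    """Energy-delay product for one interval or for a whole run."""
    return energy * delay


def overall(intervals, metric):
    """Apply a metric to the totals of a run given as (energy, delay) pairs."""
    total_energy = sum(energy for energy, _ in intervals)
    total_delay = sum(delay for _, delay in intervals)
    return metric(total_energy, total_delay)


# Two configurations over the same two intervals: (energy, delay) per interval.
# Illustrative values only.
config_a = [(1, 10), (10, 1)]
config_b = [(4, 3), (3, 4)]

per_interval_a = [edp(e, d) for e, d in config_a]  # [10, 10]
per_interval_b = [edp(e, d) for e, d in config_b]  # [12, 12]

overall_a = overall(config_a, edp)  # (1 + 10) * (10 + 1) = 121
overall_b = overall(config_b, edp)  # (4 + 3) * (3 + 4) = 49
```

Here configuration A has the lower (better) energy-delay product in both intervals, yet its overall value of 121 is worse than B's 49. The same reversal cannot happen with plain energy, because the overall energy is just the sum of the per-interval energies.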

international symposium on performance analysis of systems and software | 2010

Performance-effective operation below Vcc-min

Nikolas Ladas; Yiannakis Sazeides; Veerle Desmet

Continuous circuit miniaturization and increased process variability point to a future with diminishing returns from dynamic voltage scaling. Operation below Vcc-min has been proposed recently as a means to reverse this trend. The goal of this paper is to minimize the performance loss due to reduced cache capacity when operating below Vcc-min. A simple method is proposed: disable faulty blocks at low voltage. The method is based on observations regarding the distributions of faults in an array according to probability theory. The key lesson from the probability analysis is that, as the number of uniformly distributed random faulty cells in an array increases, the faults increasingly occur in already faulty blocks. The probability analysis is also shown to be useful for obtaining insight about the reliability implications of other cache techniques. For one configuration used in this paper, block disabling is shown to have on average 6.6% and up to 29% better performance than a previously proposed scheme for low voltage cache operation. Furthermore, block disabling is simpler and less costly to implement and does not degrade performance when operating at or above Vcc-min. Finally, it is shown that a victim cache enables higher and more deterministic performance for a block-disabled cache.
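The "key lesson" about faults landing in already faulty blocks can be illustrated with a short calculation. The sketch below assumes a specific model that the abstract does not spell out: exactly `num_faults` distinct faulty cells chosen uniformly at random from an array of `num_blocks * cells_per_block` cells. Under that assumption a given block is fault-free with hypergeometric probability C(N - k, n) / C(N, n); the function name and array sizes are hypothetical.

```python
from math import comb


def expected_faulty_blocks(num_blocks, cells_per_block, num_faults):
    """Expected number of blocks containing at least one faulty cell,
    assuming num_faults distinct cells are chosen uniformly at random."""
    total_cells = num_blocks * cells_per_block
    # Probability that one particular block contains none of the faults.
    p_block_clean = (
        comb(total_cells - cells_per_block, num_faults)
        / comb(total_cells, num_faults)
    )
    return num_blocks * (1 - p_block_clean)


# Illustrative array: 64 blocks of 8 cells each.
expected = [expected_faulty_blocks(64, 8, n) for n in range(0, 101)]

# Extra blocks lost per additional faulty cell.
marginal = [after - before for before, after in zip(expected, expected[1:])]
```

In this model each additional faulty cell disables fewer new blocks than the one before it (the `marginal` values shrink), because a growing share of faults falls into blocks that are already disabled.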
