Kyriakos Stavrou
Intel
Publication
Featured research published by Kyriakos Stavrou.
architectural support for programming languages and operating systems | 2014
Marc Lupon; Enric Gibert; Grigorios Magklis; Sridhar Samudrala; Raúl Martínez; Kyriakos Stavrou; David R. Ditzel
A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting efforts to work as expected when they are compiled to operate with FMA, a cost that developers may not be willing to pay. As a result, many legacy applications are not able to utilize FMA instructions. In this paper we propose a novel HW/SW collaborative technique that is able to efficiently execute workloads with increased utilization of FMA, by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that uses a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
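The numerical difference between single rounding (FMA) and rounding after the multiply as well (the behavior the abstract attributes to CMA) can be illustrated with a small sketch. This is not the authors' hardware or translator; it only emulates the two rounding behaviors in software. The function names `fma_single_rounding` and `cma_double_rounding` are hypothetical, and the exact result is obtained with Python's `fractions.Fraction` rather than a hardware FMA unit. It assumes Python floats are IEEE 754 doubles with round-to-nearest-even.

```python
from fractions import Fraction


def fma_single_rounding(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, round once."""
    # Fraction(float) is exact, so the sum below carries no rounding error.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting the exact rational back to float performs the single rounding.
    return float(exact)


def cma_double_rounding(a: float, b: float, c: float) -> float:
    """Emulate a separate multiply and add: round after each operation."""
    product = a * b      # first rounding (product rounded to a double)
    return product + c   # second rounding (sum rounded to a double)


# Inputs chosen so that the exact product 1 - 2**-54 is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

single = fma_single_rounding(a, b, c)   # equals -(2.0**-54)
double = cma_double_rounding(a, b, c)   # equals 0.0

print("single rounding (FMA-like):", single)
print("double rounding (CMA-like):", double)
```

With these inputs the two paths disagree: the intermediate rounding discards the low-order bits of the product, so the double-rounding result is exactly zero while the single-rounding result keeps the small residual. Code validated against the multiply-then-add result would see a different value if recompiled to use FMA, which is the compatibility problem the abstract describes.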
symposium on code generation and optimization | 2014
Aleksandar Branković; Kyriakos Stavrou; Enric Gibert; Antonio González
Evaluation techniques in microprocessor design are mostly based on simulating selected application samples using a cycle-accurate simulator. In order to achieve accurate results, microarchitectural structures are warmed up for a few million instructions prior to statistics collection. Unfortunately, this strategy cannot be applied to HW/SW co-designed processors, in which a Transparent Optimization software Layer (TOL) translates and optimizes code on-the-fly from a guest ISA to an internal host custom microarchitecture. We show that the warm-up period in this case needs to be 3-4 orders of magnitude longer than what is needed for traditional microprocessor designs because the TOL state needs to be warmed up as well. In this paper, we propose a novel simulation technique for HW/SW co-designed processors based on adapting the optimization promotion thresholds using high-level application statistics in order to find the best trade-off between accuracy and simulation cost. In particular, the proposed technique reduces the simulation cost by 65x with an average error of just 0.75%. Furthermore, as opposed to other alternatives, the proposed technique satisfies the additional requirement of allowing evaluation using different TOL and microarchitectural configurations.
computing frontiers | 2013
Aleksandar Branković; Kyriakos Stavrou; Enric Gibert; Antonio González
Dynamic Binary Translators and Optimizers (DBTOs) have become a common architecture in recent years. They are used in many different systems, such as emulation, instrumentation tools and innovative HW/SW co-designed microarchitectures. Although many researchers have worked on characterizing and reducing the emulation overhead, there are no published results that explain how the DBTO behaves from the microarchitectural perspective and how its behavior may be predicted based on high-level guest application statistics. Such results are important for guiding design decisions and system optimization. In this paper we study the DBTO as an independent application by dividing its functionality into modules. We show that the behavior of the DBTO is not constant at all. The contribution of the different modules to the total overhead, the overhead itself, the microarchitectural interaction with the emulated application and the microarchitectural profile of the different modules change significantly based on the emulated application. This result contrasts with numerous papers that consider this behavior constant and exclude the DBTO from the simulation. Throughout this paper we detail this variance, quantify it and explain the reasons behind it. The insights presented in this work can be exploited towards the design of more efficient DBTOs and their early performance evaluation.
computing frontiers | 2014
Aleksandar Branković; Kyriakos Stavrou; Enric Gibert; Antonio González
Archive | 2014
Kyriakos Stavrou; Pedro Marcuello; Grigorios Magklis; Javier Carretero Casado; Juan Fernández; Carlos Madriles; Daniel Ortega; Demos Pavlou
Archive | 2013
Raúl Martínez; Enric Gibert Codina; Marc Lupon; Kyriakos Stavrou
Archive | 2013
Marc Lupon; Raúl Martínez; Enric Gibert Codina; Kyriakos Stavrou; Grigorios Magklis; Sridhar Samudrala
Archive | 2017
Grigorios Magklis; Josep M. Codina; Craig B. Zilles; Michael Neilly; Sridhar Samudrala; Alejandro Martinez Vicente; Polychronis Xekalakis; F. Jesús Sánchez; Marc Lupon; Georgios Tournavitis; Enric Gibert Codina; Crispin Gomez Requena; Antonio González; Mirem Hyuseinova; Christos E. Kotselidis; Fernando Latorre; Pedro Lopez; Carlos Madriles Gimeno; Pedro Marcuello; Raúl Martínez; Daniel Ortega; Demos Pavlou; Kyriakos Stavrou
Archive | 2014
Lutz Naethke; Axel Borkowski; Bert Bretschneider; Kyriakos Stavrou; Rainer Theuer
Archive | 2013
Marc Lupon; Grigorios Magklis; Sridhar Samudrala; Raúl Martínez; Kyriakos Stavrou; Enric Gibert Codina