Sotiris Tselonis
National and Kapodistrian University of Athens
Publications
Featured research published by Sotiris Tselonis.
ieee international symposium on workload characterization | 2015
Sotiris Tselonis; Athanasios Chatzidimitriou; Nikos Foutris; Dimitris Gizopoulos
Fault injection on microarchitectural structures modeled in performance simulators is an effective method for assessing microprocessor reliability in early design stages. Compared to lower-level fault injection approaches, it is orders of magnitude faster and allows execution of large portions of workloads to study the effect of faults on the final program output. Moreover, for many important hardware components it delivers accurate reliability estimates compared to analytical methods, which are fast but are known to significantly over-estimate a structure's vulnerability to faults. This paper investigates the effectiveness of microarchitectural fault injection for x86 and ARM microprocessors in a differential way: by developing and comparing two fault injection frameworks on top of the most popular performance simulators, MARSS and Gem5. The injectors, called MaFIN and GeFIN (for MARSS-based and Gem5-based Fault Injector, respectively), are designed for accurate reliability studies and deliver several contributions, among which: (a) reliability studies for a wide set of fault models on major hardware structures (for different sizes and organizations), (b) a study of the reliability sensitivity of microarchitectural structures for the same ISA (x86) implemented on two different simulators, and (c) a study of the reliability of workloads and microarchitectures for the two most popular ISAs (ARM vs. x86). For the workloads of our experimental study, we analyze the common trends observed in the CPU reliability assessments produced by the two injectors. We also explain the sources of difference when diverging reliability reports are provided by the tools. Both the common trends and the differences are attributed to fundamental implementation choices of the simulators and are supported by benchmark runtime statistics.
The insights of our analysis can guide the selection of the most appropriate tool for hardware reliability studies (and thus decision-making for protection mechanisms) on certain microarchitectures for the popular x86 and ARM ISAs.
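The core operation of a microarchitectural fault injector of this kind can be illustrated with a minimal sketch: pick a random bit in a modeled storage structure (here a toy register file represented as a list of integers) and flip it, then let the simulation run to completion against a fault-free "golden" run. This is an illustrative simplification, not MaFIN/GeFIN code; all names below are hypothetical.

```python
import random

def flip_bit(value, bit):
    """Flip one bit of a stored value, emulating a single-event upset."""
    return value ^ (1 << bit)

def inject(register_file, rng, reg_width=64):
    """Pick a random register and bit position and inject a transient fault."""
    idx = rng.randrange(len(register_file))
    bit = rng.randrange(reg_width)
    register_file[idx] = flip_bit(register_file[idx], bit)
    return idx, bit

rng = random.Random(0)
regs = [0] * 8           # toy register file, golden value of every entry is 0
idx, bit = inject(regs, rng)
assert regs[idx] == 1 << bit   # exactly one bit now deviates from golden
```

A real campaign repeats this thousands of times, each run injecting at a random cycle and location, and statistically aggregates the program outcomes.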
international on-line testing symposium | 2014
Nikos Foutris; Sotiris Tselonis; Dimitris Gizopoulos
Forthcoming technologies hold the promise of a significant increase in integration density, performance and functionality. However, a dramatic change in microprocessor reliability is also expected. Developing mechanisms for early and accurate reliability estimation will save significant design effort and resources, and consequently will positively impact products' time-to-market (TTM). In this paper, we propose a versatile architecture-level fault injection framework, built on top of a state-of-the-art x86 microprocessor simulator, for thorough and fast characterization of a wide range of hardware components with respect to various fault models.
international symposium on performance analysis of systems and software | 2016
Sotiris Tselonis; Dimitris Gizopoulos
Modern many-core Graphics Processing Units (GPUs) are extensively employed in general purpose computing (GPGPU), offering a remarkable execution speedup to inherently data parallel workloads. Unlike graphics computing, GPGPU computing has more stringent reliability requirements. Thus, accurate reliability assessment of GPU hardware structures is important for making informed decisions for error protection. In this paper we focus on microarchitecture-level reliability assessment for GPU architectures. The paper makes the following contributions. First, it presents a comprehensive fault injection framework that targets key hardware structures of GPU architectures such as the register file, the shared memory, the SIMT stack and the instruction buffer, which altogether occupy a large part of a modern GPU's silicon area. Second, it reports our reliability assessment findings for the target structures when the GPU executes a diverse set of twelve GPGPU applications. Third, it discusses remarkable differences in the results of fault injection when the applications are simulated on the virtual NVIDIA GPU instruction set (PTX) vs. the actual instruction set (SASS). Finally, it discusses how the framework can be employed either by architects in the early stages of the design phase or by programmers to enhance the error resilience of GPU applications.
international on-line testing symposium | 2013
Sotiris Tselonis; Vasilis Dimitsas; Dimitris Gizopoulos
Massively parallel many-core Graphics Processing Unit (GPU) architectures offer significant performance speedup in workloads with thread-level parallelism compared to contemporary multicore CPUs. For this reason, general-purpose computing using GPUs (GPGPU) is a rapidly expanding research direction in different contexts. Unlike graphics processing, GPGPU computing requires reliable operation in the presence of hardware faults whose occurrence probabilities in current and forthcoming advanced manufacturing technologies will be significant. In this paper, we focus on the aspect of tolerance of GPUs to permanent faults in their most critical storage elements: register files. By performing a comprehensive fault injection campaign on a cycle-accurate GPGPU architectural simulator, we first evaluate and classify the behavior of NVIDIA GPU CUDA kernels in the presence of permanent faults in registers. Moreover, we analyze the performance tolerance of GPUs when they operate in degraded mode (less hardware resources, less thread-level parallelism) due to the presence of multiple permanent faults in the registers of their streaming multiprocessors. Our findings confirm the intuitively expected tolerance of these architectures to faults and also quantify it in different configurations and modes.
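A fault injection campaign like the one described above classifies each injected run by comparing it against a fault-free (golden) execution. A minimal sketch of such an outcome classifier is shown below; the three-way split (masked, silent data corruption, crash/hang) is the standard taxonomy in this literature, though the exact category names used by the authors may differ.

```python
def classify(run_output, golden_output, crashed):
    """Classify one fault-injection run against the golden run."""
    if crashed:
        return "DUE"      # detected unrecoverable error: crash or hang
    if run_output == golden_output:
        return "Masked"   # the fault never affected the program output
    return "SDC"          # silent data corruption: wrong output, no symptom

# Example: a run that completed but produced a wrong result
assert classify("result=41", "result=42", crashed=False) == "SDC"
```

Aggregating these labels over many runs yields the per-structure vulnerability estimates the paper reports.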
Microprocessors and Microsystems | 2015
Alessandro Vallero; Sotiris Tselonis; Nikos Foutris; Maha Kooli; Alessandro Savino; Gianfranco Michele Maria Politano; Alberto Bosio; G. Di Natale; Dimitris Gizopoulos; S. Di Carlo
Advanced computing systems realized in forthcoming technologies hold the promise of a significant increase in computational capabilities. However, the same path that is leading technologies toward these remarkable achievements is also making electronic devices increasingly unreliable. Developing new methods to evaluate the reliability of these systems in an early design stage has the potential to save costs, produce optimized designs and have a positive impact on the product time-to-market. The CLERECO European FP7 research project addresses early reliability evaluation with a cross-layer approach across different computing disciplines, across computing system layers and across computing market segments. The fundamental objective of the project is to investigate in depth a methodology to assess system reliability early in the design cycle of the future systems of the emerging computing continuum. This paper presents a general overview of the CLERECO project, focusing on the main tools and models being developed that could be of interest to the research community and engineering practice.
vlsi test symposium | 2017
Athanasios Chatzidimitriou; Sotiris Tselonis; Dimitris Gizopoulos
Technology evolution has raised serious reliability considerations, as transistor dimensions shrink and modern microprocessors become denser and more vulnerable to faults. Reliability studies have proposed a plethora of methodologies for assessing system vulnerability which, however, rely heavily on traditional reliability metrics that solely express failure rate over time. Although Failures In Time (FIT) is a very strong and representative reliability metric, it may fail to offer an objective comparison of highly diverse systems, such as CPUs against GPUs or other accelerators, which often execute implementations of the same algorithms on their respective platforms.
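For context, a FIT rate is the expected number of failures per billion device-hours. A common way to derive it for a hardware structure is to scale the technology's raw per-bit soft-error rate by the structure's size and by a vulnerability factor measured via fault injection. The sketch below shows that standard calculation with illustrative numbers; the specific values and the `avf` name are assumptions, not figures from the paper.

```python
def fit_rate(raw_fit_per_bit, num_bits, avf):
    """FIT = expected failures per 1e9 device-hours.

    raw_fit_per_bit: intrinsic soft-error rate of one bit (technology-dependent)
    num_bits:        number of bits in the structure (e.g. a 4 KB register file)
    avf:             fraction of faults that actually cause a failure,
                     as measured by a fault injection campaign
    """
    return raw_fit_per_bit * num_bits * avf

# Hypothetical example: 32 Kbit structure, 0.001 FIT/bit, 20% of faults visible
structure_fit = fit_rate(0.001, 32 * 1024, 0.20)
```

The paper's point is that two systems with comparable FIT rates can still differ enormously in work completed per unit time, which is why a time-only metric can mislead cross-platform comparisons.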
international test conference | 2016
Alessandro Vallero; Alessandro Savino; Gianfranco Michele Maria Politano; S. Di Carlo; Athanasios Chatzidimitriou; Sotiris Tselonis; Dimitris Gizopoulos; Marc Riera; Ramon Canal; Antonio González; Maha Kooli; A. Bosio; G. Di Natale
System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored into such an estimation model, the delivered reliability reports are inevitably overly pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components, considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.
international symposium on performance analysis of systems and software | 2017
Alessandro Vallero; Stefano Di Carlo; Sotiris Tselonis; Dimitris Gizopoulos
State-of-the-art GPU chips are designed to deliver extreme throughput for graphics as well as for data-parallel general purpose computing workloads (GPGPU computing). Unlike graphics computing, GPGPU computing requires highly reliable operation. The performance-oriented design of GPUs calls for jointly evaluating the vulnerability of GPU workloads to soft errors and the performance of GPU chips. We briefly present a summary of the findings of an extensive study aiming at the evaluation of the reliability of four GPU architectures and corresponding chips, correlating it with the performance of the workloads.
international on-line testing symposium | 2015
Alessandro Vallero; Alessandro Savino; Sotiris Tselonis; Nikos Foutris; Gianfranco Michele Maria Politano; Dimitris Gizopoulos; S. Di Carlo
Analyzing the impact of software execution on the reliability of a complex digital system is an increasingly challenging task. Current approaches mainly rely on time-consuming fault injection experiments, which prevents their usage in the early stages of the design process, when fast estimations are required in order to make design decisions. To cope with these limitations, this paper proposes a statistical reliability analysis model based on Bayesian Networks. The proposed approach estimates system reliability considering both the hardware and the software layers of a system, in the presence of transient and permanent hardware faults. When digital system reliability is under analysis, hardware resources of the processor and instructions of program traces are used to build a Bayesian Network, and the probability that input errors alter both the correct behavior of the system and the output of the program is then computed. According to the experimental results presented in this paper, the Bayesian Network model provides accurate reliability estimations in a very short time. It can therefore be a valid alternative to fault injection, especially in the early stages of the design process.
vlsi test symposium | 2016
Sotiris Tselonis; Nikos Foutris; George N. Papadimitriou; Dimitris Gizopoulos
Early decisions in microprocessor design require a careful consideration of the corresponding performance and reliability implications of transient faults. The size and organization of important on-chip hardware components such as caches, register files and buffers have a direct impact on both the microprocessor's resilience to soft errors and the execution time of applications. In this paper, we employ a state-of-the-art x86-64 full-system microarchitectural simulator and a comprehensive fault injection framework built on top of it to deliver a detailed evaluation of the reliability and performance tradeoffs for major hardware components across several important parameters of their design (size, associativity, write policy, etc.). We also propose a simple and flexible fitness function that measures the aggregate effect of such design changes on the reliability and the performance of the studied workload.
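One plausible shape for such a fitness function is a weighted combination of reliability and performance, each normalized to a baseline design so that values below 1.0 indicate a net improvement. This is an illustrative form only; the paper's actual function and weighting may differ, and every name below is hypothetical.

```python
def fitness(fit_rate, exec_time, base_fit, base_time, w=0.5):
    """Illustrative design-change fitness: weighted sum of normalized
    reliability (FIT rate) and performance (execution time) ratios.

    w = 1.0 cares only about reliability, w = 0.0 only about performance.
    A result below 1.0 means the change improves the weighted tradeoff
    relative to the baseline design.
    """
    return w * (fit_rate / base_fit) + (1.0 - w) * (exec_time / base_time)

# Hypothetical example: halving the FIT rate at no performance cost
score = fitness(fit_rate=50.0, exec_time=1.0, base_fit=100.0, base_time=1.0)
assert score == 0.75
```

Such a scalar lets an architect rank cache or register-file configurations on one axis instead of eyeballing two competing curves.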