
Stavros Tzilis

Chalmers University of Technology


Publication

Featured research published by Stavros Tzilis.

Microprocessors and Microsystems | 2013

DeSyRe: On-demand system reliability

Ioannis Sourdis; Christos Strydis; Antonino Armato; Christos-Savvas Bouganis; Babak Falsafi; Georgi Gaydadjiev; Sebastian Isaza; Alirad Malek; R. Mariani; Dionisios N. Pnevmatikatos; Dhiraj K. Pradhan; Gerard K. Rauwerda; Robert M. Seepers; Rishad Ahmed Shafik; Kim Sunesen; Dimitris Theodoropoulos; Stavros Tzilis; Michalis Vavouras

The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-/fault-free system would heavily, even prohibitively, impact the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

international parallel and distributed processing symposium | 2014

A Dependable Coarse-Grain Reconfigurable Multicore Array

Georgios Smaragdos; Danish Anis Khan; Ioannis Sourdis; Christos Strydis; Alirad Malek; Stavros Tzilis

Recent trends in semiconductor technology have dictated the constant reduction of device size. One negative effect stemming from the reduction in size and increased complexity is the reduced device reliability. This paper is centered around the matter of permanent fault tolerance and graceful system degradation in the presence of permanent faults. We take advantage of the natural redundancy of homogeneous multicores following a sparing strategy to reuse functional pipeline stages of faulty cores. This is done by incorporating reconfigurable interconnects next to which the cores of the system are placed, providing the flexibility to redirect the data-flow from the faulty pipeline stages of damaged cores to spare (still) functional ones. Several micro-architectural changes are introduced to decouple the processor stages and allow them to be interchangeable. The proposed approach is a clear departure from previous ones by offering full flexibility as well as highly graceful performance degradation at reasonable costs. More specifically, our coarse-grain fault-tolerant multicore array provides up to ×4 better availability compared to a conventional multicore and up to ×2 higher probability to deliver at least one functioning core in high fault densities. For our benchmarks, our design (synthesized for STM 65nm SP technology) incurs a total execution-time overhead for the complete system ranging from ×1.37 to ×3.3 compared to a (baseline) non-fault-tolerant system, depending on the permanent-fault density. The area overhead is 19.5% and the energy consumption, without incorporating any power/energy-saving technique, is estimated on average to be 20.9% higher compared to the baseline, unprotected design.

defect and fault tolerance in vlsi and nanotechnology systems | 2014

A runtime manager for gracefully degrading SoCs

Stavros Tzilis; Ioannis Sourdis

The increasing number of transistors integrated on a single chip comes with the blessing of raw computational power and the curse of susceptibility to various kinds of faults. On top of increased defect densities, wearout effects mean that the testing verdict at fabrication time cannot be trusted throughout the chip lifetime. However, extra computational power presents the opportunity to build gracefully degrading MPSoCs. Reconfigurable components and flexible workloads, along with runtime support, enable MPSoCs to deal with permanent faults by degrading one or more system aspects, such as performance, energy efficiency and delivered functionality, instead of failing. In this manner, chip life is prolonged and safety is increased. In this work, Graceful Degradation (GD) is formulated as an optimization problem in the context of MPSoCs. As such, its possible solutions can be evaluated in a parameterizable and consistent manner. An attempt at a runtime solution for a heterogeneous 4-core SoC is made and the resulting GD manager is evaluated in terms of speed and accuracy, with a use case combining essential automotive tasks and non-essential additional features. On average, it is found to produce a solution 89% as good as the optimal, in 4.3 μs running on one core of a common modern CPU.
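
To make the idea of treating graceful degradation as an optimization problem more concrete, the following is a minimal sketch of one possible greedy heuristic. It is not the GD manager from the paper, whose formulation the abstract does not spell out. The `Task` fields, the single scalar `capacity`, and the value-per-load ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    load: int        # capacity units the task needs (hypothetical unit, must be > 0)
    value: int       # benefit of keeping the task running (hypothetical score)
    essential: bool  # essential tasks must always be kept


def degrade(tasks, capacity):
    """Pick the tasks to keep once faults have reduced capacity.

    Assumed greedy rule: keep every essential task, then add
    non-essential tasks in decreasing value-per-load order while they fit.
    """
    kept = [t for t in tasks if t.essential]
    used = sum(t.load for t in kept)
    if used > capacity:
        raise ValueError("essential tasks no longer fit")
    optional = sorted(
        (t for t in tasks if not t.essential),
        key=lambda t: t.value / t.load,
        reverse=True,
    )
    for t in optional:
        if used + t.load <= capacity:
            kept.append(t)
            used += t.load
    return kept


tasks = [
    Task("brake_control", load=2, value=10, essential=True),
    Task("radio", load=2, value=3, essential=False),
    Task("navigation", load=1, value=4, essential=False),
]
print([t.name for t in degrade(tasks, capacity=3)])
```

With a remaining capacity of 3 units, the sketch keeps the essential task and drops the radio in favour of navigation. A greedy rule like this is fast but not guaranteed optimal, which is the kind of speed-versus-quality trade-off the abstract evaluates.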

defect and fault tolerance in vlsi and nanotechnology systems | 2014

A probabilistic analysis of resilient reconfigurable designs

Alirad Malek; Stavros Tzilis; Danish Anis Khan; Ioannis Sourdis; Georgios Smaragdos; Christos Strydis

Reconfigurable hardware can be employed to tolerate permanent faults. Hardware components comprising a System-on-Chip can be partitioned into a handful of substitutable units interconnected with reconfigurable wires to allow isolation and replacement of faulty parts. This paper offers a probabilistic analysis of reconfigurable designs estimating for different fault densities the average number of fault-free components that can be constructed as well as the probability to guarantee a particular availability of components. Considering the area overheads of reconfigurability, we evaluate the resilience of various reconfigurable designs with different granularities. Based on this analysis, we conduct a comprehensive design-space exploration to identify the granularity mixes that maximize the fault-tolerance of a system. Our findings reveal that mixing fine-grain logic with a coarse-grain sparing approach tolerates up to 3× more permanent faults than component redundancy and 2× more than any other purely coarse-grain solution. Component redundancy is preferable at low fault densities, while coarse-grain and mixed-grain reconfigurability maximize availability at medium and high fault densities, respectively.
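
As a rough illustration of the two quantities the analysis estimates (the average number of fault-free components and the probability of guaranteeing a given availability), here is a minimal sketch under a strong simplifying assumption: each of `n` units is fault-free independently with the same probability `p_ok`. The paper's actual fault model and its treatment of area overheads are not given in the abstract, so the function names and the binomial model are assumptions.

```python
from math import comb


def expected_fault_free(n: int, p_ok: float) -> float:
    """Expected number of fault-free units among n independent units."""
    return n * p_ok


def prob_at_least(n: int, k: int, p_ok: float) -> float:
    """Probability that at least k of n independent units are fault-free
    (upper tail of a binomial distribution)."""
    return sum(
        comb(n, i) * p_ok**i * (1 - p_ok) ** (n - i)
        for i in range(k, n + 1)
    )


# Example: 4 spare-able units, each fault-free with probability 0.9.
print(expected_fault_free(4, 0.9))
print(prob_at_least(4, 3, 0.9))
```

A finer granularity raises `n` and the per-unit `p_ok`, but the added reconfigurable wiring also occupies area that can itself be hit by faults, which is why the abstract weighs area overheads when comparing granularities.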

compilers architecture and synthesis for embedded systems | 2016

Runtime management of adaptive MPSoCs for graceful degradation

Stavros Tzilis; Ioannis Sourdis; Vasileios Vasilikos; Dimitrios Rodopoulos; Dimitrios Soudris

In this paper we propose optimization algorithms for the runtime management of gracefully degradable adaptive MPSoCs. Assuring the reliability of all hardware components in a system becomes increasingly difficult. On top of the growing defect densities and rising complexity of conventional testing, wear-out effects may reduce the availability of on-chip resources during system lifetime. However, adaptability of modern MPSoCs can provide the means for permanent fault tolerance and graceful degradation via runtime system management. We have developed custom heuristics as well as tailored existing optimization techniques (simulated annealing and a genetic algorithm) to deliver a fast and efficient response to unpredictable loss of system resources. We have emulated the resulting runtime manager on the Intel Single-Chip Cloud Computer (SCC), an experimental chip multiprocessor developed by Intel Labs. Comparison of the different algorithms in terms of solution quality and response time, and the scaling of their response time with the size of problem input, indicate that our custom heuristics are faster by at least one order of magnitude, but simulated annealing and the genetic algorithm are more consistent in dealing with constraints on the allowed solutions, e.g. limited system reconfiguration time. All algorithms scale well, since their response time, in almost every case, grows sub-linearly with respect to the input size.

IEEE Micro | 2016

Resilient Chip Multiprocessors with Mixed-Grained Reconfigurability

Ioannis Sourdis; Danish Anis Khan; Alirad Malek; Stavros Tzilis; Georgios Smaragdos; Christos Strydis

This article presents a chip multiprocessor (CMP) design that mixes coarse- and fine-grained reconfigurability to increase core availability of safety-critical embedded systems in the presence of hard errors. The authors conducted a comprehensive design-space exploration to identify the granularity mixes that maximize CMP fault tolerance and minimize performance and energy overheads. The authors added fine-grained reconfigurable logic to a coarse-grained sparing approach. Their resulting design can tolerate 3 times more hard errors than core redundancy and 1.5 times more than any other purely coarse-grained solution.

ACM Transactions on Embedded Computing Systems | 2016

RQNoC: A Resilient Quality-of-Service Network-on-Chip with Service Redirection

Alirad Malek; Ioannis Sourdis; Stavros Tzilis; Yifan He; Gerard K. Rauwerda

In this article, we describe RQNoC, a service-oriented Network-on-Chip (NoC) resilient to permanent faults. We characterize the network resources based on the particular service that they support and, when faulty, bypass them, allowing the respective traffic class to be redirected. We propose two alternatives for service redirection, each having different advantages and disadvantages. The first one, Service Detour, uses longer alternative paths through resources of the same service to bypass faulty network parts, keeping traffic classes isolated. The second approach, Service Merge, uses resources of other services, providing shorter paths but allowing traffic classes to interfere with each other. The remaining network resources that are common for all services employ additional mechanisms for tolerating faults. Links tolerate faults using additional spare wires combined with a flit-shifting mechanism, and the router control is protected with Triple-Modular-Redundancy (TMR). The proposed RQNoC network designs are implemented in 65nm technology and evaluated in terms of performance, area, power consumption, and fault tolerance. Service Detour requires 9% more area and consumes 7.3% more power compared to a baseline network that is not fault-tolerant. Its packet latency and throughput are close to the fault-free performance at low fault densities, but fault tolerance and performance drop substantially for 8 or more network faults. Service Merge requires 22% more area and 27% more power than the baseline and has a 9% slower clock. Compared to a fault-free network, a Service Merge RQNoC with up to 32 faults has packet latency increased by up to 1.5 to 2.4× and throughput reduced to 70% or 50%. However, it delivers substantially better fault tolerance, having a mean network connectivity above 90% even with 32 network faults versus 41% for a Service Detour network. Combining Service Merge and Service Detour improves fault tolerance, further sustaining a higher number of network faults and reduced packet latency.
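
The Triple-Modular-Redundancy protection mentioned for the router control can be illustrated with a generic majority voter. This is a minimal software sketch of the standard TMR voting rule, not the RQNoC hardware design. The function name and the use of Python integers as bit vectors are assumptions for illustration.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replicated outputs.

    Each result bit takes the value that at least two of the three
    replicas agree on, so a fault confined to one replica is masked.
    """
    return (a & b) | (a & c) | (b & c)


# Two healthy replicas output 0b1010; the third is corrupted.
print(bin(tmr_vote(0b1010, 0b1010, 0b0110)))
```

The voter masks any fault confined to a single replica, but it cannot correct a bit position where two replicas are wrong in the same way.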

digital systems design | 2012

The DeSyRe Project: On-Demand System Reliability

Ioannis Sourdis; Christos Strydis; Christos-Savvas Bouganis; Babak Falsafi; Georgi Gaydadjiev; Alirad Malek; R. Mariani; Dionisios N. Pnevmatikatos; Dhiraj K. Pradhan; Gerard K. Rauwerda; Kim Sunesen; Stavros Tzilis

international symposium on parallel and distributed processing and applications | 2014

The DeSyRe Runtime Support for Fault-Tolerant Embedded MPSoCs

Dionisios N. Pnevmatikatos; Stavros Tzilis; Ioannis Sourdis

Semiconductor technology scaling makes chips more sensitive to faults. This paper describes the DeSyRe design approach and its runtime management for future reliable embedded Multiprocessor Systems-on-Chip (MPSoCs). A lightweight runtime system is described for shared-memory MPSoCs to support fault-tolerant execution upon detection of transient and permanent faults. The DeSyRe runtime system offers re-execution of tasks that suffer from transient faults and task migration in cases where a worker processor is permanently faulty. In addition, a faulty worker can potentially remain usable, increasing the system's fault tolerance. This is achieved using alternative task implementations, which avoid the faulty circuit and are indicated in the application code via pragma annotations, as well as by repairing a faulty core via hardware reconfiguration. Thereby, the system can be dynamically adapted using one or more of the above mechanisms to mitigate faults. The DeSyRe runtime system is evaluated using micro-benchmarks running on a Virtex-6 FPGA MPSoC. Results suggest that our enhanced fault-tolerant runtime system can successfully and efficiently execute all application tasks under a variety of fault cases.

international conference on parallel processing | 2017

SWAS: Stealing Work Using Approximate System-Load Information

Stavros Tzilis; Miquel Pericàs; Pedro Trancoso; Ioannis Sourdis

This paper explores the potential of utilizing approximate system load information to enhance work stealing for dynamic load balancing in hierarchical multicore systems. Maintaining information about the load of a system has not been extensively researched since it is assumed to introduce performance overheads. We propose SWAS, a lightweight approximate scheme for retrieving and using such information, based on compact bit vector structures and lightweight update operations. This approximate information is used to enhance the effectiveness of work stealing decisions. Evaluating SWAS for a number of representative scenarios on a multi-socket multi-core platform showed that work stealing guided by approximate system load information achieves considerable performance improvements: up to 18.5% for dynamic, severely imbalanced workloads; and up to 34.4% for workloads with complex task dependencies, when compared with random work stealing.
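
To show how a compact bit vector could guide work stealing, here is a minimal sketch. It is not the SWAS data structure, whose layout and update rules the abstract does not describe. The class name, the one-bit-per-worker encoding, and the victim-selection rule are assumptions for illustration.

```python
import random


class LoadBitVector:
    """One bit per worker: 1 means 'probably has stealable work'.

    The information is approximate by design: bits are updated with
    cheap operations and may be stale by the time a thief reads them.
    """

    def __init__(self, n_workers: int):
        self.n = n_workers
        self.bits = 0

    def mark_loaded(self, worker: int) -> None:
        self.bits |= 1 << worker

    def mark_idle(self, worker: int) -> None:
        self.bits &= ~(1 << worker)

    def pick_victim(self, thief: int, rng=random):
        """Choose a victim among workers flagged as loaded, or None."""
        candidates = [
            w for w in range(self.n)
            if w != thief and (self.bits >> w) & 1
        ]
        if not candidates:
            return None  # caller could fall back to random stealing
        return rng.choice(candidates)


loads = LoadBitVector(4)
loads.mark_loaded(2)
print(loads.pick_victim(thief=0))
```

Compared with picking a victim uniformly at random, a thief consulting such a vector avoids probing workers that are flagged as idle, at the cost of occasionally acting on stale bits.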
