Alirad Malek
Chalmers University of Technology
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Alirad Malek.
Microprocessors and Microsystems | 2013
Ioannis Sourdis; Christos Strydis; Antonino Armato; Christos-Savvas Bouganis; Babak Falsafi; Georgi Gaydadjiev; Sebastian Isaza; Alirad Malek; R. Mariani; Dionisios N. Pnevmatikatos; Dhiraj K. Pradhan; Gerard K. Rauwerda; Robert M. Seepers; Rishad Ahmed Shafik; Kim Sunesen; Dimitris Theodoropoulos; Stavros Tzilis; Michalis Vavouras
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-/fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
international parallel and distributed processing symposium | 2014
Georgios Smaragdos; Danish Anis Khan; Ioannis Sourdis; Christos Strydis; Alirad Malek; Stavros Tzilis
Recent trends in semiconductor technology have dictated the constant reduction of device size. One negative effect stemming from the reduction in size and increased complexity is the reduced device reliability. This paper is centered around the matter of permanent fault tolerance and graceful system degradation in the presence of permanent faults. We take advantage of the natural redundancy of homogeneous multicores following a sparing strategy to reuse functional pipeline stages of faulty cores. This is done by incorporating reconfigurable interconnects next to which the cores of the system are placed, providing the flexibility to redirect the data-flow from the faulty pipeline stages of damaged cores to spare (still) functional ones. Several micro-architectural changes are introduced to decouple the processor stages and allow them to be interchangeable. The proposed approach is a clear departure from previous ones by offering full flexibility as well as highly graceful performance degradation at reasonable costs. More specifically, our coarsegrain fault tolerant multicore array provides up to ×4 better availability compared to a conventional multicore and up to ×2 higher probability to deliver at least one functioning core in high fault densities. For our benchmarks, our design (synthesized for STM 65nm SP technology) incurs a total execution-time overhead for the complete system ranging from ×1.37 to ×3.3 compared to a (baseline) non-fault-tolerant system, depending on the permanent-fault density. The area overhead is 19.5% and the energy consumption, without incorporating any power/energy- saving technique, is estimated on average to be 20.9% higher compared to the baseline, unprotected design.
defect and fault tolerance in vlsi and nanotechnology systems | 2014
Alirad Malek; Stavros Tzilis; Danish Anis Khan; Ioannis Sourdis; Georgios Smaragdos; Christos Strydis
Reconfigurable hardware can be employed to tolerate permanent faults. Hardware components comprising a System-on-Chip can be partitioned into a handful of substitutable units interconnected with reconfigurable wires to allow isolation and replacement of faulty parts. This paper offers a probabilistic analysis of reconfigurable designs estimating for different fault densities the average number of fault-free components that can be constructed as well as the probability to guarantee a particular availability of components. Considering the area overheads of reconfigurability, we evaluate the resilience of various reconfigurable designs with different granularities. Based on this analysis, we conduct a comprehensive design-space exploration to identify the granularity mixes that maximize the fault-tolerance of a system. Our findings reveal that mixing fine-grain logic with a coarse-grain sparing approach tolerates up to 3× more permanent faults than component redundancy and 2× more than any other purely coarse-grain solution. Component redundancy is preferable at low fault densities, while coarse-grain and mixed-grain reconfigurability maximize availability at medium and high fault densities, respectively.
IEEE Micro | 2016
Ioannis Sourdis; Danish Anis Khan; Alirad Malek; Stavros Tzilis; Georgios Smaragdos; Christos Strydis
This article presents a chip multiprocessor (CMP) design that mixes coarse- and fine-grained reconfigurability to increase core availability of safety-critical embedded systems in the presence of hard errors. The authors conducted a comprehensive design-space exploration to identify the granularity mixes that maximize CMP fault tolerance and minimize performance and energy overheads. The authors added fine-grained reconfigurable logic to a coarse-grained sparing approach. Their resulting design can tolerate 3 times more hard errors than core redundancy and 1.5 times more than any other purely coarse-grained solution.
ACM Transactions in Embedded Computing Systems | 2016
Alirad Malek; Ioannis Sourdis; Stavros Tzilis; Y Yifan He; Gerard K. Rauwerda
In this article, we describe RQNoC, a service-oriented Network-on-Chip (NoC) resilient to permanent faults. We characterize the network resources based on the particular service that they support and, when faulty, bypass them, allowing the respective traffic class to be redirected. We propose two alternatives for service redirection, each having different advantages and disadvantages. The first one, Service Detour, uses longer alternative paths through resources of the same service to bypass faulty network parts, keeping traffic classes isolated. The second approach, Service Merge, uses resources of other services providing shorter paths but allowing traffic classes to interfere with each other. The remaining network resources that are common for all services employ additional mechanisms for tolerating faults. Links tolerate faults using additional spare wires combined with a flit-shifting mechanism, and the router control is protected with Triple-Modular-Redundancy (TMR). The proposed RQNoC network designs are implemented in 65nm technology and evaluated in terms of performance, area, power consumption, and fault tolerance. Service Detour requires 9% more area and consumes 7.3% more power compared to a baseline network, not tolerant to faults. Its packet latency and throughput is close to the fault-free performance at low-fault densities, but fault tolerance and performance drop substantially for 8 or more network faults. Service Merge requires 22% more area and 27% more power than the baseline and has a 9% slower clock. Compared to a fault-free network, a Service Merge RQNoC with up to 32 faults has increased packet latency up to 1.5 to 2.4× and reduced throughput to 70% or 50%. However, it delivers substantially better fault tolerance, having a mean network connectivity above 90% even with 32 network faults versus 41% of a Service Detour network. Combining Serve Merge and Service Detour improves fault tolerance, further sustaining a higher number of network faults and reduced packet latency.
digital systems design | 2012
Ioannis Sourdis; Christos Strydis; Christos-Savvas Bouganis; Babak Falsafi; Georgi Gaydadjiev; Alirad Malek; R. Mariani; Dionisios N. Pnevmatikatos; Dhiraj K. Pradhan; Gerard K. Rauwerda; Kim Sunesen; Stavros Tzilis
The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-/fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe will deliver a new generation of systems that are reliable by design at well-balanced power, performance, and design costs.
Proceedings of the International Symposium on Memory Systems | 2017
Alirad Malek; Evangelos Vasilakis; Vasileios Papaefstathiou; Pedro Trancoso; Ioannis Sourdis
An application may have different sensitivity to faults in different subsets of the data it uses. Some data regions may therefore be more critical than others. Capitalizing on this observation, Odd-ECC provides a mechanism to dynamically select the memory fault tolerance of each allocated page of a program on demand depending on the criticality of the respective data. Odd-ECC error correcting codes (ECCs) are stored in separate physical pages and hidden by the OS as pages unavailable to the user. Still, these ECCs are physically aligned with the data they protect so the memory controller can efficiently access them. Thereby, capacity, performance and energy overheads of memory fault tolerance are proportional to the criticality of the data stored. Odd-ECC is applied to memory systems that use conventional 2D DRAM DIMMs as well as to 3D-stacked DRAMs and evaluated using various applications. Compared to flat memory protection schemes, Odd-ECC substantially reduces ECCs capacity overheads while achieving the same Mean Time to Failure (MTTF) and in addition it slightly improves performance and energy costs. Under the same capacity constraints, Odd-ECC achieves substantially higher MTTF, compared to a flat memory protection. This comes at a performance and energy cost, which is however still a fraction of the cost introduced by a flat equally strong scheme.
defect and fault tolerance in vlsi and nanotechnology systems | 2015
Alirad Malek; Stavros Tzilis; Danish Anis Khan; Ioannis Sourdis; Georgios Smaragdos; Christos Strydis
Permanent faults on a chip are often tolerated using spare resources. In the past, sparing has been applied to Chip Multiprocessors (CMPs) at various granularities of substitutable units (SUs). Entire processors, pipeline stages or even individual functional units are isolated when faulty and replaced by spare ones using flexible, reconfigurable interconnects. Although spare resources increase systems fault tolerance, the extra delay imposed by the reconfigurable interconnects limits performance. In this paper, we study two options for dealing with this delay: (i) pipelining the reconfigurable interconnects and (ii) scaling down operating frequency. The former keeps a frequency close to the one of the baseline processor, but increases the number of cycles required for executing a program. The latter maintains the number of execution cycles constant, but requires a slower clock. We investigate the above performance tradeoff using an adaptive 4-core CMP design with substitutable pipeline stages. We retrieve post place and route results of different designs running two sets of benchmarks and evaluate their performance. Our experiments indicate that adding reconfigurable interconnects for wiring the SUs of a 4-core CMP pose significant delay increasing the critical path of the design almost by 3.5 times. On the other hand, pipelining the reconfigurable interconnects increases cycle time by 41% and - depending on the processor configuration - reduces performance overhead to 1.4-2.9× the execution time of the baseline.
applied reconfigurable computing | 2014
Ioannis Sourdis; Christos Strydis; Antonino Armato; Christos-Savvas Bouganis; Babak Falsafi; Georgi Gaydadjiev; Sebastian Isaza; Alirad Malek; R. Mariani; Samuel Pagliarini; Dionisios N. Pnevmatikatos; Dhiraj K. Pradhan; Gerard K. Rauwerda; Robert M. Seepers; Rishad Ahmed Shafik; Georgios Smaragdos; Dimitris Theodoropoulos; Stavros Tzilis; Michalis Vavouras
The DeSyRe project builds on-demand adaptive, reliable Systems-on-Chips. In response to the current semiconductor technology trends thatmake chips becoming less reliable, DeSyRe describes a newgeneration of by design reliable systems, at a reduced power and performance cost. This is achieved through the following main contributions. DeSyRe defines a fault-tolerant system architecture built out of unreliable components, rather than aiming at totally fault-free and hence more costly chips. In addition, DeSyRe systems are on-demand adaptive to various types and densities of faults, as well as to other system constraints and application requirements. For leveraging on-demand adaptation/customization and reliability at reduced cost, a new dynamically reconfigurable substrate is designed and combined with runtime system software support. The above define a generic and repeatable design framework, which is applied to two medical SoCs with high reliability constraints and diverse performance and power requirements. One of the main goals of the DeSyRe project is to increase the availability of SoC components in the presence of permanents faults, caused at manufacturing time or due to device aging. A mix of coarse- and fine-grain reconfigurable hardware substrate is designed to isolate and bypass faulty component parts. The flexibility provided by the DeSyRe reconfigurable substrate is exploited at runtime by system optimization heuristics,which decide tomodify component configurationwhen a permanent fault is detected, providing graceful degradation.
Archive | 2017
Alirad Malek