
Publication


Featured research published by Helia Naeimi.


IEEE Design & Test of Computers | 2005

Seven strategies for tolerating highly defective fabrication

André DeHon; Helia Naeimi

In this article, we present an architecture that supports fine-grained sparing and resource matching. The base logic structure is a set of interconnected PLAs. The PLAs and their interconnections consist of large arrays of interchangeable nanowires, which serve as programmable product and sum terms and as programmable interconnect links. Each nanowire can have several defective programmable junctions. We can test nanowires for functionality and use only the subset that provides appropriate conductivity and electrical characteristics. We then perform a matching between nanowire junction programmability and application logic needs to use almost all the nanowires even though most of them have defective junctions. We employ seven high-level strategies to achieve this level of defect tolerance.
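
As a rough illustration of the resource-matching idea, the sketch below greedily assigns product terms to nanowires whose functional junctions cover each term's required inputs. It is a simplified model written for this summary, not the article's implementation; the term and wire names, the data layout, and the greedy ordering are assumptions.

# Illustrative sketch (assumed data model, not the article's algorithm):
# greedily match product terms to nanowires whose working junctions cover
# each term's required input positions.

def match_terms_to_nanowires(terms, nanowires):
    """terms: dict term_name -> set of required input positions.
    nanowires: dict wire_name -> set of functional junction positions.
    Returns a dict term_name -> wire_name, or None if some term cannot be placed."""
    assignment = {}
    free_wires = dict(nanowires)
    # Place the most demanding terms first so they get the scarcer wires.
    for term, needed in sorted(terms.items(), key=lambda kv: -len(kv[1])):
        # Among wires that cover the term, prefer the tightest fit.
        candidates = [(len(junctions - needed), wire)
                      for wire, junctions in free_wires.items()
                      if needed <= junctions]
        if not candidates:
            return None  # this term cannot be placed on the remaining wires
        _, wire = min(candidates)
        assignment[term] = wire
        del free_wires[wire]
    return assignment

# Hypothetical example: term "t1" needs junctions {0, 2}; each wire lists its good junctions.
terms = {"t1": {0, 2}, "t2": {1}}
wires = {"w1": {0, 1, 2}, "w2": {0, 2, 3}, "w3": {1, 3}}
print(match_terms_to_nanowires(terms, wires))  # {'t1': 'w1', 't2': 'w3'}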


Field-Programmable Custom Computing Machines | 2004

Design patterns for reconfigurable computing

André DeHon; Joshua Adams; Michael deLorimier; Nachiket Kapre; Yuki Matsuda; Helia Naeimi; Michael C. Vanier; Michael G. Wrighton

It is valuable to identify and catalog design patterns for reconfigurable computing. These design patterns are canonical solutions to common and recurring design challenges which arise in reconfigurable systems and applications. The catalog can form the basis for creating designs, for educating new designers, for understanding the needs of tools and languages, and for discussing reconfigurable design. Tying application and implementation lessons to the expansion and refinement of this catalog makes those lessons more relevant to the design community. In this paper, we articulate this role for design patterns in reconfigurable computing, provide a few example patterns, offer a starting point for the contents of the catalog, and discuss the potential benefits of this effort.


Field-Programmable Technology | 2004

A greedy algorithm for tolerating defective crosspoints in nanoPLA design

Helia Naeimi; André DeHon

Recent developments suggest both plausible fabrication techniques and viable architectures for building sublithographic programmable logic arrays using molecular-scale wires and switches. Designs at this scale will see much higher defect rates than in conventional lithography. However, these defects need not be an impediment to programmable logic design at this scale. We introduce a strategy for tolerating defective crosspoints and develop a linear-time, greedy algorithm for mapping PLA logic around crosspoint defects. We note that P-term fanin must be bounded to guarantee low overhead mapping and develop analytical guidelines for bounding fanin. We further quantify analytical and empirical mapping overhead rates. Including fanin bounding, our greedy mapping algorithm maps a large set of benchmark designs with 13% average overhead for random junction defect rates as high as 20%.
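
The need to bound P-term fanin can be seen with a simple back-of-the-envelope model. The sketch below assumes independent crosspoint defects at rate p, so a term with fanin f maps onto a random nanowire only if all f required junctions are functional; it illustrates the scaling only and is not the paper's analytical derivation or its greedy mapper.

# Back-of-the-envelope sketch (assumed independent-defect model, not the
# paper's analysis): probability that a random nanowire can host a P-term.

p = 0.20                       # crosspoint defect rate, as in the abstract
for fanin in (4, 8, 16, 32):
    prob_ok = (1 - p) ** fanin          # all required junctions must be good
    expected_tries = 1 / prob_ok        # expected wires examined (geometric)
    print(f"fanin={fanin:2d}  P(wire usable)={prob_ok:.4f}  "
          f"expected wires tried={expected_tries:.1f}")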


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2009

Fault Secure Encoder and Decoder for NanoMemory Applications

Helia Naeimi; André DeHon

Memory cells have been protected from soft errors for more than a decade; due to the increase in soft error rate in logic circuits, the encoder and decoder circuitry around the memory blocks have become susceptible to soft errors as well and must also be protected. We introduce a new approach to design fault-secure encoder and decoder circuitry for memory designs. The key novel contribution of this paper is identifying and defining a new class of error-correcting codes whose redundancy makes the design of fault-secure detectors (FSD) particularly simple. We further quantify the importance of protecting encoder and decoder circuitry against transient errors, illustrating a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder. We prove that Euclidean geometry low-density parity-check (EG-LDPC) codes have the fault-secure detector capability. Using some of the smaller EG-LDPC codes, we can tolerate bit or nanowire defect rates of 10% and fault rates of 10⁻¹⁸ upsets/device/cycle, achieving a FIT rate at or below one for the entire memory system and a memory density of 10¹¹ bit/cm² with nanowire pitch of 10 nm for memory blocks of 10 Mb or larger. Larger EG-LDPC codes can achieve even higher reliability and lower area overhead.
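
For readers unfamiliar with syndrome-based detection, the toy sketch below flags an error whenever the parity-check syndrome of a received word is nonzero. The matrix is a small Hamming-style check chosen only for brevity; it is not an EG-LDPC code, and the fault-secure property of the paper's detectors is not modeled here.

# Minimal sketch of syndrome-based error detection (illustrative only; this is
# a tiny Hamming-style parity-check matrix, not an actual EG-LDPC code).

import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])   # parity-check matrix H

def detect(received):
    """Return True if the syndrome H·r (mod 2) is nonzero, i.e. an error is detected."""
    syndrome = H.dot(received) % 2
    return bool(syndrome.any())

codeword = np.array([1, 0, 1, 1, 0, 1, 0])   # a valid codeword for this H
assert not detect(codeword)
corrupted = codeword.copy()
corrupted[3] ^= 1                            # flip one bit
assert detect(corrupted)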


Architectural Support for Programming Languages and Operating Systems | 2012

Relyzer: exploiting application-level fault equivalence to analyze application resiliency to transient faults

Siva Kumar Sastry Hari; Sarita V. Adve; Helia Naeimi

Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that several faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application's execution (application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. Such studies do not provide any feedback on the portions of the application where faults were not injected. Some of those instructions may be vulnerable to SDCs, and identifying them could allow protecting them through other means if needed. This paper presents Relyzer, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections for transient faults. Relyzer employs novel fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them to be equivalent to other faults. We find that Relyzer prunes about 99.78% of the total faults across the twelve applications studied here, reducing the faults that require detailed simulation by 3 to 5 orders of magnitude for most of the applications. Fault injection simulations on the remaining faults can identify SDC-causing faults in the entire application. Some of Relyzer's techniques rely on heuristics to determine fault equivalence. Our validation efforts show that Relyzer determines fault outcomes with 96% accuracy, averaged across all the applications studied here.
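
A highly simplified view of equivalence-based pruning is to group fault sites by a signature and inject into only one representative per group. The sketch below is written for this summary; the grouping key, site descriptors, and function names are assumptions, and Relyzer's actual heuristics rely on program structure and control/data flow, which are not modeled here.

from collections import defaultdict

def prune_fault_sites(fault_sites, signature):
    """Group fault sites into equivalence classes and keep one representative each.
    fault_sites: iterable of fault-site descriptors.
    signature: function mapping a site to its (assumed) equivalence-class key."""
    classes = defaultdict(list)
    for site in fault_sites:
        classes[signature(site)].append(site)
    representatives = [sites[0] for sites in classes.values()]
    class_sizes = {key: len(sites) for key, sites in classes.items()}
    return representatives, class_sizes

# Hypothetical fault sites: (instruction PC, bit position, control-flow path id)
sites = [(0x400A10, 3, "pathA"), (0x400A10, 3, "pathA"),
         (0x400A10, 7, "pathA"), (0x400B20, 3, "pathB")]
reps, sizes = prune_fault_sites(sites, signature=lambda s: (s[0], s[1], s[2]))
print(len(sites), "sites reduced to", len(reps), "injections")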


Dependable Systems and Networks | 2012

Low-cost program-level detectors for reducing silent data corruptions

Siva Kumar Sastry Hari; Sarita V. Adve; Helia Naeimi

With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for the six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are respectively 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
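
The flavor of such a detector is sketched below: a single range check on an accumulated quantity after a loop, so the detection cost is amortized over many iterations. The bounds, variable names, and check placement are assumptions made for illustration; the paper derives its detectors from the applications it analyzes.

def checked_accumulate(values, lower_bound, upper_bound):
    """Sum the inputs and run one amortized sanity check on the accumulator."""
    total = 0.0
    for v in values:
        total += v
    # One check for the whole loop: flag values a transient fault could not
    # plausibly produce, instead of duplicating every addition.
    if not (lower_bound <= total <= upper_bound):
        raise RuntimeError("detector fired: accumulator out of expected range")
    return total

print(checked_accumulate([0.5] * 100, lower_bound=0.0, upper_bound=1000.0))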


International Test Conference | 2010

QED: Quick Error Detection tests for effective post-silicon validation

Ted Hong; Yanjing Li; Sung-Boem Park; Diana Mui; David Lin; Ziyad Abdel Kaleq; Nagib Hakim; Helia Naeimi; Donald S. Gardner; Subhasish Mitra

Long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as a system-level failure, is a major challenge in post-silicon validation of robust systems. In this paper, we present a new technique called Quick Error Detection (QED), which transforms existing post-silicon validation tests into new validation tests that significantly reduce error detection latency. QED transformations allow flexible tradeoffs between error detection latency, coverage, and complexity, and can be implemented in software with little or no hardware changes. Results obtained from hardware experiments on quad-core Intel® Core™ i7 hardware platforms and from simulations on a multi-core MIPS processor design demonstrate that: 1. QED significantly improves error detection latencies by six orders of magnitude, i.e., from billions of cycles to a few thousand cycles or less. 2. QED transformations do not degrade the coverage of validation tests as estimated empirically by measuring the maximum operating frequencies over a wide range of operating voltage points. 3. QED tests improve coverage by detecting errors that escape the original non-QED tests.
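
Conceptually, one family of QED transformations duplicates a test's computation in software and compares the two copies at short intervals, so an error surfaces within a bounded window rather than only at the end of the test. The sketch below illustrates that idea only; the kernel, the check interval, and the function names are assumptions and do not reproduce the paper's specific transformations.

def kernel(x):
    # Stand-in for one step of a validation test's computation (assumed).
    return (x * 2654435761) & 0xFFFFFFFF

def qed_style_test(inputs, check_interval=16):
    main = shadow = 0
    for i, x in enumerate(inputs):
        main ^= kernel(x)       # original computation
        shadow ^= kernel(x)     # duplicated computation
        # Frequent comparison keeps error detection latency short.
        if i % check_interval == 0 and main != shadow:
            return f"error detected at iteration {i}"
    return "pass" if main == shadow else "error detected at end of test"

print(qed_style_test(range(1000)))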


Design, Automation, and Test in Europe | 2010

Design techniques for cross-layer resilience

Nicholas P. Carter; Helia Naeimi; Donald S. Gardner

Current electronic systems implement reliability using only a few layers of the system stack, which simplifies the design of other layers but is becoming increasingly expensive over time. In contrast, cross-layer resilient systems, which distribute the responsibility for tolerating errors, device variation, and aging across the system stack, have the potential to provide the resilience required to implement reliable, high-performance, low-power systems in future fabrication processes at significantly lower cost. These systems can implement less-frequent resilience tasks in software to save power and chip area, can tune their reliability guarantees to the needs of applications, and can use the information available at each level in the system stack to optimize performance and power consumption. In this paper, we outline an approach to cross-layer system design that describes resilience as a set of tasks that systems must perform in order to detect and tolerate errors and variation. We then present strawman examples of how this task-based design process could be used to implement general-purpose computing and SoC systems, drawing on previous work and identifying key areas for future research.


Defect and Fault Tolerance in VLSI and Nanotechnology Systems | 2007

Fault Secure Encoder and Decoder for Memory Applications

Helia Naeimi; André DeHon

We introduce a reliable memory system that can tolerate multiple transient errors in the memory words as well as transient errors in the encoder and decoder (corrector) circuitry. The key novel development is the fault-secure detector (FSD) error-correcting code (ECC) definition and associated circuitry that can detect errors in the received encoded vector despite experiencing multiple transient faults in its circuitry. The structure of the detector is general enough that it can be used for any ECC that follows our FSD-ECC definition. We prove that two known classes of low-density parity-check codes have the FSD-ECC property: Euclidean Geometry and Projective Geometry codes. We identify a specific FSD-LDPC code that can tolerate up to 33 errors in each memory word or its supporting logic while requiring only 30% area overhead for memory blocks of 10 Kbits or larger. Larger codes can achieve even higher reliability and lower area overhead. We quantify the importance of protecting encoder and decoder (corrector) circuitry and illustrate a scenario where the system failure rate (FIT) is dominated by the failure rate of the encoder and decoder.
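
The FIT-domination scenario can be made concrete with a small, purely hypothetical calculation: once the memory words themselves are ECC-protected, even a modest failure rate in unprotected encoder and corrector logic can account for most of the system FIT. All numbers below are invented for illustration and are not taken from the paper.

# Hypothetical numbers (not from the paper); FIT = failures per 10^9 device-hours.
fit_protected_array = 0.05     # residual FIT of the ECC-protected memory words
fit_unprotected_logic = 2.0    # FIT of one unprotected encoder or corrector block
logic_blocks = 4               # assumed encoder + corrector instances

fit_system = fit_protected_array + logic_blocks * fit_unprotected_logic
logic_share = logic_blocks * fit_unprotected_logic / fit_system
print(f"system FIT = {fit_system:.2f}, encoder/decoder share = {logic_share:.0%}")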


International Symposium on Computer Architecture | 2014

GangES: gang error simulation for hardware resiliency evaluation

Siva Kumar Sastry Hari; Radha Venkatagiri; Sarita V. Adve; Helia Naeimi

As technology scales, the hardware reliability challenge affects a broad computing market, rendering traditional redundancy-based solutions too expensive. Software anomaly-based hardware error detection has emerged as a low-cost reliability solution, but suffers from Silent Data Corruptions (SDCs). It is crucial to accurately evaluate SDC rates and identify SDC-producing software locations to develop software-centric low-cost hardware resiliency solutions. A recent tool, called Relyzer, systematically analyzes an entire application's resiliency to single-bit soft errors using a small set of carefully selected error injection sites. Relyzer provides a practical resiliency evaluation mechanism but still requires significant evaluation time, most of which is spent on error simulations. This paper presents a new technique called GangES (Gang Error Simulator) that aims to reduce error simulation time. GangES observes that a set or gang of error simulations that result in the same intermediate execution state (after their error injections) will produce the same error outcome; therefore, only one simulation of the gang needs to be completed, resulting in significant overall savings in error simulation time. GangES leverages program structure to carefully select when to compare simulations and what state to compare. For our workloads, GangES saves 57% of the total error simulation time with an overhead of just 1.6%. This paper also explores pure program-analysis-based techniques that could obviate the need for tools such as GangES altogether. The availability of Relyzer+GangES allows us to perform a detailed evaluation of such techniques. We evaluate the accuracy of several previously proposed program metrics. We find that the metrics we considered and their various linear combinations are unable to adequately predict an instruction's vulnerability to SDCs, further motivating the use of Relyzer+GangES style techniques as valuable solutions for the hardware error resiliency evaluation problem.
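
The core observation maps naturally onto a grouping step, sketched below in simplified form: error simulations are bucketed by their intermediate execution state after injection, and only one member of each gang is run to completion. The state hashing, checkpoint choice, and function names are assumptions; GangES's actual comparison points come from program structure, which is not modeled here.

def gang_error_simulation(error_sims, run_to_checkpoint, run_to_completion):
    """error_sims: list of error-injection runs.
    run_to_checkpoint(sim) -> hashable intermediate state after injection.
    run_to_completion(sim) -> outcome, e.g. 'masked', 'detected', or 'SDC'."""
    gangs = {}
    for sim in error_sims:
        gangs.setdefault(run_to_checkpoint(sim), []).append(sim)
    outcomes = {}
    for state, members in gangs.items():
        result = run_to_completion(members[0])   # finish only one run per gang
        for sim in members:
            outcomes[sim] = result               # share the outcome with the gang
    return outcomes

# Hypothetical usage with stubbed-out simulators:
sims = ["inj1", "inj2", "inj3"]
print(gang_error_simulation(
    sims,
    run_to_checkpoint=lambda s: "stateA" if s != "inj3" else "stateB",
    run_to_completion=lambda s: "SDC" if s == "inj3" else "masked"))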

Collaboration


Dive into Helia Naeimi's collaborations.

Top Co-Authors


André DeHon

University of Pennsylvania


Nitin Rathi

University of South Florida
