Publication


Featured research published by Kypros Constantinides.


high-performance computer architecture | 2006

BulletProof: a defect-tolerant CMP switch architecture

Kypros Constantinides; Stephen M. Plaza; Jason A. Blome; Bin Zhang; Valeria Bertacco; Scott A. Mahlke; Todd M. Austin; Michael Orshansky

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naive triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.
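
To illustrate the kind of reliability-versus-area reasoning the abstract describes, the following minimal sketch builds a bathtub-curve hazard rate from infant-mortality, constant random-failure, and wear-out terms and compares a simplex module against naive triple-modular redundancy. All rate constants and the comparison are illustrative assumptions, not the paper's model or results.

```python
# A minimal sketch, assuming made-up failure rates: bathtub-curve hazard and the
# survival probability of an unprotected module vs. naive TMR (roughly 3x area).
import math

def hazard(t, infant=0.02, random_rate=0.001, wearout=1e-4):
    """Illustrative bathtub hazard rate lambda(t); all constants here are made up."""
    return infant * math.exp(-t / 10.0) + random_rate + wearout * t

def survival(t, steps=1000):
    """R(t) = exp(-integral of lambda from 0 to t), trapezoidal integration."""
    dt = t / steps
    integral = sum((hazard(i * dt) + hazard((i + 1) * dt)) * dt / 2 for i in range(steps))
    return math.exp(-integral)

def tmr_survival(t):
    """A TMR module works while at least 2 of its 3 copies work (voter assumed perfect)."""
    r = survival(t)
    return r ** 3 + 3 * r ** 2 * (1 - r)

for t in (10, 50, 100):
    print(f"t={t:3}: simplex R={survival(t):.4f}   TMR (3x area) R={tmr_survival(t):.4f}")
```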


architectural support for programming languages and operating systems | 2006

Ultra low-cost defect protection for microprocessor pipelines

Smitha Shyam; Kypros Constantinides; Sujay Phadke; Valeria Bertacco; Todd M. Austin

The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach utilizes a microarchitectural checkpointing mechanism which creates coarse-grained epochs of execution, during which distributed on-line built-in self-test (BIST) mechanisms validate the integrity of the underlying hardware. In case a failure is detected, we rely on the natural redundancy of instruction-level parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) with little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only a moderate performance impact (from 4% to 18% slowdown).
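
The epoch-based control flow in this abstract can be summarized in a few lines. The sketch below is a software model of that flow only; the `cpu` object and its methods are hypothetical stand-ins, not the BulletProof hardware interface.

```python
def run_epoch(cpu, units):
    """One epoch in the style described above: execute speculatively, then self-test.
    `cpu` and its snapshot/execute_epoch/bist/restore methods are hypothetical."""
    checkpoint = cpu.snapshot()                        # architectural state at epoch start
    cpu.execute_epoch(units)                           # normal execution for one epoch
    failed = {u for u in units if not cpu.bist(u)}     # distributed on-line BIST sweep
    if failed:
        cpu.restore(checkpoint)                        # discard the possibly tainted epoch
        units -= failed                                # map out broken units: degraded mode
    return units                                       # epoch commits implicitly if BIST passed
```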


international symposium on microarchitecture | 2007

Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

Kypros Constantinides; Onur Mutlu; Todd M. Austin; Valeria Bertacco

As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called access-control extension (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.
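
A firmware-driven test pass of the kind described here can be sketched as a loop over directed test patterns. The scan-access calls below merely stand in for the paper's ACE instructions; their names, the test-vector format, and the repair hook are illustrative assumptions.

```python
# A minimal sketch, assuming a hypothetical `core` scan-access interface,
# of periodic firmware-driven testing with defect diagnosis and repair.
def run_ace_test_pass(core, test_vectors, golden_responses, repair):
    core.suspend()                              # pause normal execution at a safe point
    defective_segments = []
    for segment, vector in test_vectors.items():
        core.ace_set(segment, vector)           # load a directed test pattern into state
        core.ace_clock(1)                       # exercise the hardware under test
        response = core.ace_get(segment)        # read the resulting state back out
        if response != golden_responses[segment]:
            defective_segments.append(segment)  # defect diagnosed and located here
    if defective_segments:
        repair(defective_segments)              # e.g. disable or reconfigure resources
    core.resume()                               # return to the interrupted workload
    return defective_segments
```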


high-performance computer architecture | 2007

Perturbation-based Fault Screening

Paul B. Racunas; Kypros Constantinides; Srilatha Manne; Shubhendu S. Mukherjee

Fault screeners are a new breed of fault identification technique that can probabilistically detect if a transient fault has affected the state of a processor. We demonstrate that fault screeners function because of two key characteristics. First, we show that much of the intermediate data generated by a program inherently falls within certain consistent bounds. Second, we observe that these bounds are often violated by the introduction of a fault. Thus, fault screeners can identify faults by directly watching for any data inconsistencies arising in an application's behavior. We present an idealized algorithm capable of identifying over 85% of injected faults on the SpecInt suite and over 75% overall. Further, in a realistic implementation on a simulated Pentium-III-like processor, about half of the errors due to injected faults are identified while still in speculative state. Errors detected this early can be eliminated by a pipeline flush. In this paper, we present several hardware-based versions of this screening algorithm and show that flushing the pipeline every time the hardware screener triggers reduces overall performance by less than 1%.
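
The bounds-based screening idea can be modeled very simply in software: learn the range of values each static instruction produces and flag results that fall outside it. This is an idealized sketch of the concept, not the paper's hardware screener; the class and its tolerance parameter are assumptions.

```python
# A minimal sketch, assuming per-PC min/max tracking, of perturbation-based screening.
from collections import defaultdict

class PerturbationScreener:
    def __init__(self, slack=0):
        self.bounds = defaultdict(lambda: [None, None])   # pc -> [min, max] seen so far
        self.slack = slack                                # optional tolerance margin

    def observe(self, pc, value):
        """Return True if `value` looks anomalous for the instruction at `pc`."""
        lo, hi = self.bounds[pc]
        if lo is None:                                    # first observation: just learn it
            self.bounds[pc] = [value, value]
            return False
        if value < lo - self.slack or value > hi + self.slack:
            return True                                   # outside learned bounds: flag it
        self.bounds[pc] = [min(lo, value), max(hi, value)]
        return False
```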


international symposium on microarchitecture | 2008

Online design bug detection: RTL analysis, flexible mechanisms, and evaluation

Kypros Constantinides; Onur Mutlu; Todd M. Austin

Higher levels of resource integration and the addition of new features in modern multi-processors put significant pressure on their verification. Although a large amount of resources and time are devoted to the verification phase of modern processors, many design bugs escape the verification process and slip into processors operating in the field. These design bugs often lead to lower quality products, lower customer satisfaction, diminishing brand/company reputation, or even expensive product recalls. This paper proposes a flexible, low-overhead mechanism to detect the occurrence of design bugs during on-line operation. First, we analyze the actual design bugs found and fixed in a commercial chip-multiprocessor, Sun's OpenSPARC T1, to understand the behavior and characteristics of design bugs. Our RTL analysis of design bugs shows that the number of signals that need to be monitored to detect design bugs is significantly larger than suggested by previous studies that analyzed design bugs at a higher level using processor errata sheets. Second, based on the insights obtained from our analyses, we propose a programmable, distributed online design bug detection mechanism that incorporates the monitoring of bugs into the flip-flops of the design. The key contribution of our mechanism is its ability to monitor all control signals in the design rather than a set of signals selected at design time. As a result, it is very flexible: when a bug is discovered after the processor is shipped, it can be detected by monitoring the set of control signals that trigger the design bug. We develop an RTL prototype implementation of our mechanism on the OpenSPARC T1 chip multiprocessor. We found its area overhead to be 10% and its power consumption overhead to be 3.5% over the whole OpenSPARC T1 chip.
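
A programmable trigger over monitored control signals can be thought of as a set of (mask, value) signatures checked every cycle. The sketch below is a software analogy of that idea; the signal widths, field names, and the example erratum are illustrative assumptions, not the OpenSPARC T1 implementation.

```python
# A minimal sketch, assuming control signals packed into one integer per cycle,
# of field-programmable design-bug signatures matched against monitored flip-flops.
class BugDetector:
    def __init__(self):
        self.signatures = []                     # programmed after a bug escapes to the field

    def program(self, mask, value, bug_id):
        self.signatures.append((mask, value, bug_id))

    def check_cycle(self, control_signals):
        """control_signals: int whose bits are the monitored control-signal values."""
        return [bug_id for mask, value, bug_id in self.signatures
                if control_signals & mask == value & mask]

# Hypothetical example: trigger when signal bit 3 is high while bit 7 is low.
det = BugDetector()
det.program(mask=0b10001000, value=0b00001000, bug_id="erratum-42")
print(det.check_cycle(0b00101010))               # -> ['erratum-42'] (bit 3 set, bit 7 clear)
```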


international conference on computer design | 2008

CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework

Andrea Pellegrini; Kypros Constantinides; Dan Zhang; Shobana Sudhakar; Valeria Bertacco; Todd M. Austin

Extreme scaling practices in silicon technology are quickly leading to integrated circuit components with limited reliability, where phenomena such as early transistor failures, gate-oxide wearout, and transient faults are becoming increasingly common. In order to overcome these issues and develop robust design techniques for large-market silicon ICs, it is necessary to rely on accurate failure analysis frameworks which enable design houses to faithfully evaluate both the impact of a wide range of potential failures and the ability of candidate reliability mechanisms to overcome them. Unfortunately, while failure rates are already growing beyond economically viable limits, no fault analysis framework is yet available that is both accurate and can operate on a complex integrated system. To address this void, we present CrashTest, a fast, high-fidelity and flexible resiliency analysis system. Given a hardware description model of the design under analysis, CrashTest is capable of orchestrating and performing a comprehensive design resiliency analysis by examining how the design reacts to faults while running software applications. Upon completion, CrashTest provides a high-fidelity analysis report obtained by performing a fault injection campaign at the gate-level netlist of the design. The fault injection and analysis process is significantly accelerated by the use of an FPGA hardware emulation platform. We conducted experimental evaluations on a range of systems, including a complex LEON-based system-on-chip, and evaluated the impact of gate-level injected faults at the system level. We found that CrashTest is 16-90x faster than an equivalent software-based framework when analyzing designs through direct primary I/Os. As shown by our LEON-based SoC experiments, CrashTest exhibits emulation speeds that are six orders of magnitude faster than simulation.
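
To make the gate-level fault-injection campaign concrete, here is a toy software model of the flow: inject a stuck-at fault on each net of a tiny netlist, rerun the stimuli, and count how often the fault becomes visible at the output. The netlist, gate set, and classification are illustrative assumptions; this is a plain simulation sketch, not CrashTest's FPGA-accelerated implementation.

```python
# A minimal sketch, assuming a three-gate toy netlist, of a stuck-at fault-injection campaign.
from itertools import product

NETLIST = [                                     # (output_net, gate_type, input_nets)
    ("n1", "AND", ("a", "b")),
    ("n2", "XOR", ("n1", "c")),
    ("out", "OR", ("n2", "a")),
]
GATES = {"AND": lambda x, y: x & y, "XOR": lambda x, y: x ^ y, "OR": lambda x, y: x | y}

def evaluate(inputs, stuck_at=None):
    """Evaluate the netlist; stuck_at=(net, value) forces one gate output to a constant."""
    nets = dict(inputs)
    for out, gate, ins in NETLIST:
        nets[out] = GATES[gate](*(nets[i] for i in ins))
        if stuck_at and out == stuck_at[0]:
            nets[out] = stuck_at[1]             # injected permanent fault
    return nets["out"]

# Campaign: every internal/output net, both stuck-at values, all primary input vectors.
for net, val in product(("n1", "n2", "out"), (0, 1)):
    mismatches = sum(evaluate(dict(zip("abc", v))) != evaluate(dict(zip("abc", v)), (net, val))
                     for v in product((0, 1), repeat=3))
    print(f"stuck-at-{val} on {net}: visible on {mismatches}/8 input vectors")
```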


IEEE Transactions on Computers | 2009

A Flexible Software-Based Framework for Online Detection of Hardware Defects

Kypros Constantinides; Onur Mutlu; Todd M. Austin; Valeria Bertacco

This work proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called access-control extensions (ACE), that can access and control the microprocessor's internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We describe and evaluate different execution models for using the ACE framework. We also describe how the proposed ACE framework can be extended and utilized to improve the quality of post-silicon debugging and manufacturing testing of modern processors. We evaluated our technique on a commercial chip-multiprocessor based on Sun's Niagara and found that it can provide very high coverage, with 99.22 percent of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5 percent. Based on a detailed register transfer level (RTL) implementation of our technique, we find its area and power consumption overheads to be modest, with a 5.8 percent increase in total chip area and a 4 percent increase in the chip's overall power consumption.


design, automation, and test in europe | 2007

Low-cost protection for SER upsets and silicon defects

Mojtaba Mehrara; Mona Attariyan; Smitha Shyam; Kypros Constantinides; Valeria Bertacco; Todd M. Austin

Extreme transistor scaling trends in silicon technology are soon to reach a point where manufactured systems will suffer from limited device reliability and severely reduced lifetime, due to early transistor failures, gate oxide wear-out, manufacturing defects, and radiation-induced soft errors (SER). In this paper we present a low-cost technique to harden a microprocessor pipeline and caches against these reliability threats. Our approach utilizes online built-in self-test (BIST) and microarchitectural checkpointing to detect, diagnose and recover the computation impaired by silicon defects or SER events. The approach works by periodically testing the processor to determine if the system is broken. If so, we reconfigure the processor to avoid using the broken component. A similar mechanism is used to detect SER faults, with the difference that recovery is implemented by re-execution. By utilizing low-cost techniques to address defects and SER, we keep protection costs significantly lower than traditional fault-tolerance approaches while providing high levels of coverage for a wide range of faults. Using detailed gate-level simulation, we find that our approach provides 95% and 99% coverage for silicon defects and SER events, respectively, with only a 14% area overhead.
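
The recovery decision described here, where the same detection mechanism handles both defects and SER events, can be summarized as a short branch: if a failed unit passes a repeated self-test, the fault was transient and re-execution suffices; otherwise it is treated as a permanent defect. The `cpu` interface below is a hypothetical stand-in, not the paper's hardware.

```python
# A minimal sketch, assuming a hypothetical `cpu` interface, of the recovery path:
# transient SER events are replayed, permanent defects are reconfigured around.
def recover(cpu, checkpoint, failed_unit):
    cpu.restore(checkpoint)                  # roll back to the last verified checkpoint
    if cpu.bist(failed_unit):                # unit now tests good: the fault was transient
        return "re-execute"                  # SER event: simply replay the interval
    cpu.disable(failed_unit)                 # unit still broken: permanent silicon defect
    return "reconfigured, degraded mode"     # continue on the remaining resources
```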


ACM Transactions on Architecture and Code Optimization | 2007

Architecting a reliable CMP switch architecture

Kypros Constantinides; Stephen M. Plaza; Jason A. Blome; Valeria Bertacco; Scott A. Mahlke; Todd M. Austin; Bin Zhang; Michael Orshansky

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this article, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. Our goal is to design a BulletProof CMP switch architecture capable of tolerating significant levels of various types of defects. We first assess the vulnerability of the CMP switch to transient faults. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our infrastructure represents a new level of fidelity in architectural-level fault analysis, as we can accurately track faults as they occur, noting whether they manifest or not, because of masking in the circuits, logic, or architecture. Our experimental results are quite illuminating. We find that transient faults, because of their fleeting nature, are of little concern for our CMP switch, even within large switch fabrics with fast clocks. Next, we develop a unified model of permanent faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with on-line repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naïve triple-modular redundancy, using domain-specific techniques, such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.


high performance embedded architectures and compilers | 2008

The significance of affectors and affectees correlations for branch prediction

Yiannakis Sazeides; Andreas Moustakas; Kypros Constantinides; Marios Kleanthous

This work investigates the potential of direction-correlations to improve branch prediction. There are two types of direction-correlation: affectors and affectees. This work considers for the first time their implications at a basic level. These correlations are determined based on dataflow graph information and are used to select the subset of global branch history bits used for prediction. If this subset is small, then affectors and affectees can be useful to cut down learning time and reduce aliasing in prediction tables. This paper extends previous work explaining why and how correlation-based predictors work by analyzing the properties of direction-correlations. It also shows that branch history selected using oracle knowledge of direction-correlations improves the accuracy of the limit and realistic conditional branch predictors that won the recent branch prediction contest by up to 30% and 17%, respectively. The findings in this paper call for the investigation of predictors that can efficiently learn correlations from long branch history whose relevant bits may be non-consecutive, with holes between them.
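
The core idea of selecting a subset of global history bits can be illustrated with a gshare-style predictor that indexes its table using only masked history bits. The mask here is supplied by hand as a stand-in for the paper's dataflow-derived affector/affectee selection; the table sizes and counter policy are assumptions.

```python
# A minimal sketch, assuming a hand-supplied correlation mask, of history-selective
# gshare-style prediction: only the masked (correlated) history bits form the index.
class SelectiveGshare:
    def __init__(self, history_len=16, table_bits=12):
        self.history = 0
        self.history_len = history_len
        self.table_bits = table_bits
        self.counters = [1] * (1 << table_bits)          # 2-bit saturating counters

    def _index(self, pc, mask):
        selected = self.history & mask                   # keep only correlated branches
        packed = 0
        for i in range(self.history_len):                # compact the selected bits together
            if (mask >> i) & 1:
                packed = (packed << 1) | ((selected >> i) & 1)
        return (packed ^ pc) & ((1 << self.table_bits) - 1)

    def predict(self, pc, mask):
        return self.counters[self._index(pc, mask)] >= 2

    def update(self, pc, mask, taken):
        idx = self._index(pc, mask)
        self.counters[idx] = min(3, self.counters[idx] + 1) if taken else max(0, self.counters[idx] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.history_len) - 1)
```

A smaller selected subset means fewer distinct indices per branch, which is exactly the shorter learning time and reduced table aliasing argued for above.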

Collaboration


Dive into Kypros Constantinides's collaborations.

Top Co-Authors

Bin Zhang (University of Texas at Austin)

Michael Orshansky (University of Texas at Austin)
