Valeria Bertacco | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Valeria Bertacco is active.

Explore More

Publication

Featured researches published by Valeria Bertacco.

design, automation, and test in europe | 2009

A highly resilient routing algorithm for fault-tolerant NoCs

David Fick; Andrew DeOrio; Gregory K. Chen; Valeria Bertacco; Dennis Sylvester; David T. Blaauw

Current trends in technology scaling foreshadow worsening transistor reliability as well as greater numbers of transistors in each system. The combination of these factors will soon make long-term product reliability extremely difficult in complex modern systems such as systems on a chip (SoC) and chip multiprocessor (CMP) designs, where even a single device failure can cause fatal system errors. Resiliency to device failure will be a necessary condition at future technology nodes. In this work, we present a network-on-chip (NoC) routing algorithm to boost the robustness in interconnect networks, by reconfiguring them to avoid faulty components while maintaining connectivity and correct operation. This distributed algorithm can be implemented in hardware with less than 300 gates per network router. Experimental results over a broad range of 2D-mesh and 2D-torus networks demonstrate 99.99% reliability on average when 10% of the interconnect links have failed.

design automation conference | 2009

Vicis: a reliable network for unreliable silicon

David Fick; Andrew DeOrio; Jin Hu; Valeria Bertacco; David T. Blaauw; Dennis Sylvester

Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. Creating large, flexible designs in this environment naturally lends itself to the repetitive nature of network-on-chip (NoC), but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout induced hard faults. Vicis uses the inherent redundancy in the network and its routers in order to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in-self-test (BIST) that diagnoses the locations of hard fault and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers work together to run distributed algorithms to solve network-wide problems as well, protecting the networking against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.

high-performance computer architecture | 2006

BulletProof: a defect-tolerant CMP switch architecture

Kypros Constantinides; Stephen M. Plaza; Jason A. Blome; Bin Zhang; Valeria Bertacco; Scott A. Mahlke; Todd M. Austin; Michael Orshansky

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this paper, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small aspect of a typical chip multiprocessor (CMP) system to study in detail, a single CMP router switch. To start, we develop a unified model of faults, based on the time-tested bathtub curve. Using this convenient abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with online repair and recovery capabilities. Protection is considered at multiple levels from the entire system down through arbitrary partitions of the design. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our experimental results are quite illuminating. We find that designs are attainable that can tolerate a larger number of defects with less overhead than naive triple-modular redundancy, using domain-specific techniques such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration.

architectural support for programming languages and operating systems | 2006

Ultra low-cost defect protection for microprocessor pipelines

Smitha Shyam; Kypros Constantinides; Sujay Phadke; Valeria Bertacco; Todd M. Austin

The sustained push toward smaller and smaller technology sizes has reached a point where device reliability has moved to the forefront of concerns for next-generation designs. Silicon failure mechanisms, such as transistor wearout and manufacturing defects, are a growing challenge that threatens the yield and product lifetime of future systems. In this paper we introduce the BulletProof pipeline, the first ultra low-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects. To achieve this goal we combine area-frugal on-line testing techniques and system-level checkpointing to provide the same guarantees of reliability found in traditional solutions, but at much lower cost. Our approach utilizes a microarchitectural checkpointing mechanism which creates coarse-grained epochs of execution, during which distributed on-line built in self-test (BIST) mechanisms validate the integrity of the underlying hardware. In case a failure is detected, we rely on the natural redundancy of instructionlevel parallel processors to repair the system so that it can still operate in a degraded performance mode. Using detailed circuit-level and architectural simulation, we find that our approach provides very high coverage of silicon defects (89%) with little area cost (5.8%). In addition, when a defect occurs, the subsequent degraded mode of operation was found to have only moderate performance impacts, (from 4% to 18% slowdown).

international conference on computer aided design | 1997

The disjunctive decomposition of logic functions

Valeria Bertacco; Maurizio Damiani

We present an algorithm for extracting a disjunctive decomposition from the BDD representation of a logic function F. The output of the algorithm is a multiple-level netlist exposing the hierarchical decomposition structure of the function. The algorithm has theoretical quadratic complexity in the size of the input BDD. Experimentally, we were able to decompose most synthesis benchmarks in less than one second of CPU time, and to report on the decomposability of several complex ISCAS combinational benchmarks. We found the final netlist to be often close to the output of more complex dedicated optimization tools.

asia and south pacific design automation conference | 2005

Opportunities and challenges for better than worst-case design

Todd M. Austin; Valeria Bertacco; David T. Blaauw; Trevor N. Mudge

The progressive trend of fabrication technologies towards the nanometer regime has created a number of new physical design challenges for computer architects. Design complexity, uncertainty in environmental and fabrication conditions, and single-event upsets all conspire to compromise system correctness and reliability. Recently, researchers have begun to advocate a new design strategy called better than worst-case design that couples a complex core component with a simple reliable checker mechanism. By delegating the responsibility for correctness and reliability of the design to the checker, it becomes possible to build probably correct designs that effectively address the challenges of deep submicron design. In this paper, we present the concepts of better than worst-case design and highlight two exemplary designs: the DIVA checker and Razor logic. We show how this approach to system implementation relaxes design constraints on core components, which reduces the effects of physical design challenges and creates opportunities to optimize performance and power characteristics. We demonstrate the advantages of relaxed design constraints for the core components by applying typical case optimization (TCO) techniques to an adder circuit. Finally, we discuss the challenges and opportunities posed to CAD tools in the context of better than worst-case design. In particular, we describe the additional support required for analyzing run time characteristics of designs and the many opportunities which are created to incorporate typical-case optimizations into synthesis and verification.

international symposium on microarchitecture | 2007

Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation

Kypros Constantinides; Onur Mutlu; Todd M. Austin; Valeria Bertacco

As silicon process technology scales deeper into the nanometer regime, hardware defects are becoming more common. Such defects are bound to hinder the correct operation of future processor systems, unless new online techniques become available to detect and to tolerate them while preserving the integrity of software applications running on the system. This paper proposes a new, software-based, defect detection and diagnosis technique. We introduce a novel set of instructions, called access-control extension (ACE), that can access and control the microprocessors internal state. Special firmware periodically suspends microprocessor execution and uses the ACE instructions to run directed tests on the hardware. When a hardware defect is present, these tests can diagnose and locate it, and then activate system repair through resource reconfiguration. The software nature of our framework makes it flexible: testing techniques can be modified/upgraded in the field to trade off performance with reliability without requiring any change to the hardware. We evaluated our technique on a commercial chip-multiprocessor based on Suns Niagara and found that it can provide very high coverage, with 99.22% of all silicon defects detected. Moreover, our results show that the average performance overhead of software-based testing is only 5.5%. Based on a detailed RTL-level implementation of our technique, we find its area overhead to be quite modest, with only a 5.8% increase in total chip area.

IEEE Design & Test of Computers | 2008

Reliable Systems on Unreliable Fabrics

Todd M. Austin; Valeria Bertacco; Scott A. Mahlke; Yu Cao

The continued scaling of silicon fabrication technology has led to significant reliability concerns, which are quickly becoming a dominant design challenge. Design integrity is threatened by complexity challenges in the form of immense designs defying complete verification, and physical challenges such as silicon aging and soft errors, which impair correct system operation. The Gigascale Systems Research Center Resilient-System Design Team is addressing these key challenges through synergistic research thrusts, ranging from near-term reliability stress reduction techniques to methods for improving the quality of todays silicon, to longer-term technologies that can detect, recover, and repair faulty systems. These efforts are supported and complemented by an active fault-modeling research effort and a strong focus on functional-verification methodologies. The teams goal is to provide highly effective, low-cost solutions to ensure both correctness and reliability in future designs and technology nodes, thereby extending the lifetime of silicon fabrication technologies beyond what can be currently foreseen as profitable.

design automation conference | 2009

Event-driven gate-level simulation with GP-GPUs

Debapriya Chatterjee; Andrew DeOrio; Valeria Bertacco

Logic simulation is a critical component of the design tool flow in modern hardware development efforts. It is used widely from high level descriptions down to gate level ones to validate several aspects of the design, particularly functional correctness. Despite development houses investing vast resources in the simulation task, particularly at the gate level, it is still far from achieving the performance demands required to validate complex modern designs. In this work, we propose the first event driven logic simulator accelerated by a parallel, general purpose graphics processor (GPGPU). Our simulator leverages a gate level event driven design to exploit the benefits of the low switching activity that is typical of large hardware designs. We developed novel algorithms for circuit netlist partitioning and optimized for a highly parallel GPGPU host. Moreover, our flow is structured to extract the best simulation performance from the target hardware platform. We found that our experimental prototype could handle large, industrial scale designs comprised of millions of gates and deliver a 13x speedup on average over current commercial event driven simulators.

international conference on parallel architectures and compilation techniques | 2011

ARIADNE: Agnostic Reconfiguration in a Disconnected Network Environment

Konstantinos Aisopos; Andrew DeOrio; Li Shiuan Peh; Valeria Bertacco

Extreme transistor technology scaling is causing increasing concerns in device reliability: the expected lifetime of individual transistors in complex chips is quickly decreasing, and the problem is expected to worsen at future technology nodes. With complex designs increasingly relying on Networks-on-Chip (NoCs) for on-chip data transfers, a NoC must continue to operate even in the face of many transistor failures. Specifically, it must be able to reconfigure and reroute packets around faults to enable continued operation, i.e., generate new routing paths to replace the old ones upon a failure. In addition to these reliability requirements, NoCs must maintain low latency and high throughput at very low area budget. In this work, we propose a distributed reconfiguration solution named Ariadne, targeting large, aggressively scaled, unreliable NoCs. Ariadne utilizes up*/down* for fast routing at high bandwidth, and upon any number of concurrent network failures in any location, it reconfigures to discover new resilient paths to connect the surviving nodes. Experimental results show that Ariadne provides a 40%-140% latency improvement (when subject to 50 faults in a 64-node NoC) over other on-chip state-of-the-art fault tolerant solutions, while meeting the low area budget of on-chip routers with an overhead of just 1.97%.

Explore More