Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Shubhendu S. Mukherjee is active.

Publication


Featured research published by Shubhendu S. Mukherjee.


international symposium on microarchitecture | 2003

A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor

Shubhendu S. Mukherjee; Christopher T. Weaver; Joel S. Emer; Steven K. Reinhardt; Todd M. Austin

Single-event upsets from particle strikes have become a key challenge in microprocessor design. Techniques to deal with these transient faults exist, but come at a cost. Designers clearly require accurate estimates of processor error rates to make appropriate cost/reliability tradeoffs. This paper describes a method for generating these estimates. A key aspect of this analysis is that some single-bit faults (such as those occurring in the branch predictor) do not produce an error in a program's output. We define a structure's architectural vulnerability factor (AVF) as the probability that a fault in that particular structure will result in an error. A structure's error rate is the product of its raw error rate, as determined by process and circuit technology, and the AVF. Unfortunately, computing the AVFs of complex structures, such as the instruction queue, can be quite involved. We identify numerous cases, such as prefetches, dynamically dead code, and wrong-path instructions, in which a fault does not affect correct execution. We instrument a detailed IA64 processor simulator to map bit-level microarchitectural state to these cases, generating per-structure AVF estimates. This analysis shows AVFs of 28% and 9% for the instruction queue and execution units, respectively, averaged across dynamic sections of the entire CPU2000 benchmark suite.
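
In this formulation, a structure's AVF reduces to the fraction of its bit-cycles that hold state required for architecturally correct execution (ACE). A minimal Python sketch of that accounting, with a hypothetical simulator-trace format assumed for illustration:

```python
# Minimal sketch of AVF accounting (the trace format is assumed, not the
# paper's tooling). Bits backing prefetches, dynamically dead code, or
# wrong-path instructions count as un-ACE: a fault there is harmless.

def avf(trace, num_bits, num_cycles):
    """trace yields one (cycle, bit_index, is_ace) record per bit-cycle."""
    ace_bit_cycles = sum(1 for (_, _, is_ace) in trace if is_ace)
    return ace_bit_cycles / (num_bits * num_cycles)

def effective_error_rate(raw_rate, avf_value):
    # A structure's error rate is its raw circuit-level rate scaled by AVF.
    return raw_rate * avf_value
```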


international symposium on computer architecture | 2000

Transient fault detection via simultaneous multithreading

Steven K. Reinhardt; Shubhendu S. Mukherjee

Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor, derived from a Simultaneous Multithreaded (SMT) processor, provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order. This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms, slack fetch and the branch outcome queue, that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.
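
A rough sketch of the output-comparison idea behind the sphere of replication, with illustrative names (this is not the paper's hardware design): only values leaving the sphere, here committed stores, are checked, so the two threads need not run in lockstep.

```python
from collections import deque

# Hypothetical sketch of SRT output comparison: the leading thread's
# committed stores wait in a queue until the trailing thread commits the
# corresponding store; only then is the value released outside the sphere.

store_compare_q = deque()

def commit_store(is_leading, addr, value):
    if is_leading:
        store_compare_q.append((addr, value))   # hold until checked
        return None
    expected = store_compare_q.popleft()        # trailing thread's turn
    if (addr, value) != expected:
        raise RuntimeError("transient fault detected at store boundary")
    return (addr, value)                        # safe to send to memory
```

The same queueing idea underlies slack fetch and the branch outcome queue: because the leading thread runs ahead, the trailing thread inherits resolved branch outcomes and already-warmed caches.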


international symposium on computer architecture | 2002

Detailed design and evaluation of redundant multi-threading alternatives

Shubhendu S. Mukherjee; Michael Kontz; Steven K. Reinhardt

Exponential growth in the number of on-chip transistors, coupled with reductions in voltage levels, makes each generation of microprocessors increasingly vulnerable to transient faults. In a multithreaded environment, we can detect these faults by running two copies of the same program as separate threads, feeding them identical inputs, and comparing their outputs, a technique we call Redundant Multithreading (RMT). This paper studies RMT techniques in the context of both single- and dual-processor simultaneous multithreaded (SMT) single-chip devices. Using a detailed, commercial-grade SMT processor design, we uncover subtle RMT implementation complexities and find that RMT can be a more significant burden for single-processor devices than prior studies indicate. However, a novel application of RMT techniques in a dual-processor device, which we term chip-level redundant threading (CRT), shows higher performance than lockstepping the two cores, especially on multithreaded workloads.
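
A hedged sketch of the CRT pairing (names and structure are illustrative, not the paper's design): unlike lockstepping, which demands cycle-by-cycle equality between the two cores, CRT routes a leading thread's committed results through a cross-core queue to a trailing thread on the other core.

```python
from collections import deque

# Illustrative CRT arrangement: program A leads on core 0 and trails on
# core 1, program B does the opposite, so checker traffic flows both ways.

leading_core = {"A": 0, "B": 1}
cross_core_q = {"A": deque(), "B": deque()}

def commit(core, prog, result):
    if core == leading_core[prog]:
        cross_core_q[prog].append(result)       # forward to the other core
    else:
        expected = cross_core_q[prog].popleft()
        if result != expected:
            raise RuntimeError(f"fault detected in program {prog}")
```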


high-performance computer architecture | 2005

The soft error problem: an architectural perspective

Shubhendu S. Mukherjee; Joel S. Emer; Steven K. Reinhardt

Radiation-induced soft errors have emerged as a key challenge in computer system design. If the industry is to continue to provide customers with the level of reliability they expect, microprocessor architects must address this challenge directly. This effort has two parts. First, architects must understand the impact of soft errors on their designs. Second, they must select judiciously from among available techniques to reduce this impact in order to meet their reliability targets with minimum overhead. To provide a foundation for these efforts, this paper gives a broad overview of the soft error problem from an architectural perspective. We start with basic definitions, followed by a description of techniques to compute the soft error rate. Then, we summarize techniques used to reduce the soft error rate. This paper also describes problems with double-bit errors. Finally, this paper outlines future directions for architecture research in soft errors.
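
As a concrete instance of the rate computation the overview covers, FIT (failures in time) rates combine additively across structures, each scaled by its AVF. The numbers below are purely illustrative (the AVFs echo the instruction-queue and execution-unit figures reported above; the raw rate and bit counts are assumptions):

```python
# Worked example with assumed numbers (not from the paper): 1 FIT is one
# failure per 10**9 device-hours, and per-structure FIT rates simply add.

RAW_FIT_PER_BIT = 0.001          # assumed circuit-level rate per bit

structures = {                   # (bits, AVF) -- hypothetical values
    "instruction_queue": (20_000, 0.28),
    "execution_units":   (10_000, 0.09),
}

total_fit = sum(bits * RAW_FIT_PER_BIT * avf
                for bits, avf in structures.values())
mttf_hours = 1e9 / total_fit
print(f"FIT = {total_fit:.1f}, MTTF = {mttf_hours:,.0f} hours")
```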


international symposium on computer architecture | 2004

Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Christopher T. Weaver; Joel S. Emer; Shubhendu S. Mukherjee; Steven K. Reinhardt

Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signaling such false errors, we modify the pipeline's error-detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
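
The MITF trade-off can be illustrated with a small calculation (all numbers assumed): squashing lowers IPC but also lowers the error rate, and MITF tells you whether the trade was worth it.

```python
# Hypothetical MITF comparison: MITF = instructions committed per failure
#                                    = IPC * frequency * MTTF.

def mitf(ipc, freq_hz, fit):
    mttf_seconds = (1e9 / fit) * 3600    # 1 FIT = 1 failure per 1e9 hours
    return ipc * freq_hz * mttf_seconds

baseline  = mitf(ipc=1.5, freq_hz=2e9, fit=50)   # assumed numbers
squashing = mitf(ipc=1.4, freq_hz=2e9, fit=35)   # lower IPC, fewer failures
print(squashing > baseline)   # True: reliability gain outweighs the IPC loss
```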


high performance interconnects | 2001

The Alpha 21364 network architecture

Shubhendu S. Mukherjee; Peter J. Bannon; Steven Lang; Aaron Spink; David Wightman Webb

The Alpha 21364 processor provides a high-performance, highly scalable, and highly reliable network architecture. The router runs at 1.2 GHz and routes packets at a peak bandwidth of 22.4 GB/s. The network architecture scales up to a 128-processor configuration, which can support up to four terabytes of distributed Rambus memory and hundreds of terabytes of disk storage. The distributed Rambus memory is kept coherent via a scalable, directory-based cache coherence scheme. The network also provides a variety of reliability features, such as per-flit ECC. These features make the 21364 network architecture well-suited to support communication-intensive server applications.


IEEE Concurrency | 2000

Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator

Shubhendu S. Mukherjee; Steven K. Reinhardt; Babak Falsafi; Mike Litzkow; Mark D. Hill; David A. Wood; Steven Huss-Lederman; James R. Larus

To analyze new parallel computers, developers must rapidly simulate designs running realistic workloads. Historically, direct execution and a parallel host have accelerated simulations, although these techniques have typically lacked portability. Through four key operations, the Wisconsin Wind Tunnel II can easily run simulations on Sparc platforms ranging from a workstation cluster to a symmetric multiprocessor.


international symposium on computer architecture | 2005

Design and Evaluation of Hybrid Fault-Detection Systems

George A. Reis; Jonathan Chang; Neil Vachharajani; Ram Rangan; David I. August; Shubhendu S. Mukherjee

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, mean work to failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
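
The mean-work-to-failure idea can be made concrete with a toy comparison (illustrative numbers only): a software technique that lowers AVF but stretches run time is judged on reliability per unit of work, not per unit of time.

```python
# Toy MWTF comparison with assumed values:
#   MWTF = work completed / errors = 1 / (raw_rate * AVF * time_per_work)

def mwtf(raw_rate, avf, time_per_unit_work):
    return 1.0 / (raw_rate * avf * time_per_unit_work)

hardware_only = mwtf(raw_rate=1e-9, avf=0.05, time_per_unit_work=1.0)
software_only = mwtf(raw_rate=1e-9, avf=0.02, time_per_unit_work=1.8)
print(software_only / hardware_only)   # >1 means more work per failure
```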


international symposium on computer architecture | 2009

Architectural core salvaging in a multi-core processor for hard-error tolerance

Michael D. Powell; Arijit Biswas; Shantanu Gupta; Shubhendu S. Mukherjee

The incidence of hard errors in CPUs is a challenge for future multicore designs due to increasing total core area. Even if the location and nature of hard errors are known a priori, either at manufacture time or in the field, cores with such errors must be disabled in the absence of hard-error tolerance. While caches, with their regular and repetitive structures, are easily covered against hard errors by providing spare arrays or spare lines, structures within a core are neither as regular nor as repetitive. Previous work has proposed microarchitectural core salvaging to exploit structural redundancy within a core and maintain functionality in the presence of hard errors. Unfortunately, microarchitectural salvaging introduces complexity and may provide only limited coverage of core area against hard errors due to a lack of natural redundancy in the core. This paper makes a case for architectural core salvaging. We observe that even if some individual cores cannot execute certain operations, a CPU die can be instruction-set-architecture (ISA) compliant, that is, execute all of the instructions required by its ISA, by exploiting natural cross-core redundancy. We propose using hardware to migrate offending threads to another core that can execute the operation. Architectural core salvaging can cover a large core area against faults and can be implemented by leveraging known techniques that minimize changes to the microarchitecture. We show it is possible to optimize architectural core salvaging such that performance on a faulty die approaches that of a fault-free die, assuring significantly better performance than core disabling for many workloads and no worse performance than core disabling for the remainder.
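
The cross-core redundancy argument can be sketched in a few lines (the core capabilities and operation classes below are hypothetical): the die remains ISA-compliant as long as every operation class executes somewhere, and a thread that hits a defective unit simply migrates.

```python
# Illustrative model of architectural core salvaging: per-core defect sets
# and a migration policy that finds a core able to execute the operation.

core_defects = {0: {"fp_divide"}, 1: set(), 2: {"simd"}, 3: set()}

def pick_core(op_class, current_core):
    if op_class not in core_defects[current_core]:
        return current_core                 # no migration needed
    for core, defects in core_defects.items():
        if op_class not in defects:
            return core                     # migrate the offending thread
    raise RuntimeError(f"die is not ISA-compliant for {op_class}")

assert pick_core("fp_divide", 0) == 1       # thread leaves the faulty core
```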


acm sigplan symposium on principles and practice of parallel programming | 1995

Efficient support for irregular applications on distributed-memory machines

Shubhendu S. Mukherjee; Shamik D. Sharma; Mark D. Hill; James R. Larus; Anne Rogers; Joel H. Saltz

Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive, and so would seem appropriate for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance. This paper explores three issues crucial to the efficient execution of irregular problems on distributed-memory machines: partitioning, mutual exclusion, and data transfer. Unlike previous work, we studied the same programs running in three alternative systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, Transparent Shared Memory (TSM), and eXtensible Shared Memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems were somewhat (13%) to significantly (991%) faster than TSM.
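
What makes these applications "irregular" is that the communication pattern is data-dependent. A toy example (shapes and sizes assumed, not from the paper): the edge list is known only at run time, so partitioning and data transfer cannot be planned statically.

```python
import random

# Indirect, data-dependent gather/scatter: which elements of x each y[dst]
# needs depends on an index structure that only exists at run time.

n = 8
x = [float(i) for i in range(n)]
edges = [(random.randrange(n), random.randrange(n)) for _ in range(16)]

y = [0.0] * n
for src, dst in edges:
    y[dst] += x[src]
```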

Collaboration


Dive into Shubhendu S. Mukherjee's collaborations.

Top Co-Authors

Joel S. Emer
Massachusetts Institute of Technology

Mark D. Hill
University of Wisconsin-Madison

David A. Wood
University of Wisconsin-Madison