Publication


Featured research published by Shuguang Feng.


International Symposium on Microarchitecture | 2007

Self-calibrating Online Wearout Detection

Jason A. Blome; Shuguang Feng; Shantanu Gupta; Scott A. Mahlke

Technology scaling, characterized by decreasing feature size, thinning gate oxide, and non-ideal voltage scaling, will become a major hindrance to microprocessor reliability in future technology generations. Physical analysis of device failure mechanisms has shown that most wearout mechanisms projected to plague future technology generations are progressive, meaning that the circuit-level effects of wearout develop and intensify with age over the lifetime of the chip. This work leverages the progression of wearout over time to present a low-cost hardware structure that identifies increasing propagation delay, which is symptomatic of many forms of wearout, to accurately forecast the failure of microarchitectural structures. To motivate the use of this predictive technique, an HSPICE analysis of the effects of one particular failure mechanism, gate oxide breakdown, on gates from a standard cell library characterized for a 90 nm process is presented. This gate-level analysis is then used to demonstrate the aggregate change in output delay of high-level structures within a synthesized Verilog model of an embedded microprocessor core. Leveraging this analysis, a self-calibrating hardware structure for conducting statistical analysis of output delay is presented and its efficacy in predicting the failure of a variety of structures within the microprocessor core is evaluated.
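To make the statistical detection idea concrete, here is a minimal software sketch of delay-based wearout prediction. It is an illustration only, not the paper's hardware design: the sampling interface, calibration length, window size, and drift threshold are all assumptions.

```python
from collections import deque

class WearoutDetector:
    """Illustrative model of delay-based wearout detection: learn the
    healthy delay distribution during a calibration phase, then flag a
    structure whose recent mean delay drifts too far above it."""

    def __init__(self, calib_samples=1000, window=100, threshold=3.0):
        self.calib_samples = calib_samples
        self.window = deque(maxlen=window)   # recent delay samples
        self.calib = []
        self.mean = None
        self.std = None
        self.threshold = threshold           # drift limit, in std devs

    def observe(self, delay):
        # Self-calibration phase: collect early-life delay samples.
        if len(self.calib) < self.calib_samples:
            self.calib.append(delay)
            if len(self.calib) == self.calib_samples:
                n = len(self.calib)
                self.mean = sum(self.calib) / n
                var = sum((d - self.mean) ** 2 for d in self.calib) / n
                self.std = var ** 0.5
            return False
        # Monitoring phase: compare the recent average to calibration.
        self.window.append(delay)
        recent = sum(self.window) / len(self.window)
        return recent > self.mean + self.threshold * self.std
```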


International Symposium on Microarchitecture | 2009

ZerehCache: armoring cache architectures in high defect density technologies

Amin Ansari; Shantanu Gupta; Shuguang Feng; Scott A. Mahlke

Aggressive technology scaling to 45 nm and below introduces serious reliability challenges to the design of microprocessors. Large SRAM structures used for caches are particularly sensitive to process variation due to their high density and organization. Designers typically over-provision caches with additional resources to overcome hard faults. However, static allocation and binding of redundant resources result in low utilization of the extra resources and ultimately limit the number of defects that can be tolerated. This work re-examines the design of process-variation-tolerant on-chip caches with a focus on flexibility and dynamic reconfigurability, allowing a large number of defects to be tolerated with modest hardware overhead. Our approach, ZerehCache, combines redundant data array elements with a permutation network to provide a higher degree of freedom in replacement. A graph coloring algorithm is used to configure the network and find the proper mapping of replacement elements. We perform an extensive design space exploration of both L1 and L2 caches to identify several Pareto-optimal ZerehCaches. For the yield analysis, a population of 1000 chips was studied at the 45 nm technology node; an L1 design with 16% area overhead and an L2 design with 8% area overhead achieve yields of 99% and 96%, respectively.
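The configuration step maps naturally onto graph coloring. The sketch below is an illustrative stand-in, not ZerehCache's actual algorithm: it assumes a conflict graph in which faulty cache lines that would compete for the same spare resource are connected, and assigns each line a spare group (color) with a simple greedy coloring.

```python
def greedy_color(conflicts):
    """Greedy coloring over a conflict graph given as an adjacency
    dict {node: set(neighbors)}. Adjacent nodes (lines competing for
    the same spare) receive different colors; each color models one
    redundant array element."""
    colors = {}
    # Color high-degree nodes first; this usually keeps colors few.
    for node in sorted(conflicts, key=lambda n: -len(conflicts[n])):
        used = {colors[nbr] for nbr in conflicts[node] if nbr in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors

# Hypothetical conflict graph among four faulty cache lines.
conflicts = {
    "lineA": {"lineB", "lineC"},
    "lineB": {"lineA"},
    "lineC": {"lineA", "lineD"},
    "lineD": {"lineC"},
}
print(greedy_color(conflicts))  # e.g. {'lineA': 0, 'lineC': 1, ...}
```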


International Symposium on Microarchitecture | 2011

Bundled execution of recurring traces for energy-efficient general purpose processing

Shantanu Gupta; Shuguang Feng; Amin Ansari; Scott A. Mahlke; David I. August

Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In such a scenario, there is a need to incorporate energy-efficient processing resources that can enable more computation within the same power budget. Energy efficiency solutions in the past have typically relied on application-specific hardware and accelerators. Unfortunately, these approaches do not extend to general purpose applications due to their irregular and diverse code base. Towards this end, we propose BERET, an energy-efficient co-processor that can be configured to benefit a wide range of applications. Our approach identifies recurring instruction sequences as phases of “temporal regularity” in a program's execution, and maps suitable ones to the BERET hardware, a three-stage pipeline with a bundled execution model. This judicious off-loading of program execution to reduced-complexity hardware yields significant energy savings on instruction fetch, decode, and register file accesses. On average, BERET reduces energy consumption by a factor of 3-4 for the program regions selected across a range of general-purpose and media applications. The average energy savings for the entire application run was 35% over a single-issue in-order processor.
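BERET's first step, identifying recurring instruction sequences, can be approximated with hot-trace profiling. The toy sketch below is an assumption-laden illustration (the trace representation and threshold are invented), not the paper's selection heuristic.

```python
from collections import Counter

def hot_traces(trace_stream, min_count=100):
    """Count how often each dynamic trace (a tuple of basic-block IDs)
    recurs, and keep the ones frequent enough to be worth mapping onto
    an energy-efficient co-processor."""
    counts = Counter(tuple(t) for t in trace_stream)
    return [trace for trace, n in counts.items() if n >= min_count]

# Hypothetical profile: trace (3, 4, 7) dominates the execution.
profile = [(3, 4, 7)] * 250 + [(1, 2)] * 30 + [(3, 4, 7, 9)] * 5
print(hot_traces(profile))  # [(3, 4, 7)]
```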


International Symposium on Microarchitecture | 2008

The StageNet fabric for constructing resilient multicore systems

Shantanu Gupta; Shuguang Feng; Amin Ansari; Jason A. Blome; Scott A. Mahlke

Scaling of CMOS feature size has long been a source of dramatic performance gains. However, the reduction in voltage levels has not been able to match this rate of scaling, leading to increasing operating temperatures and current densities. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Consequently, high reliability and fault tolerance, which have traditionally been subjects of interest for high-end server markets, are now getting emphasis in the mainstream desktop and embedded systems space. The popular solution has been the use of redundancy at a coarse granularity, such as dual/triple modular redundancy. In this work, we challenge the practice of coarse-granularity redundancy by identifying its inability to scale to high failure rate scenarios and investigating the advantages of finer-grained configurations. To this end, this paper presents and evaluates a highly reconfigurable multicore architecture, named StageNet (SN), that is designed with reliability as its first-class design criterion. SN relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of a chip, gracefully degrading performance towards the end of life. Our results show that the proposed SN architecture can perform nearly 50% more cumulative work compared to a traditional multicore.
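The stage-borrowing idea can be illustrated with a small sketch. Everything here (the four-stage pipeline model, the per-core status format, the greedy stitching policy) is a hypothetical simplification of StageNet's reconfigurable interconnect, not the actual design.

```python
def form_pipelines(stage_status):
    """stage_status: list of per-core dicts mapping stage name ->
    True (working) / False (broken). Greedily stitch together logical
    pipelines by borrowing a working replica of each stage from any
    core, modeling StageNet-style reconfiguration."""
    stages = ["fetch", "decode", "execute", "writeback"]
    # Pool of working replicas per stage, identified by core index.
    pool = {s: [i for i, core in enumerate(stage_status) if core[s]]
            for s in stages}
    pipelines = []
    while all(pool[s] for s in stages):
        pipelines.append({s: pool[s].pop() for s in stages})
    return pipelines

# Hypothetical 3-core chip where core 0 lost its decode stage and
# core 2 lost its execute stage: two full pipelines still form.
status = [
    {"fetch": True, "decode": False, "execute": True, "writeback": True},
    {"fetch": True, "decode": True, "execute": True, "writeback": True},
    {"fetch": True, "decode": True, "execute": False, "writeback": True},
]
print(len(form_pipelines(status)))  # 2
```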


Compilers, Architecture, and Synthesis for Embedded Systems | 2006

Cost-efficient soft error protection for embedded microprocessors

Jason A. Blome; Shantanu Gupta; Shuguang Feng; Scott A. Mahlke

Device scaling trends dramatically increase the susceptibility of microprocessors to soft errors. Further, mounting demand for embedded microprocessors in a wide array of safety-critical applications, ranging from automobiles to pacemakers, compounds the importance of addressing the soft error problem. Historically, soft error tolerance techniques have been targeted mainly at high-end server markets, leading to solutions such as coarse-grained modular redundancy and redundant multithreading. However, these techniques tend to be prohibitively expensive to implement in the embedded design space. To address this problem, we first present a thorough analysis of the effects of soft errors on a production-grade, fully synthesized implementation of an ARM926EJ-S embedded microprocessor. We then leverage this analysis in the design of two orthogonal low-cost soft error protection techniques that can be tuned to achieve variable levels of fault coverage as a function of area and power constraints. The first technique uses a small cache of live register values to provide nearly twice the fault coverage of a register file protected using traditional error correcting codes, at little or no additional area cost. The second technique is a statistical method used to significantly reduce the overhead of deploying time-delayed shadow latches for low-latency fault detection.
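The intuition behind the first technique, keeping duplicate copies of live register values so reads can be cross-checked, can be sketched in a few lines. This model is purely illustrative; the capacity, eviction policy, and interface are assumptions rather than the paper's design.

```python
class RegisterValueCache:
    """Illustrative model of protecting recently written (live)
    register values: a small cache keeps a duplicate of each value, so
    a later read can be cross-checked to detect a soft error."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.copies = {}  # register index -> duplicate value

    def on_write(self, reg, value):
        # Keep a protected duplicate; evict the oldest entry when full.
        if reg not in self.copies and len(self.copies) >= self.capacity:
            self.copies.pop(next(iter(self.copies)))
        self.copies[reg] = value

    def check_read(self, reg, value_from_regfile):
        # A mismatch on a cached register indicates a bit flip in one
        # of the copies; a real design would then correct or re-fetch.
        if reg in self.copies and self.copies[reg] != value_from_regfile:
            return False  # fault detected
        return True
```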


High-Performance Computer Architecture | 2011

Archipelago: A polymorphic cache design for enabling robust near-threshold operation

Amin Ansari; Shuguang Feng; Shantanu Gupta; Scott A. Mahlke

Extreme technology integration in the sub-micron regime comes with a rapid rise in heat dissipation and power density for modern processors. Dynamic voltage scaling is a widely used technique to tackle this problem when high performance is not the main concern. However, the minimum achievable supply voltage for the processor is often bounded by the large on-chip caches, since SRAM cells fail at a significantly faster rate than logic cells when the supply voltage is reduced. This is mainly due to the higher susceptibility of SRAM structures to process-induced parameter variations. In this work, we propose a highly flexible fault-tolerant cache design, Archipelago, that by reconfiguring its internal organization can efficiently tolerate the large number of SRAM failures that arise when operating in the near-threshold region. Archipelago partitions the cache into multiple autonomous islands of various sizes, which can operate correctly without borrowing redundancy from each other. Our configuration algorithm, an adapted version of minimum clique covering, exploits the high degree of flexibility in the Archipelago architecture to reduce the granularity of redundancy replacement and minimize the amount of cache space lost when operating in the near-threshold region. Using our approach, the operational voltage of a processor can be reduced to 375 mV, which translates to 79% dynamic and 51% leakage power savings (in 90 nm) for a microprocessor similar to the Alpha 21364. These power savings come with a 4.6% performance loss when operating in low-power mode and a 2% area overhead for the microprocessor.
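The clique-covering configuration step can be illustrated with a greedy sketch. The compatibility relation below (lines are compatible when their faulty bit positions do not overlap) and the greedy grouping are simplifying assumptions, not Archipelago's exact algorithm.

```python
def greedy_clique_cover(lines, compatible):
    """Greedy clique covering over a compatibility relation: 'lines'
    is a list of cache-line IDs and compatible(a, b) says whether two
    lines' SRAM failure maps do not collide, so one can serve as
    redundancy for the other. Each group models one autonomous island."""
    groups = []
    for line in lines:
        for group in groups:
            # Join the first group this line is compatible with.
            if all(compatible(line, member) for member in group):
                group.append(line)
                break
        else:
            groups.append([line])  # no fit: start a new island
    return groups

# Hypothetical per-line failure bitmaps; lines are compatible when
# their faulty bit positions do not overlap.
faults = {"l0": {3}, "l1": {3, 7}, "l2": {5}, "l3": set()}
compat = lambda a, b: not (faults[a] & faults[b])
print(greedy_clique_cover(list(faults), compat))
# [['l0', 'l2', 'l3'], ['l1']]
```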


International Symposium on Microarchitecture | 2011

Encore: low-cost, fine-grained transient fault recovery

Shuguang Feng; Shantanu Gupta; Amin Ansari; Scott A. Mahlke; David I. August

To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of reliability. Given the rise of processor reliability as a first-order design constraint, there has been a growing interest in low-cost, non-intrusive techniques for transient fault detection. Many of these recent proposals have counted on the availability of hardware recovery mechanisms. Although common in aggressive out-of-order cores, hardware support for speculative rollback and recovery is less common in lower-end commodity processors. This paper presents Encore, a software-based fault recovery mechanism tailored for these lower-cost systems that lack native hardware support for speculative rollback recovery. Encore combines program analysis, profile data, and simple code transformations to create statistically idempotent code regions that can recover from faults at very little cost. Using this software-only, compiler-based approach, Encore provides the ability to recover from transient faults without specialized hardware or the costs of traditional, full-system checkpointing solutions. Experimental results show that Encore, with just 14% runtime overhead, can safely recover, on average, from 97% of transient faults when coupled with existing detection schemes.
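The key property, an idempotent region that can simply be re-executed from its entry after a fault, hinges on the absence of write-after-read antidependences on memory. The toy check below illustrates that property on an access trace; registers, control flow, and the statistical relaxation Encore applies are deliberately ignored in this sketch.

```python
def is_idempotent(accesses):
    """Check a region's memory-access trace, given as a list of
    ('read' | 'write', address) pairs, for write-after-read
    antidependences. A region that never overwrites a location it
    previously read can be safely re-executed after a fault."""
    read_set = set()
    for kind, addr in accesses:
        if kind == "read":
            read_set.add(addr)
        elif kind == "write" and addr in read_set:
            return False  # re-execution would see a clobbered input
    return True

print(is_idempotent([("read", 0x10), ("write", 0x20)]))  # True
print(is_idempotent([("read", 0x10), ("write", 0x10)]))  # False
```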


High Performance Embedded Architectures and Compilers | 2010

Maestro: orchestrating lifetime reliability in chip multiprocessors

Shuguang Feng; Shantanu Gupta; Amin Ansari; Scott A. Mahlke

As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and time-dependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro's wearout-centric scheduling outperformed both performance-counter- and temperature-sensor-based schedulers, achieving an order of magnitude more improvement in lifetime throughput, the amount of useful work done by a system prior to failure.
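A wearout-centric assignment can be illustrated with a toy heuristic: pair the most demanding jobs with the healthiest cores so that aged cores see lighter load. This is a sketch under invented inputs, not Maestro's actual runtime heuristics.

```python
def wear_leveling_schedule(jobs, core_health):
    """Toy wearout-centric assignment. 'jobs' maps job -> demand
    estimate and 'core_health' maps core -> remaining-health estimate
    (as might come from low-level sensors); both are hypothetical
    inputs. Heaviest jobs go to the healthiest cores, evening out
    wear across the CMP."""
    by_demand = sorted(jobs, key=jobs.get, reverse=True)
    by_health = sorted(core_health, key=core_health.get, reverse=True)
    return dict(zip(by_demand, by_health))

jobs = {"render": 0.9, "index": 0.5, "idle_poll": 0.1}
health = {"core0": 0.4, "core1": 0.95, "core2": 0.7}
print(wear_leveling_schedule(jobs, health))
# {'render': 'core1', 'index': 'core2', 'idle_poll': 'core0'}
```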


IEEE Transactions on Computers | 2011

StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs

Shantanu Gupta; Shuguang Feng; Amin Ansari; Scott A. Mahlke

CMOS scaling has long been a source of dramatic performance gains. However, semiconductor feature size reduction has resulted in increasing levels of operating temperatures and current densities. Given that most wearout mechanisms are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Consequently, fault tolerance, which has traditionally been a subject of interest for high-end server markets, is now getting emphasis in the mainstream computing systems space. The popular solution has been the use of redundancy at a coarse granularity, such as dual/triple modular redundancy. In this work, we challenge the practice of coarse-granularity redundancy by identifying its inability to scale to high failure rate scenarios and investigating the advantages of finer-grained configurations. To this end, this paper presents and evaluates a highly reconfigurable CMP architecture, named StageNet (SN), that is designed with reliability as its first-class design criterion. SN relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of a chip, gracefully degrading performance toward the end of life. Our results show that the proposed SN architecture can perform 40 percent more cumulative work than a traditional CMP over a 12-year lifetime.
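The cumulative-work metric is simply throughput integrated over the service life. The illustration below uses made-up per-year throughput numbers to show how a gracefully degrading design can out-accumulate one whose cores fail outright; it does not reproduce the paper's experiments.

```python
def cumulative_work(throughput_by_year):
    """Cumulative work: per-year throughput summed over the service
    life (a discrete stand-in for integrating throughput over time)."""
    return sum(throughput_by_year)

# Made-up throughput profiles over a 12-year lifetime: a rigid CMP
# loses whole cores and stops contributing at year 6, while a
# StageNet-style fabric keeps running at reduced throughput.
rigid_cmp = [1.0] * 6 + [0.0] * 6
stagenet = [1.0] * 6 + [0.4] * 6
print(cumulative_work(stagenet) / cumulative_work(rigid_cmp))  # 1.4
```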


International Symposium on Computer Architecture | 2010

Necromancer: enhancing system throughput by animating dead cores

Amin Ansari; Shuguang Feng; Shantanu Gupta; Scott A. Mahlke

Aggressive technology scaling into the nanometer regime has led to a host of reliability challenges in the last several years. Unlike on-chip caches, which can be efficiently protected using conventional schemes, the general core area is less homogeneous and structured, making tolerating defects a much more challenging problem. Due to the lack of effective solutions, disabling non-functional cores is a common practice in industry to enhance manufacturing yield, which results in a significant reduction in system throughput. Although a faulty core cannot be trusted to correctly execute programs, we observe in this work that for most defects, when starting from a valid architectural state, execution traces on a defective core actually coarsely resemble those of fault-free executions. In light of this insight, we propose a robust and heterogeneous core coupling execution scheme, Necromancer, that exploits a functionally dead core to improve system throughput by supplying hints regarding high-level program behavior. We partition the cores in a conventional CMP system into multiple groups in which each group shares a lightweight core that can be substantially accelerated using these execution hints from a potentially dead core. To prevent this undead core from wandering too far from the correct path of execution, we dynamically resynchronize architectural state with the lightweight core. For a 4-core CMP system, on average, our approach enables the coupled core to achieve 78.5% of the performance of a fully functioning core. This defect tolerance and throughput enhancement comes at modest area and power overheads of 5.3% and 8.5%, respectively.
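The coupling scheme depends on keeping the untrusted core coarsely on the correct path through periodic resynchronization. The toy model below illustrates just that mechanism, with invented state (a single accumulator) and fault injection; it is not the paper's microarchitecture.

```python
def coupled_execute(trace, flip_at, resync_every=4):
    """Toy model of core coupling: an untrusted 'undead' core runs the
    same computation as the trusted lightweight core but corrupts one
    value (a fault at step flip_at). Periodic resynchronization copies
    the trusted architectural state back into the undead core so its
    hints stay coarsely on track."""
    trusted = undead = 0
    useful_hints = 0
    for i, value in enumerate(trace):
        trusted += value
        undead += value ^ 0xFF if i == flip_at else value  # inject fault
        if undead == trusted:
            useful_hints += 1            # hint usable this step
        if (i + 1) % resync_every == 0:
            undead = trusted             # resynchronize state
    return useful_hints / len(trace)

print(coupled_execute(list(range(16)), flip_at=5))  # 0.8125
```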

Collaboration


Dive into Shuguang Feng's collaborations.

Top Co-Authors
Amin Ansari

University of Michigan
