Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Amin Ansari is active.

Publication


Featured research published by Amin Ansari.


International Symposium on Microarchitecture | 2009

ZerehCache: armoring cache architectures in high defect density technologies

Amin Ansari; Shantanu Gupta; Shuguang Feng; Scott A. Mahlke

Aggressive technology scaling to 45 nm and below introduces serious reliability challenges to the design of microprocessors. Large SRAM structures used for caches are particularly sensitive to process variation due to their high density and organization. Designers typically over-provision caches with additional resources to overcome hard faults. However, static allocation and binding of redundant resources results in low utilization of the extra resources and ultimately limits the number of defects that can be tolerated. This work re-examines the design of process variation tolerant on-chip caches with a focus on flexibility and dynamic reconfigurability, allowing a large number of defects to be tolerated with modest hardware overhead. Our approach, ZerehCache, combines redundant data array elements with a permutation network to provide a higher degree of freedom in replacement. A graph coloring algorithm is used to configure the network and find the proper mapping of replacement elements. We perform an extensive design space exploration of both L1/L2 caches to identify several Pareto-optimal ZerehCaches. For the yield analysis, a population of 1000 chips was studied at the 45 nm technology node; an L1 design with 16% and an L2 design with 8% area overhead achieve yields of 99% and 96%, respectively.
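The graph-coloring configuration step can be illustrated with a minimal sketch. This is not the paper's algorithm or data layout: the conflict graph, greedy coloring, and line names below are all illustrative. Faulty cache lines are nodes, and an edge joins two lines that would compete for the same redundant element, so they must be assigned distinct spare rows (colors):

```python
def greedy_coloring(nodes, edges):
    # Build adjacency sets for the conflict graph.
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for n in nodes:  # visit in a fixed order for determinism
        used = {color[m] for m in adj[n] if m in color}
        c = 0
        while c in used:  # smallest color not used by a neighbor
            c += 1
        color[n] = c
    return color

# Hypothetical faulty lines; L0, L3, L5 pairwise conflict (a triangle),
# so three spare rows are needed, while L9 can reuse row 0.
faulty = ["L0", "L3", "L5", "L9"]
conflicts = [("L0", "L3"), ("L3", "L5"), ("L0", "L5")]
assignment = greedy_coloring(faulty, conflicts)
```

The number of distinct colors corresponds to the number of spare-array rows the permutation network must be able to reach.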


International Symposium on Microarchitecture | 2011

Bundled execution of recurring traces for energy-efficient general purpose processing

Shantanu Gupta; Shuguang Feng; Amin Ansari; Scott A. Mahlke; David I. August

Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In such a scenario, there is a need to incorporate energy-efficient processing resources that can enable more computation within the same power budget. Energy efficiency solutions in the past have typically relied on application-specific hardware and accelerators. Unfortunately, these approaches do not extend to general purpose applications due to their irregular and diverse code base. Towards this end, we propose BERET, an energy-efficient co-processor that can be configured to benefit a wide range of applications. Our approach identifies recurring instruction sequences as phases of “temporal regularity” in a program's execution, and maps suitable ones to the BERET hardware, a three-stage pipeline with a bundled execution model. This judicious off-loading of program execution to reduced-complexity hardware yields significant energy savings in instruction fetch, decode, and register file accesses. On average, BERET reduces energy consumption by a factor of 3–4X for the program regions selected across a range of general-purpose and media applications. The average energy savings for the entire application run was 35% over a single-issue in-order processor.
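The idea of spotting "temporal regularity" can be sketched with a simple sliding-window repeat counter over a dynamic basic-block trace. This is an assumption-laden toy, not BERET's actual profiling pass; the window size, threshold, and trace below are illustrative:

```python
from collections import Counter

def find_recurring_traces(block_trace, window=3, min_count=2):
    # Slide a fixed-size window over the dynamic basic-block trace and
    # count repeats; windows that recur often are candidate regions of
    # temporal regularity worth off-loading to the co-processor.
    counts = Counter(
        tuple(block_trace[i:i + window])
        for i in range(len(block_trace) - window + 1)
    )
    return {seq: n for seq, n in counts.items() if n >= min_count}

# A hot loop body A-B-C followed by a one-off exit block D.
trace = ["A", "B", "C", "A", "B", "C", "A", "B", "C", "D"]
hot = find_recurring_traces(trace)
```

Sequences that clear the threshold would then be checked for suitability (size, operations supported by the bundled-execution pipeline) before being mapped to hardware.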


International Symposium on Microarchitecture | 2008

The StageNet fabric for constructing resilient multicore systems

Shantanu Gupta; Shuguang Feng; Amin Ansari; Jason A. Blome; Scott A. Mahlke

Scaling of CMOS feature size has long been a source of dramatic performance gains. However, the reduction in voltage levels has not been able to match this rate of scaling, leading to increasing operating temperatures and current densities. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Consequently, high reliability and fault tolerance, which have traditionally been subjects of interest for high-end server markets, are now getting emphasis in the mainstream desktop and embedded systems space. The popular solution for this has been the use of redundancy at a coarse granularity, such as dual/triple modular redundancy. In this work, we challenge the practice of coarse-granularity redundancy by identifying its inability to scale to high failure rate scenarios and investigating the advantages of finer-grained configurations. To this end, this paper presents and evaluates a highly reconfigurable multicore architecture, named StageNet (SN), that is designed with reliability as a first-class design criterion. SN relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of a chip, gracefully degrading performance towards the end of life. Our results show that the proposed SN architecture can perform nearly 50% more cumulative work compared to a traditional multicore.


High-Performance Computer Architecture | 2011

Archipelago: A polymorphic cache design for enabling robust near-threshold operation

Amin Ansari; Shuguang Feng; Shantanu Gupta; Scott A. Mahlke

Extreme technology integration in the sub-micron regime comes with a rapid rise in heat dissipation and power density for modern processors. Dynamic voltage scaling is a widely used technique to tackle this problem when high performance is not the main concern. However, the minimum achievable supply voltage for the processor is often bounded by the large on-chip caches, since SRAM cells fail at a significantly faster rate than logic cells when reducing supply voltage. This is mainly due to the higher susceptibility of the SRAM structures to process-induced parameter variations. In this work, we propose a highly flexible fault-tolerant cache design, Archipelago, that by reconfiguring its internal organization can efficiently tolerate the large number of SRAM failures that arise when operating in the near-threshold region. Archipelago partitions the cache into multiple autonomous islands of various sizes, which can operate correctly without borrowing redundancy from each other. Our configuration algorithm, an adapted version of minimum clique covering, exploits the high degree of flexibility in the Archipelago architecture to reduce the granularity of redundancy replacement and minimize the amount of space lost in the cache when operating in the near-threshold region. Using our approach, the operational voltage of a processor can be reduced to 375 mV, which translates to 79% dynamic and 51% leakage power savings (at 90 nm) for a microprocessor similar to the Alpha 21364. These power savings come with a 4.6% performance loss when operating in low power mode and 2% area overhead for the microprocessor.
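The clique-covering intuition, grouping cache lines whose fault patterns do not collide so one can patch another, can be sketched with a greedy first-fit pass. This is a toy stand-in for the paper's adapted minimum clique covering, with made-up fault maps:

```python
def greedy_clique_cover(fault_maps):
    # fault_maps: line -> set of faulty word indices. Two lines are
    # compatible if their fault sets are disjoint; a greedy first-fit
    # pass groups mutually compatible lines so members of a group can
    # supply each other's missing words.
    groups = []
    for line, faults in fault_maps.items():
        for group in groups:
            if all(faults.isdisjoint(fault_maps[m]) for m in group):
                group.append(line)
                break
        else:
            groups.append([line])
    return groups

# Hypothetical fault maps: A, B, D have pairwise-disjoint faults and can
# share one group; C collides with A on word 0 and needs its own.
maps = {"A": {0}, "B": {1}, "C": {0, 1}, "D": {2}}
cover = greedy_clique_cover(maps)
```

Fewer groups means less cache capacity sacrificed to redundancy; the real algorithm optimizes this grouping rather than taking the first fit.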


High-Performance Computer Architecture | 2014

Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules

Aditya Agrawal; Amin Ansari; Josep Torrellas

eDRAM cells require periodic refresh, which ends up consuming substantial energy for large last-level caches. In practice, it is well known that different eDRAM cells can exhibit very different charge-retention properties. Unfortunately, current systems pessimistically assume worst-case retention times, and end up refreshing all the cells at a conservatively high rate. In this paper, we propose an alternative approach. We use known facts about the factors that determine the retention properties of cells to build a new model of eDRAM retention times, called Mosaic. The model shows that the retention times of cells in large eDRAM modules exhibit spatial correlation. Therefore, we logically divide the eDRAM module into regions or tiles, profile the retention properties of each tile, and program their refresh requirements in small counters in the cache controller. With this architecture, also called Mosaic, we refresh each tile at a different rate. The result is a 20x reduction in the number of refreshes in large eDRAM modules, practically eliminating refresh as a source of energy consumption.
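The arithmetic behind per-tile refresh is straightforward to sketch. The retention values below are invented for illustration, not measured numbers from the paper:

```python
def refreshes_per_second(tile_retention_ms):
    # Each tile refreshes at the rate dictated by its own weakest cell,
    # rather than the chip-wide worst case.
    return [1000.0 / t for t in tile_retention_ms]

# Hypothetical minimum retention time per tile, in milliseconds.
tiles = [40.0, 200.0, 400.0, 100.0]

per_tile = refreshes_per_second(tiles)
mosaic = sum(per_tile)                          # per-tile policy
worst_case = (1000.0 / min(tiles)) * len(tiles) # uniform worst-case policy
```

With these numbers the uniform policy issues 100 refreshes per second while the per-tile policy issues 42.5, and the gap widens as retention becomes more skewed; spatial correlation is what keeps the per-tile counters few and small.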


International Symposium on Microarchitecture | 2011

Encore: low-cost, fine-grained transient fault recovery

Shuguang Feng; Shantanu Gupta; Amin Ansari; Scott A. Mahlke; David I. August

To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of reliability. Given the rise of processor reliability as a first-order design constraint, there has been a growing interest in low-cost, non-intrusive techniques for transient fault detection. Many of these recent proposals have counted on the availability of hardware recovery mechanisms. Although common in aggressive out-of-order cores, hardware support for speculative rollback and recovery is less common in lower-end commodity processors. This paper presents Encore, a software-based fault recovery mechanism tailored for these lower-cost systems that lack native hardware support for speculative rollback recovery. Encore combines program analysis, profile data, and simple code transformations to create statistically idempotent code regions that can recover from faults at very little cost. Using this software-only, compiler-based approach, Encore provides the ability to recover from transient faults without specialized hardware or the costs of traditional, full-system checkpointing solutions. Experimental results show that Encore, with just 14% runtime overhead, can safely recover, on average, from 97% of transient faults when coupled with existing detection schemes.
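The core property, idempotence, has a compact definition: a region can safely re-execute from its entry if it never overwrites one of its own inputs. A minimal dynamic checker over an access trace illustrates this (the trace format is invented; Encore's actual analysis is static and probabilistic, not a runtime check):

```python
def is_idempotent(ops):
    # ops: sequence of ('r'|'w', location). A location read before any
    # local write is a region input; writing such a location later means
    # re-execution from the region entry would see a corrupted input.
    live_in = set()
    written = set()
    for kind, loc in ops:
        if kind == "r":
            if loc not in written:
                live_in.add(loc)
        else:
            if loc in live_in:
                return False  # region overwrites its own input
            written.add(loc)
    return True
```

For example, read-x-then-write-y is re-executable, read-x-then-write-x is not, and write-x-then-read-x is fine because x is produced locally.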


High-Performance Computer Architecture | 2013

Refrint: Intelligent refresh to minimize power in on-chip multiprocessor cache hierarchies

Aditya Agrawal; Prabhat Jain; Amin Ansari; Josep Torrellas

As manycores use dynamic energy ever more efficiently, static power consumption becomes a major concern. In particular, in a large manycore running at a low voltage, leakage in on-chip memory modules contributes substantially to the chip's power draw. This is unfortunate, given that, intuitively, the large multi-level cache hierarchy of a manycore is likely to contain a lot of useless data. An effective way to reduce this problem is to use a low-leakage technology such as embedded DRAM (eDRAM). However, eDRAM requires refresh. In this paper, we examine the opportunity of minimizing on-chip memory power further by intelligently refreshing on-chip eDRAM. We present Refrint, a simple approach to perform fine-grained, intelligent refresh of on-chip eDRAM multiprocessor cache hierarchies. We introduce the Refrint algorithms and microarchitecture. We evaluate Refrint in a simulated manycore running 16-threaded parallel applications. We show that an eDRAM-based memory hierarchy with Refrint consumes only 30% of the energy of a conventional SRAM-based memory hierarchy, and induces a slowdown of only 6%. In contrast, an eDRAM-based memory hierarchy without Refrint consumes 56% of the energy of the conventional memory hierarchy, inducing a slowdown of 25%.
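One plausible shape of such a policy, refresh only lines expected to be reused, and write back rather than keep refreshing dirty lines that look dead, can be sketched as below. This is a guess at the flavor of the approach, not the Refrint algorithms themselves; the line format and liveness flag are invented:

```python
def lines_to_refresh(cache):
    # Partition valid lines: lines predicted live keep being refreshed;
    # dirty lines predicted dead are written back once and then allowed
    # to decay instead of being refreshed forever.
    refresh, writeback = [], []
    for line in cache:
        if not line["valid"]:
            continue  # invalid lines never need refresh
        if line["live"]:
            refresh.append(line["tag"])
        elif line["dirty"]:
            writeback.append(line["tag"])
    return refresh, writeback

# Hypothetical snapshot of three cache lines.
cache = [
    {"tag": 0x10, "valid": True,  "live": True,  "dirty": False},
    {"tag": 0x20, "valid": True,  "live": False, "dirty": True},
    {"tag": 0x30, "valid": False, "live": False, "dirty": False},
]
refresh, writeback = lines_to_refresh(cache)
```

The win comes from the observation in the abstract that large hierarchies hold much useless data: every line moved out of the refresh set saves its entire stream of future refreshes.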


High Performance Embedded Architectures and Compilers | 2010

Maestro: orchestrating lifetime reliability in chip multiprocessors

Shuguang Feng; Shantanu Gupta; Amin Ansari; Scott A. Mahlke

As CMOS feature sizes venture deep into the nanometer regime, wearout mechanisms including negative-bias temperature instability and time-dependent dielectric breakdown can severely reduce processor operating lifetimes and performance. This paper presents an introspective reliability management system, Maestro, to tackle reliability challenges in future chip multiprocessors (CMPs) head-on. Unlike traditional approaches, Maestro relies on low-level sensors to monitor the CMP as it ages (introspection). Leveraging this real-time assessment of CMP health, runtime heuristics identify wearout-centric job assignments (management). By exploiting the complementary effects of the natural heterogeneity (due to process variation and wearout) that exists in CMPs and the diversity found in system workloads, Maestro composes job schedules that intelligently control the aging process. Monte Carlo experiments show that Maestro significantly enhances lifetime reliability through intelligent wear-leveling, increasing the expected service life of a population of 16-core CMPs by as much as 38% compared to a naive, round-robin scheduler. Furthermore, in the presence of process variation, Maestro's wearout-centric scheduling outperformed both performance-counter and temperature-sensor based schedulers, achieving an order of magnitude more improvement in lifetime throughput, the amount of useful work done by a system prior to failure.
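A rank-matching heuristic captures the wear-leveling intuition in a few lines: give the most demanding jobs to the healthiest cores so damage accumulates evenly. This is a simplification, Maestro's runtime heuristics use sensor feedback rather than a one-shot sort, and the health and intensity numbers here are invented:

```python
def wear_leveling_schedule(core_health, job_intensity):
    # Sort cores by remaining health and jobs by expected stress, then
    # match rank-for-rank: heaviest job on healthiest core.
    cores = sorted(core_health, key=core_health.get, reverse=True)
    jobs = sorted(job_intensity, key=job_intensity.get, reverse=True)
    return dict(zip(jobs, cores))

# Hypothetical health scores (1.0 = pristine) and job stress estimates.
health = {"c0": 0.9, "c1": 0.4, "c2": 0.7}
load = {"encode": 0.8, "idle": 0.1, "compile": 0.5}
plan = wear_leveling_schedule(health, load)
```

Repeating this as sensor readings evolve is what turns a static assignment into wear-leveling: a core that ages faster than its peers gradually drops in rank and receives lighter work.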


IEEE Transactions on Computers | 2011

StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs

Shantanu Gupta; Shuguang Feng; Amin Ansari; Scott A. Mahlke

CMOS scaling has long been a source of dramatic performance gains. However, semiconductor feature size reduction has resulted in increasing levels of operating temperatures and current densities. Given that most wearout mechanisms are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Consequently, fault tolerance, which has traditionally been a subject of interest for high-end server markets, is now getting emphasis in the mainstream computing systems space. The popular solution for this has been the use of redundancy at a coarse granularity, such as dual/triple modular redundancy. In this work, we challenge the practice of coarse-granularity redundancy by identifying its inability to scale to high failure rate scenarios and investigating the advantages of finer-grained configurations. To this end, this paper presents and evaluates a highly reconfigurable CMP architecture, named StageNet (SN), that is designed with reliability as its first-class design criterion. SN relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of a chip, gracefully degrading performance toward the end of life. Our results show that the proposed SN architecture can perform 40 percent more cumulative work compared to a traditional CMP over 12 years of its lifetime.


High-Performance Computer Architecture | 2014

Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks

Amin Ansari; Asit K. Mishra; Jianping Xu; Josep Torrellas

On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values, always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.
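The convergence loop described above can be sketched as a simple feedback rule per router group: keep stepping Vdd down while messages arrive error-free, and back off one step when errors appear. The step size, floor, and starting voltage below are illustrative, not values from the paper:

```python
def tune_vdd(vdd_mv, errors_observed, step_mv=10, floor_mv=600):
    # One adjustment interval for a router group: lower Vdd while its
    # messages are error-free; raise it back one step on any error.
    if errors_observed:
        return vdd_mv + step_mv
    return max(floor_mv, vdd_mv - step_mv)

# Three clean intervals, then an error: Vdd settles one step above the
# level that produced errors, while frequency stays untouched.
v = 900
for observed in [False, False, False, True]:
    v = tune_vdd(v, observed)
```

Because each group runs this loop independently, groups in favorable process corners settle at lower voltages than unlucky ones, which is where the variation-aware savings come from.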

Collaboration


Dive into Amin Ansari's collaborations.

Top Co-Authors

Dan Zhang

University of Texas at Austin
