Is this you? Create Your Porfile

Abbas Rahimi

University of California, Berkeley

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Abbas Rahimi is active.

Explore More

Publication

Featured researches published by Abbas Rahimi.

design, automation, and test in europe | 2011

A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters

Abbas Rahimi; Igor Loi; Mohammad Reza Kakoee; Luca Benini

Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and it is coupled with a design automation strategy mixing advanced synthesis and physical optimization to achieve optimal delay, power, area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65nm technology. Post placement&routing delay is 38FO4 for a configuration with 8 processors and 16 32-bit memories (8×16); when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically, to 84FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: in comparison to the maxperformance 16×32 configuration, there is potential to save power and area by 45% and 12 % respectively, at the expense of 30% performance degradation.

design, automation, and test in europe | 2016

Resistive configurable associative memory for approximate computing

Mohsen Imani; Abbas Rahimi; Tajana Simunic Rosing

Modern computing machines are increasingly characterized by large scale parallelism in hardware (such as GPGPUs) and advent of large scale and innovative memory blocks. Parallelism enables expanded performance tradeoffs whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern with pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or selective pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low overhead voltage overscaling, and different ReCAM update policies. Experimental result on the AMD Southern Islands GPUs for eight applications shows bitline-configurable and row-configurable ReCAM achieve on average to 43.6% and 44.5% energy savings with an acceptable quality loss of 10%.

design, automation, and test in europe | 2013

Hierarchically focused guardbanding: an adaptive approach to mitigate PVT variations and aging

Abbas Rahimi; Luca Benini; Rajesh K. Gupta

This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units in 45nm TSMC analysis flow. Using this model and PVTA monitoring circuits, we propose Hierarchically Focused Guardbanding (HFG) as a method to adaptively mitigate PVTA variations. We demonstrate the effectiveness of HFG on GPU architecture at two granularities of observation and adaptation: (i) fine-grained instruction-level; and (ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level adaptation, the throughput increases by 70% on average. By comparison, the instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8×–2.1× depending on the configuration of PVTA monitors and the type of instructions executed in the kernels.

design, automation, and test in europe | 2012

Analysis of instruction-level vulnerability to dynamic voltage and temperature variations

Abbas Rahimi; Luca Benini; Rajesh K. Gupta

Variation in performance and power across manufactured parts and their operating conditions is an accepted reality in aggressive CMOS processes. This paper considers challenges and opportunities in identifying this variation and methods to combat it for improved computing systems. We introduce the notion of instruction-level vulnerability (ILV) to expose variation and its effects to the software stack for use in architectural/compiler optimizations. To compute ILV, we quantify the effect of voltage and temperature variations on the performance and power of a 32-bit, RISC, in-order processor in 65 nm TSMC technology at the level of individual instructions. Results show 3.4 ns (68FO4) delay variation and 26.7x power variation among instructions, and across extreme corners. Our analysis shows that ILV is not uniform across the instruction set. In fact, ILV data partitions instructions into three equivalence classes. Based on this classification, we show how a low-overhead robustness enhancement techniques can be used to enhance performance by a factor of 1.1x-5.5x.

design, automation, and test in europe | 2015

Axilog: language support for approximate hardware design

Amir Yazdanbakhsh; Divya Mahajan; Bradley Thwaites; Jongse Park; Anandhavel Nagendrakumar; Sindhuja Sethuraman; Kartik Ramkrishnan; Nishanthi Ravindran; Rudra Jariwala; Abbas Rahimi; Hadi Esmaeilzadeh; Kia Bazargan

Relaxing the traditional abstraction of “near-perfect” accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations, that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design, while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designers annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions coupled with the automated analysis enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9× area reduction with 10% output quality loss.

design, automation, and test in europe | 2015

Approximate associative memristive memory for energy-efficient GPUs

Abbas Rahimi; Amirali Ghofrani; Kwang-Ting Cheng; Luca Benini; Rajesh K. Gupta

Multimedia applications running on thousands of deep and wide pipelines working concurrently in GPUs have been an important target for power minimization both at the architectural and algorithmic levels. At the hardware level, energy-efficiency techniques that employ voltage overscaling face a barrier so-called “path walls”: reducing operating voltage beyond a certain point generates massive number of timing errors that are impractical to tolerate. We propose an architectural innovation, called A2M2 module (approximate associative memristive memory) that exhibits few tolerable timing errors suitable for GPU applications under voltage overscaling. A2M2 is integrated with every floating point unit (FPU), and performs partial functionality of the associated FPU by pre-storing high frequency patterns for computational reuse that avoids overhead due to re-execution. Voltage overscaled A2M2 is designed to match an input search pattern with any of the stored patterns within a Hamming distance range of 0-2. This matching behavior under voltage overscaling leads to a controllable approximate computing for multimedia applications. Our experimental results for the AMD Southern Islands GPU show that four image processing kernels tolerate the mismatches during pattern matching resulting in a PSNR ≥ 30dB. The A2M2 module with 8-row enables 28% voltage overscaling in 45nm technology resulting in 32% average energy saving for the kernels, while delivering an acceptable quality of service.

international symposium on low power electronics and design | 2014

Energy-efficient mapping of biomedical applications on domain-specific accelerator under process variation

Mohammad Khavari Tavana; Amey M. Kulkarni; Abbas Rahimi; Tinoosh Mohsenin; Houman Homayoun

The variability of deep-submicron technologies creates systems with asymmetric cores from a frequency and leakage power viewpoint, which makes an opportunity for performance-power optimization. In particular, process variation can transform a homogeneous many-core platform into a heterogeneous system where the task mapping becomes extremely difficult. In this paper, we propose a mapping algorithm that selects an appropriate task mapping along with voltage and frequency assignment for a cluster of cores. The mapping algorithm, which is based on simulated annealing, determines cluster voltages and core frequencies to minimize energy consumption and EDP under process variation. We examine the effectiveness of our proposed algorithm on a fully placed and routed 128-core biomedical accelerator in 45nm when running various applications including compressive sensing, seizure detection and ultrasound spectral Doppler and linear regression. The results indicate that exposing frequency and power variations to the mapping algorithm results in up to 22% (on average 11%) energy saving and 31% (on average 19%) EDP improvement.

IEEE Transactions on Computers | 2014

Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability

Abbas Rahimi; Luca Benini; Rajesh K. Gupta

Traditional application execution assumes an error-free execution hardware and environment. Such guarantees in execution are achieved by providing guardbands in the design of microelectronic processors. In reality, applications exhibit varying degrees of tolerance to error in computations. This paper proposes an adaptive guardbanding technique to combat CMOS variability for error-tolerant (probabilistic) applications as well as traditional error-intolerant applications. The proposed technique leverages a combination of accurate design time analysis and a minimally intrusive runtime technique to mitigate Process, Voltage, and Temperature (PVT) variations for a near-zero area overhead. We demonstrate our approach on a 32-bit in-order RISC processor with full post Placement and Routing (P&R) layout results in TSMC 45 nm technology. The adaptive guardbanding technique eliminates traditional guardbands on operating frequency using information from PVT variations and application-specific requirements on computational accuracy. For error-intolerant applications, we introduce the notion of Sequence-Level Vulnerability (SLV) that utilizes circuit-level vulnerability for constructing high-level software knowledge as metadata. In effect, the SLV metadata partitions sequences of integer SPARC instructions into two equivalence classes to enable the adaptive guardbanding technique to adapt the frequency simultaneously for dynamic voltage and temperature variations, as well as adapt to the different classes of the instruction sequences. The proposed technique achieves on an average 1.6 × speedup for error-intolerant applications compared to recent work . For probabilistic applications, the adaptive technique guarantees the error-free operation of a set of paths of the processor that always require correct timing (Vulnerable Paths) while reducing the cost of guardbanding for the rest of the paths (Invulnerable Paths). This increases the throughput of probabilistic applications upto 1.9 × over the traditional worst-case design. The proposed technique has 0.022% area overhead, and imposes only 0.034% and 0.031% total power overhead for intolerant and probabilistic applications respectively.

international symposium on low power electronics and design | 2016

ACAM: Approximate Computing Based on Adaptive Associative Memory with Online Learning

Mohsen Imani; Yeseong Kim; Abbas Rahimi; Tajana Simunic Rosing

The Internet of Things (IoT) dramatically increases the amount of data to be processed for many applications including multimedia. Unlike traditional computing environment, the workload of IoT significantly varies overtime. Thus, an efficient runtime profiling is required to extract highly frequent computations and pre-store them for memory-based computing. In this paper, we propose an approximate computing technique using a low-cost adaptive associative memory, named ACAM, which utilizes runtime learning and profiling. To recognize the temporal locality of data in real-world applications, our design exploits a reinforcement learning algorithm with a least recently use (LRU) strategy to select images to be profiled; the profiler is implemented using an approximate concurrent state machine. The profiling results are then stored into ACAM for computation reuse. Since the selected images represent the observed input dataset, we can avoid redundant computations thanks to high hit rates displayed in the associative memory. We evaluate ACAM on the recent AMD Southern Island GPU architecture, and the experimental results shows that the proposed design achieves by 34.7% energy saving for image processing applications with an acceptable quality of service (i.e., PSNR>30dB).

design automation conference | 2013

Aging-aware compiler-directed VLIW assignment for GPGPU architectures

Abbas Rahimi; Luca Benini; Rajesh K. Gupta

Negative bias temperature instability (NBTI) adversely affects the reliability of a processor by introducing new delay-induced faults. However, the effect of these delay variations is not uniformly spread across functional units and instructions: some are affected more (hence less reliable) than others. This paper proposes a NBTI-aware compiler-directed very long instruction word (VLIW) assignment scheme that uniformly distributes the stress of instructions with the aim of minimizing aging of GPGPU architecture without any performance penalty. The proposed solution is an entirely software technique based on static workload characterization and online execution with NBTI monitoring that equalizes the expected lifetime of each processing element by regenerating aging-aware healthy kernels that respond to the specific health state of GPGPU. We demonstrate our approach on AMD Evergreen architecture where iso-throughput executions of the healthy kernels reduce NBTI-induced voltage threshold shift up to 49% (11%) compared to naïve kernel executions, with (without) architectural support for power-gating. The kernel adaption flow takes average of 13 millisecond on a typical host machine thus making it suitable for practical implementation.

Explore More