Publication


Featured research published by Mohsen Imani.


Design, Automation, and Test in Europe | 2016

Resistive configurable associative memory for approximate computing

Mohsen Imani; Abbas Rahimi; Tajana Simunic Rosing

Modern computing machines are increasingly characterized by large-scale parallelism in hardware (such as GPGPUs) and the advent of large-scale, innovative memory blocks. Parallelism enables expanded performance tradeoffs, whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern against pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or on selected pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low-overhead voltage overscaling, and different ReCAM update policies. Experimental results on the AMD Southern Islands GPU for eight applications show that bitline-configurable and row-configurable ReCAM achieve average energy savings of 43.6% and 44.5%, respectively, with an acceptable quality loss of 10%.
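The two configuration modes can be illustrated in software: a bitline-configurable search ignores selected bit positions via a mask, while a row-configurable search restricts matching to a subset of stored rows. A minimal Python sketch of this behavior (function and parameter names are hypothetical, not the paper's implementation):

```python
def recam_search(stored, query, bit_mask=None, row_mask=None):
    """Software model of an approximate associative search.

    stored:   list of integer patterns pre-stored in the memory
    query:    integer input pattern
    bit_mask: 1-bits are compared; 0-bits are approximated away
              (bitline-configurable mode)
    row_mask: indices of rows allowed to participate in the search
              (row-configurable mode)
    Returns the index of the first matching row, or None.
    """
    if bit_mask is None:
        bit_mask = ~0  # compare all bit positions
    rows = range(len(stored)) if row_mask is None else row_mask
    for i in rows:
        if (stored[i] ^ query) & bit_mask == 0:
            return i
    return None

patterns = [0b1010, 0b1100, 0b0110]
# Exact search: 0b1110 matches nothing.
assert recam_search(patterns, 0b1110) is None
# Bitline-configurable: ignoring the two low-order bits, 0b1110 matches 0b1100.
assert recam_search(patterns, 0b1110, bit_mask=0b1100) == 1
```

Masking more bitlines raises the hit rate (and energy savings) at the cost of output quality, which is the knob the paper tunes.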


International Conference on Computer-Aided Design | 2015

CAUSE: Critical Application Usage-Aware Memory System using Non-volatile Memory for Mobile Devices

Yeseong Kim; Mohsen Imani; Shruti Patil; Tajana Simunic Rosing

Mobile devices are severely limited in memory, which affects critical user-experience metrics such as application service time. Emerging non-volatile memory (NVM) technologies such as STT-RAM and PCM are ideal candidates to provide higher memory capacity with negligible energy overhead. However, existing memory management systems overlook mobile users' application usage, which provides crucial cues for improving user experience. In this paper, we propose CAUSE, a novel memory system based on a hybrid DRAM-NVM memory architecture. CAUSE takes explicit account of application usage patterns to distinguish data criticality and identify suitable swap candidates. We also devise an NVM hardware design optimized for the access characteristics of the swapped pages. We evaluate CAUSE on a real Android smartphone and the NVSim simulator using user application usage logs. Our experimental results show that the proposed technique achieves 32% faster launch times for mobile applications while reducing energy cost by 90% and 44% on average over non-optimized STT-RAM and PCM, respectively.


International Symposium on Low Power Electronics and Design | 2016

ACAM: Approximate Computing Based on Adaptive Associative Memory with Online Learning

Mohsen Imani; Yeseong Kim; Abbas Rahimi; Tajana Simunic Rosing

The Internet of Things (IoT) dramatically increases the amount of data to be processed for many applications, including multimedia. Unlike in traditional computing environments, IoT workloads vary significantly over time, so efficient runtime profiling is required to extract highly frequent computations and pre-store them for memory-based computing. In this paper, we propose an approximate computing technique using a low-cost adaptive associative memory, named ACAM, which exploits runtime learning and profiling. To capture the temporal locality of data in real-world applications, our design uses a reinforcement learning algorithm with a least recently used (LRU) strategy to select images to be profiled; the profiler is implemented using an approximate concurrent state machine. The profiling results are then stored in ACAM for computation reuse. Since the selected images represent the observed input dataset, we can avoid redundant computations thanks to the high hit rates of the associative memory. We evaluate ACAM on the recent AMD Southern Islands GPU architecture, and the experimental results show that the proposed design achieves 34.7% energy savings for image processing applications with an acceptable quality of service (i.e., PSNR > 30 dB).
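The profiling-and-reuse loop amounts to memoization with an LRU replacement policy: inputs that recur stay resident in the associative memory and their results are reused rather than recomputed. A hypothetical sketch using Python's standard library (the kernel, cache capacity, and workload are illustrative stand-ins, not the paper's design):

```python
from functools import lru_cache

calls = {"count": 0}  # track how many real computations happen

@lru_cache(maxsize=4)  # small associative memory with LRU eviction
def kernel(x):
    """Stand-in for an expensive per-input computation (e.g. a GPU kernel)."""
    calls["count"] += 1
    return x * x  # hypothetical operation

# A workload with temporal locality: repeated inputs hit the cache.
workload = [1, 2, 1, 3, 1, 2, 2, 1]
results = [kernel(x) for x in workload]

assert results == [1, 4, 1, 9, 1, 4, 4, 1]
assert calls["count"] == 3              # only 3 distinct inputs computed
assert kernel.cache_info().hits == 5    # the other 5 lookups were reused
```

The higher the temporal locality of the input stream, the more computations the associative memory absorbs, which is where the energy savings come from.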


IEEE Transactions on Emerging Topics in Computing | 2018

Approximate Computing Using Multiple-Access Single-Charge Associative Memory

Mohsen Imani; Shruti Patil; Tajana Simunic Rosing

Memory-based computing using associative memory is a promising way to reduce the energy consumption of important classes of streaming applications by avoiding redundant computations. A set of frequent patterns that represent basic functions are pre-stored in a ternary content addressable memory (TCAM) and reused. The primary limitation to using associative memory in modern parallel processors is the large search energy required by TCAMs: all match lines except those of hit rows precharge and discharge on every search operation, resulting in high energy consumption. In this paper, we propose a new Multiple-Access Single-Charge (MASC) TCAM architecture which is capable of searching TCAM contents multiple times with only a single precharge cycle. In contrast to previous designs, the MASC TCAM keeps the match-line voltage of all miss rows high and uses their charge for the next search operation, while only the hit rows discharge. We use periodic refresh to control the accuracy of the search. We also implement a new type of approximate associative memory by setting longer refresh times for MASC TCAMs, which yields search results within 1–2 bit Hamming distance of the exact value. To further decrease the energy consumption of MASC TCAM and reduce its area, we implement MASC with crossbar TCAMs. Our evaluation on the AMD Southern Islands GPU shows that MASC (crossbar MASC) associative memory improves the average floating-point unit energy efficiency by 33.4, 38.1, and 36.7 percent (37.7, 42.6, and 43.1 percent) for exact matching, selective 1-HD, and 2-HD approximation, respectively, while providing acceptable quality of service (PSNR > 30 dB and average relative error < 10 percent). This shows that MASC (crossbar MASC) can achieve 1.77X (1.93X) higher energy savings compared to a state-of-the-art GPGPU implementation that uses voltage overscaling on TCAM.


High-Performance Computer Architecture | 2017

Exploring Hyperdimensional Associative Memory

Mohsen Imani; Abbas Rahimi; Deqian Kong; Tajana Simunic Rosing; Jan M. Rabaey

Brain-inspired hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors as an alternative to computing with numbers. At its core, HD computing is about manipulating and comparing large patterns stored in memory as hypervectors: the input symbols are mapped to a hypervector, and an associative search is performed for reasoning and classification. For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector using a distance metric. Hypervectors with i.i.d. components allow a memory-centric architecture to tolerate a massive number of errors, which eases the combination of various design approaches for boosting energy efficiency and scalability. This paper proposes architectural designs for hyperdimensional associative memory (HAM) to facilitate energy-efficient, fast, and scalable search operations using three widely used design approaches. These HAM designs search for the nearest Hamming distance, scale linearly with the number of dimensions in the hypervectors, and explore a large design space with orders of magnitude higher efficiency. First, we propose a digital CMOS-based HAM (D-HAM) that modularly scales to any dimension. Second, we propose a resistive HAM (R-HAM) that exploits the timing-discharge characteristic of nonvolatile resistive elements to approximately compute Hamming distances at a lower cost. Finally, we combine this resistive characteristic with a current-based search method to design an analog HAM (A-HAM) that is a faster and denser alternative. Our experimental results show that R-HAM and A-HAM improve the energy-delay product by 9.6× and 1347×, respectively, compared to D-HAM while maintaining a moderate accuracy of 94% in language recognition.
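The search step all three HAM designs accelerate — find the learned hypervector nearest to a query in Hamming distance — can be modeled in a few lines of Python. The 8-bit "hypervectors" below are toy values for illustration; real HD computing uses thousands of dimensions:

```python
def hamming(a, b):
    """Hamming distance between two equal-width binary hypervectors."""
    return bin(a ^ b).count("1")

def ham_search(learned, query):
    """Return the index of the learned hypervector closest to the query."""
    return min(range(len(learned)), key=lambda i: hamming(learned[i], query))

# Toy 8-bit "hypervectors" representing three learned classes.
learned = [0b11110000, 0b00001111, 0b10101010]
assert ham_search(learned, 0b11110001) == 0  # distance 1 to class 0
assert ham_search(learned, 0b00101010) == 2  # distance 1 to class 2
```

Because the components are i.i.d., a query corrupted in many positions still usually lands nearest its true class, which is why the approximate (R-HAM/A-HAM) distance computations can be tolerated.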


Design, Automation, and Test in Europe | 2016

MASC: Ultra-low energy multiple-access single-charge TCAM for approximate computing

Mohsen Imani; Shruti Patil; Tajana Simunic Rosing

Memory-based computing using associative memory has emerged as a promising solution to reduce the energy consumption of important classes of streaming applications, such as multimedia, by avoiding redundant computations. In associative memory, a set of frequent patterns that represent basic functions are pre-stored in a ternary content addressable memory (TCAM) and reused. The primary limitation to using associative memory in modern parallel processors is the large search energy required by TCAMs: all match lines except those of hit rows precharge and discharge on every search operation, resulting in high and undesirable energy consumption. In this paper, we propose a new multiple-access single-charge (MASC) TCAM architecture which is capable of searching TCAM contents multiple times with a single precharge cycle. In contrast to previous designs, the MASC TCAM keeps the match-line voltage of all miss rows high and uses their charge for the next search operation, while only the hit rows discharge. We use a periodic refresh scheme to guarantee the accuracy of the search. We also implement a new type of approximate associative memory by setting longer refresh times for MASC TCAMs, which yields search results within 1-2 bit Hamming distance of the exact result. Our evaluation on the AMD Southern Islands GPU shows that MASC associative memory improves the average GPGPU energy efficiency by 36.6%, 40.2%, and 39.4% for exact matching, selective 1-HD, and 2-HD approximation, respectively, with acceptable quality of service (PSNR > 30 dB). These energy savings are 1.8X and 1.6X higher than those of a GPGPU using exact-matching TCAM and approximate TCAM with voltage overscaling, respectively.


International Symposium on Nanoscale Architectures | 2015

Hierarchical design of robust and low data dependent FinFET based SRAM array

Mohsen Imani; Shruti Patil; Tajana Simunic Rosing

This paper proposes a new FinFET-based SRAM cell and a cache architecture that efficiently exploits our SRAM cell for low-power and robust memory design. Our cache architecture uses an invert-coding scheme to encode the input data of a word line by taking the data composition into account. Based on the resulting data distribution, we propose two new asymmetric SRAM cells (AABG and ADWL) utilizing adaptive back-gate feedback that significantly improve cache power consumption and reliability and provide higher performance than state-of-the-art SRAM caches. The results show that the AABG cell is a good candidate for robust, low-power caches, while the ADWL-based cache offers both low power and high performance. The simulations are performed on SPEC CPU 2006 benchmarks with gem5 and HSPICE in 20 nm independent-gate FinFET technology. The results show that the proposed AABG (ADWL)-based cache improves static and dynamic power by at least 13% and 35% (17% and 12%), respectively, compared to other state-of-the-art cells, while guaranteeing 2.7X (1.98X) lower NBTI degradation with less than 1.5% area overhead.
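The invert-coding idea can be sketched in software: if more than half the bits of a word are in the costly state, store the inverted word plus a one-bit flag, so the stored data is always biased toward the cheaper state the asymmetric cells favor. A hypothetical model with 8-bit words and '0' as the preferred state (the paper's actual encoding may differ):

```python
def invert_encode(word, width=8):
    """Encode a word so that at most width//2 bits are 1.

    Returns (stored_word, invert_flag). This models the data bias that
    asymmetric SRAM cells can exploit for power and reliability.
    """
    ones = bin(word).count("1")
    if ones > width // 2:
        return word ^ ((1 << width) - 1), 1  # store the inverted word
    return word, 0

def invert_decode(stored, flag, width=8):
    """Recover the original word from the stored word and flag."""
    return stored ^ (((1 << width) - 1) if flag else 0)

w = 0b11101101                         # six 1s out of eight bits
s, f = invert_encode(w)
assert f == 1 and bin(s).count("1") == 2  # stored form has only two 1s
assert invert_decode(s, f) == w           # encoding is lossless
```

The flag bit is the scheme's overhead; the payoff is that every stored word is guaranteed to be majority-0, making the cell-level asymmetry worthwhile.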


IEEE Transactions on Emerging Topics in Computing | 2016

Resistive CAM Acceleration for Tunable Approximate Computing

Mohsen Imani; Daniel Peroni; Abbas Rahimi; Tajana Simunic Rosing

The Internet of Things significantly increases the amount of data generated, straining the processing capability of current computing systems. Approximate computing is a promising solution that accelerates computation by trading off energy and accuracy. In this paper, we propose a resistive content addressable memory (CAM) accelerator, called RCA, which exploits data locality to perform approximate memory-based computation. RCA stores high-frequency patterns and performs computation inside the CAM without using processing cores. At execution time, RCA searches for an input operand among all pre-stored values in the CAM and returns the row with the nearest distance. To manage accuracy, we use a distance metric which considers the impact of each bit index on computation accuracy. We evaluate an application of the proposed RCA to CPU approximation, where RCA can be used stand-alone or as a hybrid computing unit beside CPU cores for tunable CPU approximation. We evaluate the architecture of the proposed RCA using HSPICE and Multi2Sim, testing our results on an x86 CPU. Our evaluation shows that RCA can accelerate CPU computation by 12.6× and improve energy efficiency by 6.6× compared to a traditional CPU architecture, while providing acceptable quality of service.
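A nearest-distance lookup that weights mismatches by bit significance — an error in a high-order bit costs more than one in a low-order bit — can be modeled as below. This is an illustrative sketch, not the paper's exact metric; with weight 2**i per differing bit, the distance reduces to the integer value of the XOR:

```python
def rca_lookup(table, operand, width=8):
    """Return the pre-stored result whose key is nearest to the operand
    under a bit-significance-weighted mismatch distance."""
    def dist(key):
        diff = key ^ operand
        # Weight a mismatch at bit i by 2**i (most significant bits dominate).
        return sum(1 << i for i in range(width) if (diff >> i) & 1)
    best_key = min(table, key=dist)
    return table[best_key]

# Hypothetical pre-stored squares for a few frequent operands.
table = {4: 16, 8: 64, 12: 144}
assert rca_lookup(table, 5) == 16     # 5 is nearest to the key 4
assert rca_lookup(table, 13) == 144   # 13 is nearest to the key 12
```

Returning the nearest row instead of requiring an exact hit is what makes the CAM a tunable approximate compute unit: the weighting bounds how much a near-miss can perturb the result.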


Design, Automation, and Test in Europe | 2017

LookNN: Neural network with no multiplication

Mohammad Samragh Razlighi; Mohsen Imani; Farinaz Koushanfar; Tajana Simunic Rosing

Neural networks are machine learning models that have been successfully used in many applications. Due to their high computational complexity, deploying such models on embedded devices with severe power and resource constraints is troublesome. Neural networks are, however, inherently approximate and can be simplified. We propose LookNN, a methodology to replace floating-point multiplications with lookup-table searches. First, we devise an algorithmic solution to adapt conventional neural networks to LookNN such that the model's accuracy is minimally affected. We provide experimental results and theoretical analysis demonstrating the applicability of the method. Next, we design enhanced general-purpose processors for searching lookup tables: each processing element of our GPU has access to a small associative memory, enabling it to bypass redundant computations. Our evaluation on the AMD Southern Islands GPU architecture shows that LookNN results in 2.2x energy savings and a 2.5x speedup on four different neural network applications with zero additive error. For the same four applications, if we tolerate an additive error of less than 0.2%, LookNN achieves an average of 3x energy improvement and a 2.6x speedup compared to the traditional GPU architecture.
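Replacing multiplications with lookups requires the operands to be drawn from small sets, so every possible product can be precomputed once. A minimal sketch of the idea with hypothetical quantization levels (the paper's clustering and table organization are not reproduced here):

```python
def build_product_table(weight_levels, activation_levels):
    """Precompute all products of quantized weight x activation pairs."""
    return {(w, a): w * a for w in weight_levels for a in activation_levels}

def looknn_dot(weights, activations, table):
    """Dot product using only table lookups and additions (no multiplies)."""
    return sum(table[(w, a)] for w, a in zip(weights, activations))

# Hypothetical quantization: weights and activations clustered to 4 levels each,
# so the table holds only 16 entries regardless of layer size.
w_levels = [-1.0, -0.5, 0.5, 1.0]
a_levels = [0.0, 0.25, 0.5, 1.0]
table = build_product_table(w_levels, a_levels)

ws = [1.0, -0.5, 0.5]
xs = [0.5, 1.0, 0.25]
assert looknn_dot(ws, xs, table) == 1.0 * 0.5 + (-0.5) * 1.0 + 0.5 * 0.25
```

With few enough levels, the whole table fits in a per-processing-element associative memory, which is what lets the GPU bypass its floating-point multipliers.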


Design Automation Conference | 2017

Ultra-Efficient Processing In-Memory for Data Intensive Applications

Mohsen Imani; Saransh Gupta; Tajana Simunic Rosing

Recent years have witnessed rapid growth in the Internet of Things (IoT). This network of billions of devices generates and exchanges huge amounts of data. Limited cache capacity and memory bandwidth make transferring and processing such data on traditional CPUs and GPUs highly inefficient, in terms of both energy consumption and delay. However, many IoT applications are statistical at heart and can tolerate some inaccuracy in their computation. This enables designers to reduce processing complexity by approximating results to a desired accuracy. In this paper, we propose an ultra-efficient approximate processing in-memory architecture, called APIM, which exploits the analog characteristics of non-volatile memories to support addition and multiplication inside the crossbar memory while storing the data. The proposed design eliminates the overhead involved in transferring data to the processor by virtually bringing the processor inside memory. APIM dynamically configures the precision of computation for each application in order to tune the level of accuracy at runtime. Our experimental evaluation on six general OpenCL applications shows that the proposed design achieves up to 20× performance improvement and a 480× improvement in energy-delay product while ensuring acceptable quality of service. In exact mode, it achieves 28× energy savings and a 4.8× speedup compared to state-of-the-art GPU cores.
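The runtime precision tuning can be modeled as computing on only the top n bits of each operand: fewer bits means cheaper in-memory operations at the cost of accuracy. A software sketch of that trade-off (the bit widths are illustrative, not APIM's configuration):

```python
def truncate(x, total_bits=8, keep_bits=4):
    """Keep only the keep_bits most significant bits of an unsigned operand."""
    drop = total_bits - keep_bits
    return (x >> drop) << drop

def approx_mul(a, b, keep_bits=4, total_bits=8):
    """Multiplication at a configurable precision level."""
    return truncate(a, total_bits, keep_bits) * truncate(b, total_bits, keep_bits)

# Full precision is exact; reduced precision trades accuracy for cost.
assert approx_mul(200, 100, keep_bits=8) == 20000
approx = approx_mul(200, 100, keep_bits=4)
assert approx == 192 * 96                    # 200 -> 192, 100 -> 96 truncated
assert abs(approx - 20000) / 20000 < 0.1     # under 10% relative error here
```

An application that tolerates a given error bound can be run with the smallest keep_bits that stays within it, which is the knob APIM adjusts per application at runtime.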

Collaboration


Frequent co-authors of Mohsen Imani:

Abbas Rahimi (University of California)
Yeseong Kim (University of California)
Saransh Gupta (University of California)
Daniel Peroni (University of California)
Shruti Patil (University of California)
Deqian Kong (University of California)
Pietro Mercati (University of California)
Jan M. Rabaey (University of California)