Mohammad Arjomand | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Mohammad Arjomand is active.

Explore More

Publication

Featured researches published by Mohammad Arjomand.

international symposium on low power electronics and design | 2011

High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement

Amin Jadidi; Mohammad Arjomand; Hamid Sarbazi-Azad

In this paper, we propose a run-time strategy for managing writes onto last level cache in chip multiprocessors where STT-RAM memory is used as baseline technology. To this end, we assume that each cache set is decomposed into limited SRAM lines and large number of STT-RAM lines. SRAM lines are target of frequently-written data and rarely-written or read-only ones are pushed into STT-RAM. As a novel contribution, a low-overhead, fully-hardware technique is utilized to detect write-intensive data blocks of working set and place them into SRAM lines while the remaining data blocks are candidates to be remapped onto STT-RAM blocks during system operation. Therefore, the achieved cache architecture has large capacity and consumes near zero leakage energy using STT-RAM array; while dynamic write energy, acceptable write latency, and long lifetime is guaranteed via SRAM array. Results of full-system simulation for a quad-core CMP running PARSEC-2 benchmark suit confirm an average of 49 times improvement in cache lifetime and more than 50% reduction in cache power consumption when compared to baseline configurations.

design, automation, and test in europe | 2009

A hybrid packet-circuit switched on-chip network based on SDM

Mehdi Modarressi; Hamid Sarbazi-Azad; Mohammad Arjomand

In this paper, we propose a novel on-chip communication scheme by dividing the resources of a traditional packet-switched network-on-chip between a packet-switched and a circuit-switched sub-network. The former directs packets according to the traditional packet-switching mechanism, while the latter forwards packets over circuits which are directly established between two non-adjacent nodes by bypassing the intermediate routers. A packet may switch between the sub-networks several times to reach its destination. The circuits are set up using a low-latency and low-cost setup-network. The network resources are split between the two sub-networks using Spatial-Division Multiplexing (SDM). The work aims to improve the power and performance metrics of Network-on-Chip (NoC) architectures and benefits from the power and scalability advantage of packet-switched NoCs and superior communication performance of circuit-switching. The evaluation results show a significant reduction in power and latency over a traditional packet-switched NoC.

international conference on computer design | 2011

A morphable phase change memory architecture considering frequent zero values

Mohammad Arjomand; Amin Jadidi; Ali Shafiee; Hamid Sarbazi-Azad

Phase Change Memory (PCM) is emerging as a high-dense and power-efficient choice for future main memory systems. While PCM cell size is marching towards minimum achievable feature size, recent prototypes effectively improve device scalability by storing multiple bits per each cell. Unfortunately, Multi-Level Cell (MLC) PCM devices offer higher access time and energy when compared to Single-Level Cell (SLC) counterparts making it difficult to incorporate MLC in main memory. To address this challenge, we proposes Zero-value-based Morphable PCM, ZM-PCM for short, a novel MLC-PCM main memory architecture which tries incorporating benefits of both MLC and SLC devices within the same structure. ZM-PCM relies on the observation that zero value at various granularities is frequently occurred within main memory transactions when running PARSEC-2 programs. Motivated by this observation, ZM-PCM codes redundant zero MLC cells into limited bits that is storable in the SLC (or alternatively in devices with fewer bits) form with improved latency, energy, and lifetime with no reduction in available main memory capacity. We evaluate microarchitecture design of morphable PCM cell, coding and decoding algorithms and details of related circuits. We also introduce a simple area-efficient caching mechanism for fast cost-efficient access to coding metadata. Our evaluation on a quad-core CMP with 4GB 8-bit MLC PCM main memory shows that ZM-PCM morphs up to 93% (and 50% on average) of all memory cells with lower densities which directly turns in performance, power and lifetime enhancement.

international symposium on computer architecture | 2014

Reducing access latency of MLC PCMs through line striping

Morteza Hoseinzadeh; Mohammad Arjomand; Hamid Sarbazi-Azad

Although phase change memory with multi-bit storage capability (known as MLC PCM) offers a good combination of high bit-density and non-volatility, its performance is severely impacted by the increased read/write latency. Regarding read operation, access latency increases almost linearly with respect to cell density (the number of bits stored in a cell). Since reads are latency critical, they can seriously impact system performance. This paper alleviates the problem of slow reads in the MLC PCM by exploiting a fundamental property of MLC devices: the Most-Significant Bit (MSB) of MLC cells can be read as fast as SLC cells, while reading the Least-Significant Bits (LSBs) is slower. We propose Striped PCM (SPCM), a memory architecture that leverages this property to keep MLC read latency in the order of SLCs. In order to avoid extra writes onto memory cells as a result of striping memory lines, the proposed design uses a pairing write queue to synchronize write-back requests associated with blocks that are paired in striping mode. Our evaluation shows that our design significantly improves the average memory access latency by more than 30% and IPC by up to 25% (10%, on average), with a slight overhead in memory energy (0.7%) in a 4-core CMP model running memory-intensive benchmarks.

international conference on vlsi design | 2010

Voltage-Frequency Planning for Thermal-Aware, Low-Power Design of Regular 3-D NoCs

Mohammad Arjomand; Hamid Sarbazi-Azad

Network-on-Chip combined with Globally Asynchronous Locally Synchronous paradigm is a promising architecture for easy IP integration and utilization with multiple voltage levels. For power reduction, multiple voltage-frequency levels are successfully applied to 2-D NoCs, but never with a generic approach to 3-D counterparts; in which low heat conductivity of insulator layers makes high dense temperature distribution at layers away from heat sink. In this paper, a thermal-aware methodology for regular 3-D NoCs based on multiple voltage levels is proposed. Given an application task graph, this methodology determines an efficient mapping of tasks onto network tiles, considering inherent computation and communication requirements of the tasks and thermal resistance from any silicon layer to the ambient. Then, a heuristic approach is utilized to determine voltage and frequency specifications of all IP cores, such that total power is reduced, dissipated heat is properly conducted to the layers close to the heat sink, and application requirements (in terms of deadline) are satisfied. The experiments confirm a significant saving in total power while performance of the running application is guaranteed.

IEEE Transactions on Very Large Scale Integration Systems | 2016

Sequoia: A High-Endurance NVM-Based Cache Architecture

Mohammad Reza Akbari Jokar; Mohammad Arjomand; Hamid Sarbazi-Azad

Emerging nonvolatile memory technologies, such as spin-transfer torque RAM or resistive RAM, can increase the capacity of the last-level cache (LLC) in a latency and power-efficient manner. These technologies endure 109-1012 writes per cell, making a nonvolatile cache (NV-cache) with a lifetime of dozens of years under ideal working conditions. However, nonuniformity in writes to different cache lines considerably reduces the NV-cache lifetime to a few months. Writes to cache lines can be made uniformly by wear-leveling. A suitable wear-leveling for NV-cache should not incur high storage and performance overheads. We propose a novel, simple, and effective wear-leveling technique with negligible performance overhead of <;0.4% for memory-intensive workloads. Our proposal consists of two mechanisms: 1) a wear-leveling mechanism within each cache set that slightly increases main memory write-back traffic and LLC miss rate and 2) a novel technique to reduce cache interset variation which causes minimum interference with normal cache operation. Using these mechanisms, we show that the lifetime of the NV-cache is boosted up to 13× for different cache configurations.

design automation conference | 2014

An Efficient STT-RAM Last Level Cache Architecture for GPUs

Mohammad Hossein Samavatian; Hamed Abbasitabar; Mohammad Arjomand; Hamid Sarbazi-Azad

In this paper, having investigated the behavior of GPGPU applications, we present an efficient L2 cache architecture for GPUs based on STT-RAM technology. With the increase of processing cores count, larger on-chip memories are required. Due to its high density and low power characteristics, STT-RAM technology can be utilized in GPUs where numerous cores leave a limited area for on-chip memory banks. They have however two important issues, high energy and latency of write operations, that have to be addressed. Low data retention time STT-RAMs can reduce the energy and delay of write operations. However, employing STT-RAMs with low retention time in GPUs requires a thorough investigation on the behavior of GPGPU applications based on which the STT-RAM based L2 cache is architectured. The STT-RAM L2 cache architecture proposed in this paper, can improve IPC by more than 100% (16% on average) while reducing the average consumed power by 20% compared to a conventional L2 cache architecture with equal on-chip area.

dependable systems and networks | 2014

A Reliable 3D MLC PCM Architecture with Resistance Drift Predictor

Majid Jalili; Mohammad Arjomand; Hamid Sarbazi Azad

In this paper, we study the problem of resistance drift in an MLC Phase Change Memory (PCM) and propose a solution to circumvent its thermally-affected accelerated rate in 3D CMPs. Our scheme is based on the observation that instead of alleviating the problem of resistance drift by using large margins or error correction codes, the PCM read circuit can be reconfigured for tolerating most of the resistance drift errors in a dynamic manner. Through detailed characterization of memory access patterns for 22 applications, we propose an efficient mechanism to facilitate such reliable read scheme via tolerating (a) early-cycle resistance drifts by using narrow margins so that considerably saving energy of writes and improving cell endurance, and (b) late-cycle resistance drifts by accurately estimating resistance thresholds that separate states for sensing. Evaluations on a true 3D architecture, consisting of a 4-core CMP and a banked 2-bit PCM memory, show that our proposal provides 106 × lower error rate compared to the state-of-the-art designs of PCMs.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2010

Power-Performance Analysis of Networks-on-Chip With Arbitrary Buffer Allocation Schemes

Mohammad Arjomand; Hamid Sarbazi-Azad

End-to-end delay, throughput, energy consumption, and silicon area are the most important design metrics of networks-on-chip (NoCs). Although several analytical models have been previously proposed for predicting such metrics in NoCs, very few of them consider the effect of message waiting time in the buffers of network routers for predicting overall power consumptions and none of them consider structural heterogeneity of network routers. This paper introduces two inter-related analytical models to compute message latency and power consumption of NoCs with arbitrary topology, buffering structure, and routing algorithm. Buffer allocation scheme defines the buffering space for each individual channel of the NoC that can be homogenous (all channels having similar buffer structures) or heterogeneous (each channel having its own buffer structure). Here, the buffer allocation scheme can be either homogenous or heterogeneous. We assume no bandwidth sharing of virtual channels for a physical channel, and IP cores generate messages following a Poisson distribution. The results obtained from simulation experiments confirm that the proposed models exhibit acceptable accuracy for different network configurations operating under various working conditions. We have shown that basing our analysis on a Poisson traffic model is still useful for scenarios with real application workloads.

international soc design conference | 2008

Performance evaluation of Butterfly on-Chip Network for MPSoCs

Mohammad Arjomand; Hamid Sarbazi-Azad

By Technology improvement, tens or hundreds of IP cores, operating complex functions with different frequencies, are mapped on-chip. This results in heterogeneous multiprocessor system-on-chip (MPSoC). The most MPSoC design challenges are due to infrastructure interconnect. Network-on-chip (NoC) with multiple constraints to be satisfied is a promising solution for these challenges. It has been shown that infrastructure topology, routing and switching schemes have great effects on overall interconnect performance under different synthesis and real life traffic patterns. In this paper, we evaluate Butterfly network with arbitrary extra stages as MPSoC infrastructure. Different routing and switching strategies are used for architectural consideration. Comparative analysis of results with common NoC infrastructures shows that in bandwidth requirement applications, Butterfly with extra stages and wormhole (and sometimes virtual cut through) switching can tolerate traffic, properly. As case studies, design space exploration including different topologies, routing and switching strategies for two video decoders are presented.

Explore More