Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ali Shafiee is active.

Publication


Featured research published by Ali Shafiee.


International Symposium on Computer Architecture | 2016

ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars

Ali Shafiee; Anirban Nag; Naveen Muralimanohar; Rajeev Balasubramonian; John Paul Strachan; Miao Hu; R. Stanley Williams; Vivek Srikumar

A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated to each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
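
To make the in-situ analog dot-product idea concrete, here is a minimal Python sketch, assuming an idealized crossbar with weights stored as conductances and a simple linear ADC; the class name, bit width, and voltage scaling are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (not the ISAAC design itself): a memristor crossbar modeled as a
# conductance matrix G. Applying an input voltage vector v to the rows produces
# per-column currents i = G^T v, i.e. one analog dot product per column, which an
# ADC would then digitize. All names and parameters are illustrative assumptions.
import numpy as np

class CrossbarSketch:
    def __init__(self, weights, adc_bits=8, v_max=1.0):
        self.G = np.asarray(weights, dtype=float)   # weights stored as conductances
        self.adc_bits = adc_bits
        self.v_max = v_max

    def analog_dot(self, inputs):
        """One 'analog' evaluation: the current on each bitline is the dot product
        of the input voltages with that column's conductances (Kirchhoff's law)."""
        v = np.asarray(inputs, dtype=float)
        currents = v @ self.G                        # i_j = sum_k v_k * G[k, j]
        return self._adc(currents)

    def _adc(self, currents):
        """Quantize the analog column currents to adc_bits levels, mimicking the
        ADC step whose overhead the paper's encoding techniques target."""
        levels = 2 ** self.adc_bits - 1
        full_scale = self.v_max * self.G.shape[0] * self.G.max()
        return np.round(np.clip(currents, 0, full_scale) / full_scale * levels)

# Usage: a 4x2 crossbar computing two dot products in one step.
xbar = CrossbarSketch(weights=[[0.2, 0.9], [0.5, 0.1], [0.7, 0.3], [0.4, 0.6]])
print(xbar.analog_dot([1.0, 0.0, 0.5, 0.25]))
```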


High-Performance Computer Architecture | 2014

MemZip: Exploring unconventional benefits from memory compression

Ali Shafiee; Meysam Taassori; Rajeev Balasubramonian; Al Davis

Memory compression has been proposed and deployed in the past to grow the capacity of a memory system and reduce page fault rates. Compression also has secondary benefits: it can reduce energy and bandwidth demands. However, most prior mechanisms have been designed to focus on the capacity metric and few prior works have attempted to explicitly reduce energy or bandwidth. Further, mechanisms that focus on the capacity metric also require complex logic to locate the requested data in memory. In this paper, we design a highly simple compressed memory architecture that does not target the capacity metric. Instead, it focuses on complexity, energy, bandwidth, and reliability. It relies on rank subsetting and a careful placement of compressed data and metadata to achieve these benefits. Further, the space made available via compression is used to boost other metrics - the space can be used to implement stronger error correction codes or energy-efficient data encodings. The best performing MemZip configuration yields a 45% performance improvement and 57% memory energy reduction, compared to an uncompressed non-sub-ranked baseline. Another energy-optimized configuration yields a 29.8% performance improvement and a 79% memory energy reduction, relative to the same baseline.
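
As a rough illustration of how compression can save bandwidth under rank subsetting, the sketch below stores per-line metadata recording how many small bursts a read must fetch; it is not MemZip's actual data layout, zlib stands in for the hardware compressor, and the burst size is an assumption.

```python
# Hedged sketch of the bandwidth-side idea (not MemZip's actual format): a 64-byte
# cache line is compressed, padded to a whole number of small "sub-rank" bursts,
# and metadata records how many bursts a read must fetch. Fewer bursts fetched
# means less bandwidth and energy. All names and sizes are illustrative.
import zlib

BURST_BYTES = 8          # assumed size of one sub-ranked burst
LINE_BYTES = 64          # uncompressed cache-line size

def compress_line(line: bytes):
    assert len(line) == LINE_BYTES
    packed = zlib.compress(line)
    is_compressed = len(packed) < LINE_BYTES
    if not is_compressed:                   # incompressible: store the raw line
        packed = line
    bursts = -(-len(packed) // BURST_BYTES) # ceil division: bursts a read must fetch
    stored = packed.ljust(bursts * BURST_BYTES, b"\0")
    meta = {"bursts": bursts, "size": len(packed), "compressed": is_compressed}
    return stored, meta

def read_line(stored: bytes, meta):
    fetched = stored[: meta["bursts"] * BURST_BYTES]   # fetch only the needed bursts
    payload = fetched[: meta["size"]]
    data = zlib.decompress(payload) if meta["compressed"] else payload
    return data, meta["bursts"]

# Usage: an all-zero line needs far fewer than the 8 bursts a raw line would require.
line = bytes(LINE_BYTES)
stored, meta = compress_line(line)
data, bursts = read_line(stored, meta)
print(bursts, "bursts fetched; line intact:", data == line)
```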


Design, Automation, and Test in Europe | 2012

AFRA: a low cost high performance reliable routing for 3D mesh NoCs

Sara Akbari; Ali Shafiee; Mahmoud Fathy; Reza Berangi

Three-dimensional network-on-chips are suitable communication fabrics for high-density 3D many-core ICs. Such networks have a shorter communication hop count than 2D NoCs and enjoy fast, power-efficient TSV wires in their vertical links. Unfortunately, the fabrication process for TSV connections has not yet matured, which results in poor vertical-link yield. In this work, we address this challenge and introduce AFRA, a deadlock-free routing algorithm for 3D mesh-based NoCs that tolerates faults on vertical links. AFRA is designed to be simple, high performance, and robust. Simplicity is achieved by applying ZXY and XZXY routing in the absence and presence of faults, respectively. Furthermore, as we prove, AFRA is deadlock-free when all faulty vertical links have the same direction. This allows the routing to use virtual channels for performance rather than sacrificing them for deadlock avoidance. Finally, AFRA provides robustness, meaning it maintains connectivity for all pairs of communicating nodes even at high fault rates. AFRA is evaluated through cycle-accurate network simulation and compared with planar adaptive routing. Results reveal that AFRA significantly outperforms planar adaptive routing under both synthetic and real traffic patterns. In addition, the robustness of AFRA is calculated analytically.
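
The ZXY/XZXY switch described above can be illustrated with a short routing sketch; the fault model below (whole vertical pillars marked unusable) and all function names are simplifying assumptions for illustration, not the actual AFRA algorithm.

```python
# Illustrative sketch of the ZXY / XZXY idea from the abstract, not AFRA itself:
# route vertically first (Z), then X, then Y; if the local vertical pillar is
# faulty, first detour along X to the nearest column with a healthy pillar (XZXY).
# Faults are modeled coarsely as (x, y) columns whose vertical link is unusable.

def _walk(path, axis, target):
    """Append unit steps along one axis until that coordinate reaches target."""
    x, y, z = path[-1]
    cur = {"x": x, "y": y, "z": z}
    while cur[axis] != target:
        cur[axis] += 1 if target > cur[axis] else -1
        path.append((cur["x"], cur["y"], cur["z"]))

def route(src, dst, faulty_columns, x_dim):
    path = [src]
    sx, sy, sz = src
    dx, dy, dz = dst
    if sz != dz and (sx, sy) in faulty_columns:
        # XZXY: detour along X to the nearest column with a healthy vertical link.
        healthy = [x for x in range(x_dim) if (x, sy) not in faulty_columns]
        detour_x = min(healthy, key=lambda x: abs(x - sx))
        _walk(path, "x", detour_x)
    if sz != dz:
        _walk(path, "z", dz)          # Z
    _walk(path, "x", dx)              # X
    _walk(path, "y", dy)              # Y
    return path

# Usage: 4x4x2 mesh where the pillar at (1, 0) is faulty, so (1,0,0) -> (3,2,1) detours.
print(route((1, 0, 0), (3, 2, 1), faulty_columns={(1, 0)}, x_dim=4))
```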


International Conference on Computer Design | 2011

A morphable phase change memory architecture considering frequent zero values

Mohammad Arjomand; Amin Jadidi; Ali Shafiee; Hamid Sarbazi-Azad

Phase Change Memory (PCM) is emerging as a high-density and power-efficient choice for future main memory systems. While PCM cell size is marching towards the minimum achievable feature size, recent prototypes effectively improve device scalability by storing multiple bits per cell. Unfortunately, Multi-Level Cell (MLC) PCM devices have higher access time and energy than their Single-Level Cell (SLC) counterparts, making it difficult to incorporate MLC in main memory. To address this challenge, we propose Zero-value-based Morphable PCM (ZM-PCM for short), a novel MLC-PCM main memory architecture that incorporates the benefits of both MLC and SLC devices within the same structure. ZM-PCM relies on the observation that zero values, at various granularities, occur frequently in main memory transactions when running PARSEC-2 programs. Motivated by this observation, ZM-PCM encodes redundant zero MLC cells into a limited number of bits that can be stored in SLC form (or, alternatively, in cells with fewer bits), with improved latency, energy, and lifetime and no reduction in available main memory capacity. We evaluate the microarchitecture design of the morphable PCM cell, the coding and decoding algorithms, and the details of the related circuits. We also introduce a simple, area-efficient caching mechanism for fast, cost-efficient access to coding metadata. Our evaluation on a quad-core CMP with 4GB of 8-bit MLC PCM main memory shows that ZM-PCM morphs up to 93% (and 50% on average) of all memory cells to lower densities, which directly translates into performance, power, and lifetime improvements.
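
A much-simplified sketch of the zero-value coding idea follows; the word granularity, bitmap format, and "fits in SLC" threshold are assumptions chosen for illustration, not ZM-PCM's actual encoding.

```python
# Hedged, simplified sketch of the zero-value observation the abstract builds on
# (not the actual ZM-PCM coding): within a memory line, record which words are
# zero in a small bitmap and keep only the non-zero words. If the encoded form is
# small enough, it could be stored in faster, lower-energy SLC-style cells.
WORD_BYTES = 4
LINE_BYTES = 64

def encode(line: bytes):
    words = [line[i:i + WORD_BYTES] for i in range(0, LINE_BYTES, WORD_BYTES)]
    bitmap = 0
    payload = b""
    for idx, w in enumerate(words):
        if w != b"\x00" * WORD_BYTES:
            bitmap |= 1 << idx          # mark this word as non-zero
            payload += w
    encoded_bytes = 2 + len(payload)    # a 16-word bitmap fits in 2 bytes
    slc_storable = encoded_bytes <= LINE_BYTES // 2   # assumed SLC capacity
    return bitmap, payload, slc_storable

def decode(bitmap, payload):
    out, pos = b"", 0
    for idx in range(LINE_BYTES // WORD_BYTES):
        if bitmap & (1 << idx):
            out += payload[pos:pos + WORD_BYTES]
            pos += WORD_BYTES
        else:
            out += b"\x00" * WORD_BYTES
    return out

# Usage: a mostly-zero line morphs into a compact, SLC-storable form.
line = b"\x00" * 40 + b"\xde\xad\xbe\xef" + b"\x00" * 20
bitmap, payload, slc = encode(line)
print(f"non-zero words: {bin(bitmap).count('1')}, SLC-storable: {slc}")
assert decode(bitmap, payload) == line
```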


International Symposium on Microarchitecture | 2015

Avoiding information leakage in the memory controller with fixed service policies

Ali Shafiee; Akhila Gundu; Manjunath Shevgoor; Rajeev Balasubramonian; Mohit Tiwari

Trusted applications frequently execute in tandem with untrusted applications on personal devices and in cloud environments. Since these co-scheduled applications share hardware resources, the latencies encountered by the untrusted application betray information about whether the trusted applications are accessing shared resources or not. Prior studies have shown that such information leaks can be used by the untrusted application to decipher keys or launch covert-channel attacks. Prior work has also proposed techniques to eliminate information leakage in various shared resources. The best known solution to eliminate information leakage in the memory system incurs high performance penalties. This work develops a comprehensive approach to eliminate timing channels in the memory controller that has two key elements: (i) We shape the memory access behavior of each thread so that it has an unchanging memory access pattern. (ii) We show how efficient memory access pipelines can be constructed to process the resulting memory accesses without introducing any resource conflicts. We mathematically show that the proposed system yields zero information leakage. We then show that various page mapping policies can impact the throughput of our secure memory system. We also introduce techniques to re-order requests from different threads to boost performance without leaking information. Our best solution offers throughput that is 27% lower than that of an optimized non-secure baseline, and that is 69% higher than the best known competing scheme.
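
The fixed-service principle, giving each security domain a data-independent schedule and padding idle slots with dummy accesses, can be sketched as follows; the slot length and queue model are illustrative assumptions rather than the paper's pipelined controller design.

```python
# Minimal sketch of the general "fixed service" idea the abstract describes: the
# controller gives each security domain a turn on a fixed, public schedule and
# issues a dummy access when that domain has nothing pending, so the timing a
# co-runner can observe never depends on other domains' behavior. All parameters
# are illustrative assumptions.
from collections import deque

class FixedServiceController:
    def __init__(self, domains, turn_cycles=4):
        self.queues = {d: deque() for d in domains}
        self.order = list(domains)          # fixed, public round-robin order
        self.turn_cycles = turn_cycles      # fixed slot length per domain
        self.cycle = 0

    def enqueue(self, domain, request):
        self.queues[domain].append(request)

    def tick(self):
        """Advance one cycle; issue at most one request on slot boundaries."""
        issued = None
        if self.cycle % self.turn_cycles == 0:
            slot = (self.cycle // self.turn_cycles) % len(self.order)
            domain = self.order[slot]
            q = self.queues[domain]
            # Issue the real request if one exists, otherwise a dummy access,
            # so the observable schedule looks identical either way.
            issued = (domain, q.popleft() if q else "DUMMY")
        self.cycle += 1
        return issued

# Usage: domain B is served at the same cycles whether or not A is busy.
mc = FixedServiceController(["A", "B"])
mc.enqueue("B", "rd 0x100")
trace = [mc.tick() for _ in range(12)]
print([t for t in trace if t])
```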


International Conference on Computer-Aided Design | 2011

Application-aware deadlock-free oblivious routing based on extended turn-model

Ali Shafiee; Mahdy Zolghadr; Mohammad Arjomand; Hamid Sarbazi-Azad

Programmable hardware is gaining popularity as it can keep pace with growing performance demands under the tight power budgets, design and test costs, and serious reliability concerns of future multiprocessor embedded systems. Compatible with this trend, the Network-on-Chip, as a potential bottleneck of future multi-cores, should also support programmability. Here, we address this issue in the design and implementation of routing algorithms for two-dimensional meshes. To this end, we allocate paths based on the input traffic pattern while customizing routing restrictions for deadlock freedom. To achieve this, we propose the extended turn model (ETM), a novel parametric deadlock-free routing scheme for 2D meshes that generalizes prior turn-based routing methods (e.g., odd-even) with a greater degree of freedom. This model facilitates a Mixed-Integer Linear Programming (MILP) formulation that treats channel-dependency turns as decision variables and decides both path allocation and routing restrictions. We solve this problem with a genetic algorithm and evaluate it using simulation experiments. Results reveal that application-aware ETM-based path allocation outperforms prior turn-based approaches under synthetic and real traffic loads.
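
Since ETM reasons about turn restrictions and deadlock freedom, the sketch below shows the underlying turn-model check, namely whether a set of forbidden turns leaves the channel dependency graph acyclic, on a small 2D mesh; the example west-first restriction set is standard turn-model material used for illustration, not ETM itself.

```python
# Hedged sketch of the turn-model machinery the abstract builds on, not ETM:
# routing restrictions are a set of forbidden (incoming, outgoing) turns, and a
# restriction set is deadlock-free if the resulting channel dependency graph is
# acyclic. Mesh size and the west-first example are illustrative assumptions.
from itertools import product

DIRS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def channels(width, height):
    """All directed links of a width x height 2D mesh."""
    for (x, y), (d, (dx, dy)) in product(product(range(width), range(height)), DIRS.items()):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield ((x, y), (nx, ny), d)

def dependency_graph(width, height, forbidden_turns):
    chans = list(channels(width, height))
    deps = {c: [] for c in chans}
    for c1 in chans:
        for c2 in chans:
            # A packet may continue from c1 into c2 if the turn is allowed and
            # it is not a 180-degree U-turn back to the source node.
            if c1[1] == c2[0] and (c1[2], c2[2]) not in forbidden_turns and c2[1] != c1[0]:
                deps[c1].append(c2)
    return deps

def has_cycle(deps):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps[c]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True
        color[c] = BLACK
        return False
    return any(color[c] == WHITE and dfs(c) for c in deps)

# Usage: west-first restrictions (no turns into West) are deadlock-free,
# while allowing every turn is not.
west_first = {("N", "W"), ("S", "W")}
print("west-first cyclic:", has_cycle(dependency_graph(3, 3, west_first)))
print("all turns cyclic:", has_cycle(dependency_graph(3, 3, set())))
```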


Hardware and Architectural Support for Security and Privacy | 2014

Memory bandwidth reservation in the cloud to avoid information leakage in the memory controller

Akhila Gundu; Gita Sreekumar; Ali Shafiee; Seth H. Pugsley; Hardik Jain; Rajeev Balasubramonian; Mohit Tiwari

Multiple virtual machines (VMs) are typically co-scheduled on cloud servers. Each VM experiences different latencies when accessing shared resources, based on contention from other VMs. This introduces timing channels between VMs that can be exploited to launch attacks by an untrusted VM. This paper focuses on trying to eliminate the timing channel in the shared memory system. Unlike prior work that implements temporal partitioning, this paper proposes and evaluates bandwidth reservation. We show that while temporal partitioning can degrade performance by 61% in an 8-core platform, bandwidth reservation only degrades performance by under 1% on average.
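
A credit-style sketch of per-VM bandwidth reservation is shown below; the epoch length, credit counts, and scan order are illustrative assumptions, not the mechanism evaluated in the paper.

```python
# Hedged sketch of per-VM bandwidth reservation (not the paper's design): each VM
# is granted a fixed number of request credits per epoch, independent of what
# other VMs do, so the service a VM receives, and hence the timing it can observe,
# is set by its own reservation rather than by co-runners. Parameters are
# illustrative assumptions.
from collections import deque

class BandwidthReservation:
    def __init__(self, reservations, epoch_cycles=100):
        self.reservations = dict(reservations)      # VM -> requests allowed per epoch
        self.credits = dict(reservations)
        self.queues = {vm: deque() for vm in reservations}
        self.epoch_cycles = epoch_cycles
        self.cycle = 0

    def enqueue(self, vm, req):
        self.queues[vm].append(req)

    def tick(self):
        """One cycle: refill credits on epoch boundaries, then serve at most one
        request from a VM that still has credit, scanning VMs in a fixed order."""
        if self.cycle % self.epoch_cycles == 0:
            self.credits = dict(self.reservations)
        self.cycle += 1
        for vm, q in self.queues.items():
            if q and self.credits[vm] > 0:
                self.credits[vm] -= 1
                return vm, q.popleft()
        return None

# Usage: VM "A" cannot consume more than its reservation, leaving "B"'s share intact.
mc = BandwidthReservation({"A": 2, "B": 2}, epoch_cycles=10)
for i in range(6):
    mc.enqueue("A", f"rd A{i}")
mc.enqueue("B", "rd B0")
served = [mc.tick() for _ in range(10)]
print([s for s in served if s])
```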


International Symposium on Computer Architecture | 2010

Using partial tag comparison in low-power snoop-based chip multiprocessors

Ali Shafiee; Narges Shahidi; Amirali Baniasadi

In this work we introduce power optimizations relying on partial tag comparison (PTC) in snoop-based chip multiprocessors. Our optimizations rely on the observation that detecting tag mismatches in a snoop-based chip multiprocessor does not require aggressively processing the entire tag. In fact, a high percentage of cache mismatches can be detected by utilizing a small but highly informative subset of the tag bits. Based on this, we introduce a source-based snoop filtering mechanism referred to as S-PTC. In S-PTC, possible remote tag mismatches are detected prior to sending the request. We reduce power as S-PTC prevents sending unnecessary snoops and avoids unessential tag lookups at the end-points. Furthermore, S-PTC improves performance as a result of early cache-miss detection. S-PTC improves average performance by 2.9% to 3.5% across different configurations for the SPLASH-2 benchmarks used in this study. Our solutions reduce snoop request bandwidth by 78.5% to 81.9% and average tag-array dynamic power by about 52%.
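
The partial-tag filtering idea can be sketched in a few lines; the number of partial bits and the per-core filter organization are assumptions chosen for illustration, not the S-PTC hardware design.

```python
# Hedged sketch of partial tag comparison (not the S-PTC hardware): before
# broadcasting a snoop, the requester checks a few informative tag bits against a
# per-core filter of partial tags; if none match, the block cannot be in that
# remote cache and the snoop is skipped. Parameters are illustrative assumptions.
PARTIAL_BITS = 6

def partial(tag):
    return tag & ((1 << PARTIAL_BITS) - 1)     # the informative slice of the tag

class PartialTagFilter:
    """Per remote core: the set of partial tags currently cached there."""
    def __init__(self):
        self.partials = set()

    def insert(self, tag):
        self.partials.add(partial(tag))

    def may_hold(self, tag):
        # A miss on the partial bits proves a full-tag mismatch (no false negatives);
        # a match may still be a false positive, so the full snoop proceeds.
        return partial(tag) in self.partials

def snoop_targets(tag, filters):
    return [core for core, f in filters.items() if f.may_hold(tag)]

# Usage: only core 1, whose filter matches the partial tag, is actually snooped.
filters = {0: PartialTagFilter(), 1: PartialTagFilter(), 2: PartialTagFilter()}
filters[1].insert(0xABCD)
filters[2].insert(0x1111)
print(snoop_targets(0xABCD, filters))   # -> [1]
```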


International Conference on Computer Design | 2016

Enabling technologies for memory compression: Metadata, mapping, and prediction

Arjun Deb; Paolo Faraboschi; Ali Shafiee; Naveen Muralimanohar; Rajeev Balasubramonian; Robert Schreiber

Future systems dealing with big-data workloads will be severely constrained by the high performance and energy penalty imposed by data movement. This penalty can be reduced by storing datasets in DRAM or NVM main memory in compressed formats. Prior compressed memory systems have required significant changes to the operating system, thus limiting commercial viability. The first contribution of this paper is to integrate compression metadata with ECC metadata so that the compressed memory system can be implemented entirely in hardware with no OS involvement. We show that in such a system, read operations are unable to exploit the benefits of compression because the compressibility of the block is not known beforehand. To address this problem, we introduce a compressibility predictor that yields an accuracy of 97%. We also introduce a new data mapping policy that is able to maximize read/write parallelism and NVM endurance, when dealing with compressed blocks. Combined, our proposals are able to eliminate OS involvement and improve performance by 7% (DRAM) and 8% (NVM), and system energy by 12% (DRAM) and 14% (NVM), relative to an uncompressed memory system.
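
The abstract reports the predictor's accuracy but not its structure, so the sketch below is purely speculative: a small table of per-page saturating counters, in the spirit of conventional hardware predictors, trained on observed compression outcomes and consulted before each read.

```python
# Speculative sketch of what a compressibility predictor for reads could look
# like (the paper's 97%-accurate design is not described in the abstract): a
# small table of per-page 2-bit saturating counters steers how much data the
# first access fetches. Table size and indexing are assumptions.
TABLE_SIZE = 1024
PAGE_SHIFT = 12

class CompressibilityPredictor:
    def __init__(self):
        self.counters = [2] * TABLE_SIZE        # 2-bit counters, weakly "compressed"

    def _index(self, addr):
        return (addr >> PAGE_SHIFT) % TABLE_SIZE

    def predict_compressed(self, addr):
        """Fetch the short (compressed) layout when the counter leans that way."""
        return self.counters[self._index(addr)] >= 2

    def train(self, addr, was_compressed):
        """Update the counter with the outcome observed after the block is read."""
        i = self._index(addr)
        delta = 1 if was_compressed else -1
        self.counters[i] = max(0, min(3, self.counters[i] + delta))

# Usage: after a few incompressible blocks on a page, reads stop assuming compression.
pred = CompressibilityPredictor()
page = 0x7F000
for _ in range(3):
    pred.train(page, was_compressed=False)
print(pred.predict_compressed(page))   # -> False
```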


International Conference on Computer Design | 2010

Helia: Heterogeneous Interconnect for Low Resolution Cache Access in snoop-based chip multiprocessors

Ali Shafiee; Narges Shahidi; Amirali Baniasadi

In this work we introduce Helia, a Heterogeneous Interconnect for Low Resolution Cache Access. Helia improves energy efficiency in snoop-based chip multiprocessors by eliminating unnecessary activity in both the interconnect and the cache. This is achieved by coupling innovative snoop filtering mechanisms with wire management techniques. Our optimizations rely on the observation that a high percentage of cache mismatches can be detected by utilizing a small but highly informative subset of the tag bits. Helia relies on the snoop controller to detect possible remote tag mismatches prior to the tag array lookup. Power is reduced because a) our wire management techniques permit slow transmission of a subset of tag bits while tag mismatches are being detected, and b) we avoid cache accesses for mismatches detected at the snoop controller. Our evaluation shows that Helia reduces power in the interconnect (dynamic: 64% to 75%, static: 45% to 50%) and the cache tag array (dynamic: 57% to 58%, static: 80%) while improving average performance by up to 4.4%.
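
To show how the slow-wire/partial-tag split described above could trade latency for energy, here is a toy accounting sketch; the latency and energy numbers and the two wire classes are invented for illustration and are not Helia's measured circuits.

```python
# Hedged sketch of Helia's high-level split as described in the abstract (not its
# actual circuits): the informative subset of tag bits travels on slow, low-energy
# wires and is checked by the snoop controller first; only on a partial match does
# the cache spend a tag-array lookup on the rest of the tag. All numbers are made
# up to show the accounting, not measurements.
PARTIAL_BITS = 6
SLOW_WIRE = {"latency": 4, "energy": 1.0}     # illustrative units
FAST_WIRE = {"latency": 1, "energy": 4.0}
TAG_LOOKUP_ENERGY = 10.0

def snoop(tag, remote_partial_tags):
    """Return (latency, energy, tag_array_accessed) for one remote snoop."""
    latency = SLOW_WIRE["latency"]            # partial bits sent on slow wires
    energy = SLOW_WIRE["energy"]
    if (tag & ((1 << PARTIAL_BITS) - 1)) not in remote_partial_tags:
        return latency, energy, False         # mismatch filtered at the controller
    # Partial match: the remaining tag bits (fast wires) plus a tag-array lookup.
    latency = max(latency, FAST_WIRE["latency"])
    energy += FAST_WIRE["energy"] + TAG_LOOKUP_ENERGY
    return latency, energy, True

# Usage: a filtered snoop costs a fraction of the energy of a full lookup.
remote = {0xABCD & 0x3F}
print(snoop(0x1234, remote))   # mismatch -> cheap
print(snoop(0xABCD, remote))   # match -> full tag-array access
```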

Collaboration


Dive into Ali Shafiee's collaborations.

Top Co-Authors

Mohit Tiwari

University of Texas at Austin


Mohammad Arjomand

Pennsylvania State University


Amin Jadidi

Pennsylvania State University
