Ameer M. S. Abdelhadi
University of British Columbia
Publications
Featured research published by Ameer M. S. Abdelhadi.
Great Lakes Symposium on VLSI | 2010
Ameer M. S. Abdelhadi; Ran Ginosar; Avinoam Kolodny; Eby G. Friedman
Clock skew variations adversely affect timing margins, limiting performance, reducing yield, and possibly leading to functional faults. Non-tree clock distribution networks, such as meshes and crosslinks, are employed to reduce skew and to mitigate skew variations. However, these networks incur an increase in dissipated power while consuming significant metal resources. Several methods have been proposed to trade off power and wires to reduce skew. In this paper, an efficient algorithm is presented that reduces skew variations rather than skew itself and prioritizes critical timing paths, since these paths are more sensitive to skew variations. The algorithm has been implemented for a standard 65 nm cell library using standard EDA tools and has been tested on several benchmark circuits. Compared to other methods, experimental results show a 37% average reduction in metal consumption and a 39% average reduction in power dissipation, while the maximum skew increases only insignificantly.
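As a brief aside (standard static-timing relations, not equations taken from the paper), the sensitivity of critical paths can be seen directly from the setup and hold constraints between a launching and a capturing flop, with skew s, clock-to-output delay t_cq, and combinational delays d_min/d_max:
\[ T \;\ge\; t_{cq} + d_{\max} + t_{su} - s, \qquad t_{cq,\min} + d_{\min} \;\ge\; t_{hold} + s, \qquad s = t_{capture} - t_{launch}. \]
\[ \text{With skew uncertainty } \pm\Delta:\quad T \;\ge\; t_{cq} + d_{\max} + t_{su} - (s - \Delta), \qquad t_{cq,\min} + d_{\min} \;\ge\; t_{hold} + (s + \Delta). \]
The variation Delta therefore subtracts directly from both the setup and the hold margin, so paths whose nominal slack is already small, i.e. critical paths, are the first to be affected.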
Field Programmable Gate Arrays | 2014
Ameer M. S. Abdelhadi; Guy Lemieux
Multi-ported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs, and other processing systems often rely upon multi-ported memories for parallel access and hence higher performance. Although memories with a large number of read and write ports are important, their high implementation cost means they are used sparingly in designs. As a result, FPGA vendors only provide dual-ported block RAMs to handle the majority of usage patterns. In this paper, a novel and modular approach is proposed to construct multi-ported memories out of basic dual-ported RAM blocks. Like other multi-ported RAM designs, each write port uses a different RAM bank, and each read port uses bank replication. The main contribution of this work is an optimization that merges the previous live-value-table (LVT) and XOR approaches into a common design that uses a generalized, simpler structure we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made; the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach uses wider memories to accommodate the XOR-ed data and suffers from lower clock speeds. Two specific I-LVT implementations are proposed and evaluated: binary and one-hot coding. The I-LVT approach is especially suitable for larger multi-ported RAMs because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer block RAMs than earlier approaches: for several configurations, the suggested method reduces block RAM usage by over 44% and improves clock speed by over 76%. To assist others, we are releasing our fully parameterized Verilog implementation as an open source hardware library. The library has been extensively tested using ModelSim and Altera's Quartus tools.
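For readers unfamiliar with the banking terminology, the behavioral Verilog sketch below shows a plain register-based LVT organization for a 2-write/2-read RAM built from one bank per write port. Module and signal names are illustrative; this is a baseline sketch of the LVT idea that the I-LVT improves upon, not the released library code.

module lvt_mpram_2w2r #(parameter DW = 32, AW = 10)(
  input  wire          clk,
  input  wire          we0, we1,
  input  wire [AW-1:0] waddr0, waddr1,
  input  wire [DW-1:0] wdata0, wdata1,
  input  wire [AW-1:0] raddr0, raddr1,
  output reg  [DW-1:0] rdata0, rdata1
);
  // One bank per write port; behaviorally each bank is read by both read ports.
  // In an actual FPGA mapping each bank would be replicated once per read port.
  reg [DW-1:0] bank0 [0:(1<<AW)-1];  // data written by write port 0
  reg [DW-1:0] bank1 [0:(1<<AW)-1];  // data written by write port 1
  // Live-value table: per address, which bank holds the most recent write.
  // Register-based here; the I-LVT of the paper moves this table into SRAM.
  reg lvt [0:(1<<AW)-1];

  always @(posedge clk) begin
    // Simultaneous writes to the same address are assumed illegal, as usual.
    if (we0) begin bank0[waddr0] <= wdata0; lvt[waddr0] <= 1'b0; end
    if (we1) begin bank1[waddr1] <= wdata1; lvt[waddr1] <= 1'b1; end
    // Registered reads: each read port selects the bank indicated by the LVT.
    rdata0 <= lvt[raddr0] ? bank1[raddr0] : bank0[raddr0];
    rdata1 <= lvt[raddr1] ? bank1[raddr1] : bank0[raddr1];
  end
endmodule

The register-based lvt array above, which needs one write port per RAM write port, is exactly the area-intensive structure the abstract refers to.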
Field-Programmable Technology | 2012
Alexander Brant; Ameer M. S. Abdelhadi; Aaron Severance; Guy Lemieux
FPGAs are increasingly being used to implement many new applications, including pipelined processor designs. Designers often employ memories to communicate and pass data between these pipeline stages. However, one-cycle communication between sender and receiver is often required. To implement this read-immediately-after-write functionality, bypass registers are needed by most FPGA memory blocks. Read and write latencies to these memories and the bypass can limit clock frequencies or require extra resources to further pipeline the bypass. Instead of further pipelining the bypass, this paper applies clock skew scheduling to the memory write and read ports of a simple bypass circuit. We show that clock skew scheduling provides an improved Fmax without requiring the area overhead of a pipelined bypass. Many configurations of pipelined memory systems are implemented, and their speed and area are compared to our design. Memory clock skew scheduling yields the best Fmax of all techniques that preserve functionality, an improvement of 56% over the baseline clock speed and 14% over the best conventional design. Furthermore, the suggested technique consumes 46% fewer resources than the next best performing technique.
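For context, the kind of read-during-write bypass circuit being clock-skew-scheduled here typically looks like the behavioral Verilog sketch below (illustrative names; the paper's contribution is skewing the write and read port clocks of such a circuit, not the bypass itself):

module bypassed_sdp_ram #(parameter DW = 32, AW = 9)(
  input  wire          clk,
  input  wire          we,
  input  wire [AW-1:0] waddr, raddr,
  input  wire [DW-1:0] wdata,
  output wire [DW-1:0] rdata
);
  reg [DW-1:0] mem [0:(1<<AW)-1];  // maps to a dual-ported block RAM
  reg [DW-1:0] ram_q;              // registered RAM read data (one-cycle latency)
  reg [DW-1:0] wdata_q;            // write data captured for forwarding
  reg          fwd;                // read-during-write detected last cycle

  always @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    ram_q   <= mem[raddr];
    wdata_q <= wdata;
    fwd     <= we && (waddr == raddr);
  end

  // Bypass mux: present the freshly written word when the read hit the write address.
  assign rdata = fwd ? wdata_q : ram_q;
endmodule

The final bypass mux sits on the read path after the RAM's registered output, which is why it tends to limit Fmax unless it is pipelined or, as proposed here, hidden by skewing the port clocks.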
Field-Programmable Custom Computing Machines | 2015
Ameer M. S. Abdelhadi; Guy Lemieux
Binary Content Addressable Memories (BCAMs), also known as associative memories, are hardware-based search engines. BCAMs employ a massively parallel exhaustive search of the entire memory space and are capable of matching specific data within a single cycle. Networking, memory management, pattern matching, data compression, DSP, and other applications utilize CAMs as single-cycle associative search accelerators. Due to the increasing amount of processed information, modern BCAM applications demand a deep search space. However, traditional BCAM approaches in FPGAs suffer from storage inefficiency. In this paper, a novel, efficient, and modular technique for constructing BCAMs out of standard SRAM blocks in FPGAs is proposed. Hierarchical search is employed to achieve high storage efficiency. Previous hierarchical search approaches cannot be cascaded since they provide only a single matching address; this incurs an exponential increase in RAM consumption as the pattern width increases. Our approach, however, efficiently regenerates a match indicator for every single address by storing indirect indices for the address match indicators. Hence, the proposed method can be cascaded, and the exponential growth is reduced to linear growth. Our method exhibits high storage efficiency and is capable of implementing BCAMs up to 9 times wider than other approaches. A fully parameterized Verilog implementation is being released as an open source library. The library has been extensively tested using Altera's Quartus and ModelSim.
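As background for the storage-inefficiency claim, the behavioral Verilog sketch below shows the brute-force SRAM-based BCAM organization that hierarchical approaches improve upon: a match-vector RAM indexed directly by the pattern, whose size grows exponentially with the pattern width. Names are illustrative, and the write path is shown behaviorally (a real BRAM mapping needs a read-modify-write sequence or extra ports).

module bruteforce_bcam #(parameter PW = 9, DEPTH = 32)(
  input  wire                     clk,
  input  wire                     we,
  input  wire [$clog2(DEPTH)-1:0] waddr,      // entry being (re)written
  input  wire [PW-1:0]            wpattern,   // new pattern for that entry
  input  wire [PW-1:0]            mpattern,   // pattern to match against
  output reg  [DEPTH-1:0]         match       // one match bit per entry
);
  // Match-vector RAM: for every possible pattern value, which entries hold it.
  // Size is DEPTH * 2**PW bits, i.e. exponential in the pattern width.
  reg [DEPTH-1:0] mvec   [0:(1<<PW)-1];
  // Shadow copy of the stored patterns, used to clear the old bit on update.
  reg [PW-1:0]    stored [0:DEPTH-1];

  always @(posedge clk) begin
    if (we) begin
      mvec[stored[waddr]][waddr] <= 1'b0;  // invalidate the entry's old pattern
      mvec[wpattern][waddr]      <= 1'b1;  // mark the entry's new pattern
      stored[waddr]              <= wpattern;
    end
    match <= mvec[mpattern];               // single-cycle match lookup
  end
endmodule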
ACM Transactions on Reconfigurable Technology and Systems | 2016
Ameer M. S. Abdelhadi; Guy Lemieux
Multiported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs, and other processing systems often rely upon multiported memories for parallel access. Although memories with a large number of read and write ports are important, their high implementation cost means that they are used sparingly. As a result, FPGA vendors only provide dual-ported block RAMs (BRAMs) to handle the majority of usage patterns. Furthermore, recent attempts to create FPGA-based multiported memories suffer from low storage utilization. Whereas most approaches provide simple unidirectional ports with a fixed read or write function, others propose true bidirectional ports where each port dynamically switches between read and write. True RAM ports are useful for systems with transceivers and provide high RAM flexibility; however, this flexibility incurs high BRAM consumption. In this article, a novel, modular, BRAM-based switched multiported RAM architecture is proposed. In addition to unidirectional ports with fixed read/write functions, this switched architecture allows a group of write ports to switch dynamically with another group of read ports, hence altering the number of active ports. The proposed switched-ports architecture is less flexible than a true multiported RAM where each port is switched individually. Nevertheless, switched memories can dramatically reduce BRAM consumption compared to true ports for systems with alternating port requirements. The previous live-value-table (LVT) and XOR approaches are merged and optimized into a generalized and modular structure that we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made; the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, whereas the XOR approach suffers from excessive storage overhead since wider memories are required to accommodate the XOR-ed data. Two specific I-LVT implementations are proposed and evaluated: binary and thermometer coding. The I-LVT approach is especially suitable for deep memories because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer BRAMs than earlier approaches: for several configurations, BRAM usage is reduced by more than 44% and clock speed is improved by more than 76%. The I-LVT can be used with the fixed-ports, true-ports, or proposed switched-ports architectures. Formal proofs for the suggested methods, a resource consumption analysis, usage guidelines, and an analytic comparison to other methods are provided. A fully parameterized Verilog implementation is released as an open source library. The library has been extensively tested using Altera's EDA tools.
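To complement the description above, here is a behavioral Verilog sketch of the XOR banking idea that the I-LVT generalizes, for two write ports and one read port. Names are illustrative; in a real BRAM mapping each write port also needs a read path into the other banks, which is exactly the extra width and storage overhead the abstract refers to.

module xor_mpram_2w1r #(parameter DW = 32, AW = 10)(
  input  wire          clk,
  input  wire          we0, we1,
  input  wire [AW-1:0] waddr0, waddr1,
  input  wire [DW-1:0] wdata0, wdata1,
  input  wire [AW-1:0] raddr,
  output reg  [DW-1:0] rdata
);
  // One bank per write port; banks are assumed to power up as zero (typical for BRAM).
  reg [DW-1:0] bank0 [0:(1<<AW)-1];
  reg [DW-1:0] bank1 [0:(1<<AW)-1];

  // Invariant: bank0[a] ^ bank1[a] equals the most recently written value at address a.
  always @(posedge clk) begin
    // Each write stores its data XOR-ed with the other bank's current word, so the
    // XOR of all banks reproduces the new value. Simultaneous writes to the same
    // address are assumed illegal, as usual for multiported RAMs.
    if (we0) bank0[waddr0] <= wdata0 ^ bank1[waddr0];
    if (we1) bank1[waddr1] <= wdata1 ^ bank0[waddr1];
    rdata <= bank0[raddr] ^ bank1[raddr];  // registered read
  end
endmodule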
Field-Programmable Technology | 2014
Ameer M. S. Abdelhadi; Guy Lemieux
Binary Content Addressable Memories (BCAMs) are massively parallel search engines capable of searching the entire memory space in a single clock cycle. BCAMs are used in a wide range of applications, such as memory management, networks, data compression, DSP, and databases. Due to the increasing amount of processed information, modern BCAM applications demand a deep search space. However, traditional BCAM approaches in FPGAs suffer from storage inefficiency. In this paper, a novel and efficient technique for constructing deep and narrow BCAMs out of standard SRAM blocks in FPGAs is proposed. This technique is most efficient for deep and narrow CAMs since BRAM consumption is exponential in the pattern width. Using Altera's Stratix V devices, traditional methods achieve up to a 64K-entry BCAM, while the proposed technique achieves up to 4M entries. For the 64K-entry test case, traditional methods consume 43 times more ALMs and achieve only one-third of the Fmax. A fully parameterized Verilog implementation is available. This implementation has been extensively tested using Altera's tools.
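A rough way to see the exponential dependence (a generic estimate, not figures from the paper): a match-vector organization stores one match bit per entry for every possible pattern value, so for depth D and pattern width P,
\[ \text{RAM bits} \;\approx\; D \cdot 2^{P}, \]
meaning that, for example, widening the pattern from 9 to 18 bits multiplies the RAM requirement by \(2^{9} = 512\) at a fixed depth. This is why the construction pays off most for deep, narrow BCAMs, where the depth D rather than the pattern width P dominates the cost.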
Reconfigurable Computing and FPGAs | 2011
Ameer M. S. Abdelhadi; Guy Lemieux
SRAM-based Field-Programmable Gate Arrays (FPGAs) are configured from off-chip memory through a serial link. Hence, a large configuration bit stream adversely increases off-chip memory size as well as bit stream loading time. This work proposes a novel method to reduce the number of programming bits required for look-up tables (LUTs), thereby reducing the overall configuration bit stream size. Alternatively, the identified redundancy may be used to hide watermarking or security data. The proposed method does not affect the critical timing paths, nor does it affect the internal architecture of the LUT. The suggested method eliminates floor(log2(k!)) configuration bits out of the 2^k configuration bits required by a k-input LUT (k-LUT). Hence, a 4-LUT, 5-LUT, and 6-LUT require only 12, 26, and 55 bits, respectively, to be stored in the external configuration bit stream, representing reductions of 25%, 18.75%, and 14% in LUT configuration bits, respectively. Note that the LUTs themselves still contain the full 16, 32, and 64 bits, respectively, but the missing bits are regenerated at bit stream load time. Furthermore, traditional lossless compression methods can still be employed on top of the proposed reduction technique.
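The figures quoted above follow directly from the formula; a worked check:
\[ k=4:\; \lfloor\log_2 4!\rfloor = \lfloor\log_2 24\rfloor = 4, \qquad 2^4 - 4 = 12 \text{ bits stored } (25\% \text{ saved}); \]
\[ k=5:\; \lfloor\log_2 120\rfloor = 6, \qquad 32 - 6 = 26 \text{ bits } (18.75\%); \qquad k=6:\; \lfloor\log_2 720\rfloor = 9, \qquad 64 - 9 = 55 \text{ bits } (\approx 14\%). \]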
IEEE International Symposium on Asynchronous Circuits and Systems | 2017
Ameer M. S. Abdelhadi; Mark R. Greenstreet
This paper presents a family of FIFOs for clock-domain crossings. These designs are distinguished by an interleaved architecture for the control and data paths. This approach eliminates most of the throughput bottlenecks in the FIFO design, allowing operation at well over 1 GHz in a 65 nm process using a standard ASIC design flow. Furthermore, these designs are low-latency: the fall-through time for an empty FIFO is only a few gate delays greater than the synchronizer latency. Our designs are fully synthesizable using widely available design libraries. Finally, we identify a glitch vulnerability present in many published designs and describe our solutions to these hazards.
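For readers new to clock-domain-crossing FIFOs, the generic building block such designs start from is a multi-flop synchronizer applied to a Gray-coded pointer, sketched below in behavioral Verilog. This is background only, not the interleaved architecture proposed in the paper, and the names are illustrative.

module gray_ptr_sync #(parameter W = 4)(
  input  wire         clk_dst,
  input  wire [W-1:0] gray_src,  // Gray-coded pointer from the source clock domain
  output reg  [W-1:0] gray_dst   // synchronized pointer in the destination domain
);
  reg [W-1:0] meta;              // first flop: may go metastable

  always @(posedge clk_dst) begin
    meta     <= gray_src;        // Gray coding guarantees at most one bit changes per
    gray_dst <= meta;            // update, so any captured value is a valid pointer
  end
endmodule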
Archive | 2016
Ameer M. S. Abdelhadi
Since they were first introduced three decades ago, Field-Programmable Gate Arrays (FPGAs) have evolved from being used merely as glue logic to implementing entire compute accelerators. These massively parallel systems demand highly parallel memory structures to keep pace with their concurrent nature, since memories are usually the bottleneck of computation performance. However, the vast majority of FPGA devices provide only dual-ported SRAM blocks. In this dissertation, we propose new ways to build area-efficient, high-performance SRAM-based parallel memory structures in FPGAs, specifically multi-ported random access memories and Content-Addressable Memories (CAMs). While parallel computation demands more RAM ports, leading multi-ported RAM techniques in FPGAs have relatively large overhead in resource usage. As a result, we have produced new design techniques that are near-optimal in resource overhead and have several practical advantages. The suggested method reduces RAM usage by over 44% and improves clock speed by over 76% compared to the best of previous approaches. Furthermore, we propose a novel switched-ports technique that allows further area reduction if some RAM ports can alternate dynamically between read and write access.
Integration | 2013
Ameer M. S. Abdelhadi; Ran Ginosar; Avinoam Kolodny; Eby G. Friedman
Clock skew variations adversely affect timing margins, limiting performance, reducing yield, and possibly leading to functional faults. Non-tree clock distribution networks, such as meshes and crosslinks, are employed to reduce skew and to mitigate skew variations. These networks, however, increase the dissipated power while consuming significant metal resources. Several methods have been proposed to trade off power and wires to reduce skew. In this paper, an efficient algorithm is presented to reduce clock skew variations while minimizing power dissipation and metal area overhead. With a combination of nonuniform meshes and unbuffered trees (UBTs), a variation-tolerant hybrid clock distribution network is produced. Clock skew variations are selectively reduced based on circuit timing information generated by static timing analysis (STA). The skew variation reduction procedure is prioritized for critical timing paths, since these paths are more sensitive to skew variations. A framework for skew variation management is proposed. The algorithm has been implemented in a standard 65 nm cell library using standard EDA tools and tested on several benchmark circuits. As compared to other nonuniform mesh construction methods that do not support managed skew tolerance, experimental results exhibit a 41% average reduction in metal area and a 43% average reduction in power dissipation. As compared to other methods that employ skew tolerance management techniques but do not use a hybrid clock topology, an 8% average reduction in metal area and a 9% average reduction in power dissipation are achieved.