Monther Abusultan
Texas A&M University
Publications
Featured research published by Monther Abusultan.
IEEE Computer Society Annual Symposium on VLSI | 2016
Monther Abusultan; Sunil P. Khatri
This paper presents a method to use floating gate (flash) transistors to implement low power ternary-valued digital circuits targeting handheld and IoT devices. Since the threshold voltage of flash devices can be modified at a fine granularity during programming, our approach has several advantages. First, speed binning at the factory can be controlled with precision. Second, an IC can be re-programmed in the field to negate effects such as aging, which has become a significant problem in recent times, particularly for mission-critical applications. We present the circuit topology that we use in our flash-based ternary-valued digital circuit approach, and, through circuit simulations, show that it yields significantly improved power (~11% of CMOS), energy (~29% of CMOS) and area (~83% of CMOS) while operating at 36% of the clock rate of a traditional CMOS standard cell based approach, when averaged over 20 designs. Unlike CMOS, our ternary-valued, flash-based implementation provides in-field configuration flexibility.
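For readers unfamiliar with ternary logic, the following is a minimal behavioral sketch (in Python) of the three-valued operators commonly used in the multi-valued logic literature; it illustrates the logic family only, and is not the flash-transistor circuit topology presented in the paper.

    # Behavioral sketch of ternary (3-valued) logic, for illustration only.
    # The paper realizes such functions with flash-transistor circuits.

    LEVELS = (0, 1, 2)  # the three logic levels

    def t_not(a):
        """Standard ternary inverter: 0 -> 2, 1 -> 1, 2 -> 0."""
        return 2 - a

    def t_and(a, b):
        """Ternary AND, conventionally defined as min."""
        return min(a, b)

    def t_or(a, b):
        """Ternary OR, conventionally defined as max."""
        return max(a, b)

    if __name__ == "__main__":
        # Truth table of the two-input ternary AND.
        for a in LEVELS:
            for b in LEVELS:
                print(f"AND({a},{b}) = {t_and(a, b)}")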
Great Lakes Symposium on VLSI | 2014
Monther Abusultan; Sunil P. Khatri
The FinFET device has gained much traction in recent VLSI designs. In the FinFET device, the conduction channel is vertical, unlike a traditional bulk MOSFET, in which the conduction channel is planar. This yields several benefits, and as a consequence, it is expected that most VLSI designs will utilize FinFETs from the 20nm node and beyond. Despite the fact that several research papers have reported FinFET based circuit and layout realizations for popular circuit blocks, there has been no reported work on the use of FinFETs for Field Programmable Gate Array (FPGA) designs. The key circuit in the FPGA that enables programmability is the n-input Look-up Table (LUT). An n-input LUT can implement any logic function of up to n inputs. In this paper, we present an evaluation of several FPGA LUT designs. We compare these designs from a performance (delay, power, energy) as well as an area perspective. Comparisons are conducted with respect to a bulk based LUT as well. Our results demonstrate that all the FinFET based LUTs exhibit better delays and energy than the bulk based LUT. Based on our comparisons, we have two winning candidate LUTs, one for high performance designs (3X faster than a bulk based LUT) and another for low energy, area constrained designs (83% energy and 58% area compared to a bulk based LUT).
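As a behavioral illustration of the LUT concept discussed above (independent of the FinFET circuit designs compared in the paper), an n-input LUT is simply a 2^n-entry truth table addressed by the input bits; a sketch in Python:

    # Behavioral model of an n-input FPGA look-up table (LUT).
    # The configuration bits store the truth table; evaluation is an
    # indexed read (circuit-level LUTs realize this with a mux tree).

    class LUT:
        def __init__(self, n, truth_table):
            assert len(truth_table) == 2 ** n, "one bit per input combination"
            self.n = n
            self.table = list(truth_table)

        def eval(self, inputs):
            """inputs: sequence of n bits (0/1); inputs[0] is the LSB of the address."""
            addr = sum(bit << i for i, bit in enumerate(inputs))
            return self.table[addr]

    # Example: a 3-input LUT configured as a majority function.
    maj3 = LUT(3, [0, 0, 0, 1, 0, 1, 1, 1])
    print(maj3.eval([1, 1, 0]))  # -> 1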
International Conference on Computer Design | 2016
Monther Abusultan; Sunil P. Khatri
Floating gate (flash) transistors are used exclusively for memory applications today. These applications include SD cards of various form factors, USB flash drives and SSDs. This paper presents the first approach that uses flash transistors to implement binary-valued digital circuits. Since the threshold voltage of flash devices can be modified at a fine granularity during programming, our approach offers several advantages. First, speed binning at the factory can be controlled with precision. Second, an IC can be re-programmed in the field to negate effects such as aging, which has become a significant problem in recent times, particularly for mission-critical applications. We present the circuit topology that we use in our flash-based digital circuit approach, and, through circuit simulations, show that our approach yields significantly improved delay (0.84×), power (0.35×), energy (0.30×) and area (0.54×) characteristics compared to a traditional CMOS standard cell based approach, when averaged over 20 randomly generated designs. Note that we used the same operating voltage (1V) for both design styles. Our proposed circuit design style is not an FPGA, because it uses hardwired interconnect. Rather, our design approach is a method to design ASIC or custom/semi-custom digital circuits.
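The reported factors are mutually consistent under the usual relation energy = power × delay; a quick check (assuming the factors are all relative to the CMOS baseline):

    # Sanity check of the reported relative figures (CMOS baseline = 1.0),
    # using energy = power * delay.
    delay_ratio = 0.84
    power_ratio = 0.35
    energy_ratio = delay_ratio * power_ratio
    print(round(energy_ratio, 2))  # -> 0.29, close to the reported 0.30x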
International Conference on Computer Design | 2016
Abbas Fairouz; Monther Abusultan; Sunil P. Khatri
Historically, microprocessor instructions were designed to obtain high performance on integer and floating point computations. Today's applications, however, demand high performance for cloud computing, web-based search engines, network applications, and social media tasks. Such software applications make extensive use of hashing in their computation. Hashing can reduce the complexity of search and lookup from O(n) to O(n/k), where k bins are used. In modern microprocessors, hashing is done in software. In this paper, we propose a novel hardware hash unit design for use in modern microprocessors. We present the design of the Hash Unit (HU) at the micro-architecture level. We simulate the new HU and compare its performance with a software-based hash implementation. We demonstrate a significant speed-up (up to 12×) for the HU. Furthermore, the performance scales elegantly with increasing database size and application diversity, without increasing the hardware cost.
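To illustrate the O(n) to O(n/k) argument (a software sketch only, not the proposed Hash Unit's micro-architecture): distributing n keys across k bins means a lookup scans only the roughly n/k entries of one bin.

    # Minimal chained hash table with k bins, illustrating why lookup cost
    # drops from O(n) (scan everything) to O(n/k) (scan one bin) on average.

    class BinnedHashTable:
        def __init__(self, k):
            self.bins = [[] for _ in range(k)]

        def _bin(self, key):
            return hash(key) % len(self.bins)

        def insert(self, key, value):
            self.bins[self._bin(key)].append((key, value))

        def lookup(self, key):
            # Only the selected bin (~n/k entries on average) is scanned.
            for k_, v in self.bins[self._bin(key)]:
                if k_ == key:
                    return v
            return None

    table = BinnedHashTable(k=16)
    table.insert("alpha", 1)
    print(table.lookup("alpha"))  # -> 1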
International Conference on Computer-Aided Design | 2016
Monther Abusultan; Sunil P. Khatri
Traditionally, floating gate (flash) transistors have been used exclusively to implement non-volatile memory in its various forms. Recently, we showed that flash transistors can be used to implement digital circuits as well. In this paper, we present the details of the realization and characteristics of block-level flash-based digital design. The current work describes the synthesis flow to decompose a circuit block into a network of interconnected flash cells (FCs). The resulting network is characterized with respect to timing, power and energy, and the results are compared with a standard-cell based realization of the same block (obtained using commercial tools). We obtain significantly improved delay (0.59×), power (0.35×) and cell area (0.60×) compared to a traditional CMOS standard-cell based approach, when averaged over 12 standard benchmarks. It is rare for a circuit methodology to yield results that are better than existing commercial standard-cell based flows in terms of delay, area, power and energy, and in this sense, we submit that our results are significant. Additional benefits of a flash-based digital design are that it allows for precision speed binning in the factory, and enables in-field re-programmability (we note that our flash-based design is not an FPGA, but rather an ASIC style design) to counteract the speed degradation of a design due to aging. These benefits arise from the fact that the threshold voltage of flash devices can be controlled with precision.
Great Lakes Symposium on VLSI | 2017
Abbas Fairouz; Monther Abusultan; Sunil P. Khatri
Modern microprocessors contain several Special Function Units (SFUs) such as specialized arithmetic units, cryptographic processors, etc. In recent times, applications such as cloud computing, web-based search engines, and network applications are widely used, and place new demands on the microprocessor. Hashing is a key algorithm that is extensively used in such applications, and is typically performed in software. Thus, implementing a hardware-based hash unit on a modern microprocessor could significantly increase performance. In this paper, we present the circuit design for a hardware hash unit (HU) for modern microprocessors, using a 45nm technology. Our proposed hardware hash unit is based on the use of a CAM to implement each bin of the hash function. We simulate the HU circuit and compare it with a traditional CAM design. We demonstrate an average power reduction of 5.48x using the HU over the traditional CAM. Also, we show that the HU can operate at a maximum frequency of 1.39 GHz (after accounting for process, voltage and temperature (PVT) variations and for wiring parasitics). Furthermore, we present the delay, power and area trade-offs of the HU design with varying hash table sizes.
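One intuitive (and purely illustrative) way to see where a power reduction can come from: a monolithic CAM compares the search key against every stored entry on each lookup, whereas the binned organization activates only the bin selected by the hash. The sketch below uses assumed values for N and k, not figures from the paper.

    # Toy activity model: a flat CAM compares all N entries per search,
    # while a binned hash unit compares only ~N/k entries in one bin.
    # N and k below are illustrative assumptions, not the paper's values.

    N = 4096          # total stored entries (assumed)
    k = 64            # number of bins (assumed)

    cam_comparisons = N        # comparators active per search in a flat CAM
    bin_comparisons = N // k   # comparators active per search in one bin

    print(cam_comparisons / bin_comparisons)  # -> 64.0x fewer active comparators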
International Conference on Computer Design | 2014
Viacheslav V. Fedorov; Monther Abusultan; Sunil P. Khatri
This paper presents a Ternary Content-addressable Memory (TCAM) design which is based on the use of floating-gate (flash) transistors. TCAMs are extensively used in high speed IP networking, and are commonly found in routers in the internet core. Traditional TCAM ICs are built using CMOS devices, and a single TCAM cell utilizes 17 transistors. In contrast, our TCAM cell utilizes only 2 flash transistors, thereby significantly reducing circuit area. We cover the chip-level architecture of the TCAM IC briefly, focusing mainly on the TCAM block which does fast parallel IP routing table lookup. Our flash based TCAM block is simulated in SPICE, and we show that it has a significantly lowered area compared to a CMOS based TCAM block, with a speed that can meet current (~400 Gb/s) data rates that are found in the internet core.
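For readers unfamiliar with TCAM semantics, here is a behavioral sketch of how a single entry matches (independent of the 2-flash-transistor cell presented in the paper): each stored bit is 0, 1, or "don't care", which is exactly what IP route prefixes require.

    # Behavioral model of TCAM matching for IP route lookup.
    # Each entry stores a (value, mask) pair; bits where mask = 0 are
    # "don't care". Hardware searches all entries in parallel; this
    # sketch only reproduces the matching semantics.

    def tcam_match(key, value, mask):
        """True if `key` matches `value` on every bit where `mask` is 1."""
        return (key & mask) == (value & mask)

    # Example: the IPv4 prefix 192.168.1.0/24 stored as 32-bit integers.
    prefix = 0xC0A80100   # 192.168.1.0
    mask24 = 0xFFFFFF00   # upper 24 bits significant
    packet = 0xC0A80137   # 192.168.1.55

    print(tcam_match(packet, prefix, mask24))  # -> True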
Great Lakes Symposium on VLSI | 2017
Monther Abusultan; Sunil P. Khatri
Flash transistors serve as the technology of choice for implementing non-volatile memory, and flash memory densities continue to increase to meet storage demands. One of the key features that has enabled these increasing densities is the ability to program a flash transistor to have multiple threshold voltages. This feature has recently been exploited to implement ternary-valued logic. However, that implementation exhibited increased delays, due to the lowered Vgs values which result from using multiple threshold voltages. In this work, we present a circuit implementation that uses flash transistors to implement multi-valued digital circuits. The flash transistors used in our implementation need only two threshold voltages. As a result, they have higher Vgs values, which improves delay significantly, and they also have higher write endurance. We evaluate our design methodology through circuit simulations, and compare our results to a CMOS standard cell based approach as well as the previously reported implementation of ternary-valued logic using flash transistors. Averaged over 20 designs, we report improvements in delay (23% lower), power (5% lower), energy (26% lower) and physical area (4% lower) compared to a CMOS standard cell based implementation. This is significant since it is hard for a new circuit approach to beat the established standard cell based approach in all figures of merit (delay, area, power and energy). Compared to the previously reported ternary flash-based implementation, our designs operate 3.15x faster, while consuming more power and energy. The design approach presented in this paper targets high performance applications, unlike the TLC-based approach, which targets low power and low speed applications. The proposed approach also scales elegantly to multi-valued logic with more than three values.
International Conference on Computer Design | 2016
Monther Abusultan; Sunil P. Khatri
Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, SRAM-based FPGAs suffer from high power consumption, prolonged boot delays (due to the volatility of the configuration bits), and a significant area overhead (due to the use of 5T SRAM cells for the configuration bits). Floating gate (flash) based FPGAs can avert these problems. This paper presents a study of flash-based FPGA designs (both static and dynamic), and presents the tradeoff of delay, power dissipation and energy consumption of the various designs. Our work differs from previously proposed flash-based FPGAs, since we embed the flash transistors (which store the configuration bits) directly within the logic and interconnect fabrics. We also present a detailed description of how the programming of the configuration bits is accomplished. Our delay and power estimates are derived from circuit level simulations. Our proposed static flash-based LUT structure yields 10% faster operation, 12% lower dynamic power dissipation, 21% lower energy consumption and 29% lower static power dissipation compared to a traditional SRAM-based LUT. We also show that, for high performance applications, a dynamic flash-based LUT can achieve further performance improvements (32% lower delay) with higher energy consumption (37% higher) compared to an SRAM-based LUT. We also show that a flash-based interconnect structure provides 89% lower delay and 71% lower overall power consumption compared to the traditional interconnect structure used in SRAM-based FPGAs.
Great Lakes Symposium on VLSI | 2015
Monther Abusultan; Sunil P. Khatri
In this paper, we present a circuit-level analysis of deep voltage-scaled FPGAs, which operate from the full supply voltage down to sub-threshold voltages. The logic as well as the interconnect of the FPGA are modeled at the circuit level, and their relative contributions to the delay, power and energy of the FPGA are studied by means of circuit simulations. Three representative designs are studied to explore these design trade-offs. We conclude that the energy- and delay-minimal FPGA design is one in which both the interconnect and logic are prevented from scaling below a fixed voltage (about 550mV in our experiments). If power is a more important design factor (at the cost of delay), it is beneficial to operate both the logic and interconnect between 300mV and 800mV.