Khalid H. Abed
Jackson State University
Publication
Featured research published by Khalid H. Abed.
IEEE Transactions on Signal Processing | 2006
Shailesh B. Nerurkar; Khalid H. Abed
This correspondence presents a new design of a low-power decimation filter, which consists of a poly-phase finite-impulse-response (FIR)-4 comb filter, an approximate linear-phase one-third-band infinite-impulse-response (IIR) filter, and a half-band FIR filter. A poly-phase FIR-4 comb filter architecture was designed, which consumes 75% less power than the recursive comb filter architecture. New general equations were derived for the design of 1/N-band IIR filters, and these equations were used to design a novel approximate linear-phase one-third-band IIR filter. The decimation filter is designed using Simulink and DSP Blockset, and Matlab simulations are performed to verify the phase linearity and magnitude response of the designed filter. The group delay for the passband region of one-third-band IIR filter has a negligible error of 0.009%. The resulting decimation filter has 60% less hardware and consumes 67% less power than the comb-FIR-FIR decimation filter.
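The relationship between a recursive comb filter and its non-recursive polyphase counterpart can be sketched in a few lines. This is only an illustration, assuming a first-order comb of length 4 with decimation by 4 (the abstract does not give the filter order or word lengths); the function names are hypothetical and are not taken from the paper.

```python
def comb_recursive(x, R=4):
    """Integrator -> keep every R-th sample -> differentiator (recursive form)."""
    out, acc, prev = [], 0, 0
    for n, sample in enumerate(x):
        acc += sample                # integrator runs at the input rate
        if n % R == 0:               # downsample by R
            out.append(acc - prev)   # comb (difference) at the low rate
            prev = acc
    return out


def comb_polyphase(x, R=4):
    """Non-recursive form: R polyphase branches, each a single unit tap."""
    out = []
    for m in range(0, len(x), R):
        y = 0
        for k in range(R):           # branch k sees x[m - k]
            if m - k >= 0:           # samples before the start count as zero
                y += x[m - k]
        out.append(y)
    return out


samples = list(range(1, 13))
print(comb_recursive(samples))
print(comb_polyphase(samples))
```

Both functions produce the same output sequence for the same input. The sketch says nothing about power consumption, which depends on the hardware implementation.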
Journal of Computers | 2009
Justin L. Rice; Khalid H. Abed; Gerald R. Morris
Because of the increasing need to develop efficient high-speed computational kernels, researchers have been looking at various acceleration technologies. One approach is to use field programmable gate arrays (FPGAs) in conjunction with general purpose processors to form what are known as high performance reconfigurable computers (HPRCs). HPRCs have already been shown to work well for both fixed-point and integer calculations. Floating-point calculations are a different matter; obtaining speedups has been somewhat elusive. This article, after introducing the three primary HPRC development flows, takes a detailed look at “the three p’s,” which addresses the crucial relationship among performance, pipelining, and parallelism. It also examines “the FPGA design boundary,” which addresses some of the heuristics that allow developers to determine which application modules can be mapped onto the FPGAs. These ideas are illustrated by way of a simple floating-point application that is mapped onto a contemporary HPRC. This article expands upon earlier work by including details on how to map customized intellectual property cores into an HPRC environment via a hybrid development flow.
IEEE Transactions on Parallel and Distributed Systems | 2013
Gerald R. Morris; Khalid H. Abed
High-performance heterogeneous computers that employ field programmable gate arrays (FPGAs) as computational elements are known as high-performance reconfigurable computers (HPRCs). For floating-point applications, these FPGA-based processors must satisfy a variety of heuristics and rules of thumb to achieve a speedup compared with their software counterparts. By way of a simple sparse matrix Jacobi iterative solver, this paper illustrates some of the issues associated with mapping floating-point kernels onto HPRCs. The Jacobi method was chosen based on heuristics developed from earlier research. Furthermore, Jacobi is relatively easy to understand, yet is complex enough to illustrate the mapping issues. This paper is not trying to demonstrate the speedup of a particular application nor is it suggesting that Jacobi is the best way to solve equations. The results demonstrate a nearly threefold wall clock runtime speedup when compared with a software implementation. A formal analysis shows that these results are reasonable. The purpose of this paper is to illuminate the challenging floating-point mapping process while simultaneously showing that such mappings can result in significant speedups. The ideas revealed by research such as this have already been and should continue to be used to facilitate a more automated mapping process.
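For readers unfamiliar with the method, a software-only Jacobi iteration on a sparse matrix might look like the sketch below. It assumes compressed sparse row (CSR) storage and a fixed iteration count; both are illustrative choices, not details taken from the paper, and the function name is hypothetical.

```python
def jacobi_csr(values, col_idx, row_ptr, b, iterations=50):
    """Approximate the solution of Ax = b with Jacobi iteration; A is in CSR form."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            diag = 0.0
            off_diag_sum = 0.0
            # Walk the nonzeros of row i; the inner loop length varies per row.
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = values[k]
                else:
                    off_diag_sum += values[k] * x[j]
            x_new[i] = (b[i] - off_diag_sum) / diag
        x = x_new                    # every update uses only the previous iterate
    return x


# A = [[4, 1, 0], [1, 4, 1], [0, 1, 4]], chosen so that the solution is [1, 2, 3].
values = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
b = [6.0, 12.0, 14.0]
print(jacobi_csr(values, col_idx, row_ptr, b))
```

The example matrix is diagonally dominant, so the iteration converges. Each component of the new iterate depends only on the previous iterate, so the rows can be processed independently.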
IEEE International Conference on High Performance Computing Data and Analytics | 2009
Khalid H. Abed; Gerald R. Morris
Parallel codes with large-stride/irregular-stride (L/I) memory access patterns, e.g., sparse matrix and linked list codes, often perform poorly on mainstream clusters because of the general purpose processor (GPP) memory hierarchy. High performance reconfigurable computers (HPRCs) are parallel computing clusters containing multiple GPPs and field programmable gate arrays (FPGAs) connected via a high-speed network. In this research, simple 64-bit floating-point parallel codes are used to illustrate the performance impact of L/I memory accesses in software (SW) and FPGA-augmented (FA) codes and to assess the benefits of mapping L/I-type codes onto HPRCs. The experiments reveal that large-stride SW codes, particularly those involving data reuse, experience severe performance degradation compared with unit-stride SW codes. In contrast, large-stride FA codes experience minimal degradation compared with unit-stride FA codes. More importantly, for codes that involve data reuse, the experiments demonstrate performance improvements of up to nearly tenfold for large-stride FA codes compared with large-stride SW codes.
SoutheastCon | 2011
Nikeya S. Peay; Gerald R. Morris; Khalid H. Abed
One of the newest computational technologies is the high performance heterogeneous computer (HPHC), wherein dissimilar computational devices such as general purpose processors, graphics processors, and field programmable gate arrays (FPGAs) are used within a single platform to obtain a computational speedup. Jackson State University has a state-of-the-art HPHC cluster (an SRC-7), which contains traditional CPUs and reconfigurable processing units. The reconfigurable units are implemented using SRAM-based FPGAs. Currently, the off-the-shelf SRC-7 mechanism for incorporating user components (macros) does not directly support the common case of a multiple-file VHDL hierarchy. This research explores a novel approach that allows multiple-file VHDL floating-point kernels to be mapped onto the SRC-7. The approach facilitates the development of FPGA-based components via a hybrid technique that uses the SRC Carte compiler in conjunction with multiple-file VHDL-based user macros. This research shows how Quartus Wizard-based VHDL floating-point components can be integrated into the Carte development environment.
IEEE International Conference on High Performance Computing Data and Analytics | 2010
Gerald R. Morris; Ricky Y. McGruder; Khalid H. Abed
High performance reconfigurable computers (HPRCs), which combine general-purpose processors (GPPs) and field programmable gate arrays (FPGAs), are now commercially available. These interesting architectures allow for the creation of reconfigurable processors. HPRCs have already been used to accelerate integer and fixed-point applications. However, extensive parallelism and deeply pipelined floating-point cores are necessary to make MHz-scale FPGAs competitive with GHz-scale GPPs, thus making it difficult to accelerate certain kinds of floating-point kernels. Kernels with variable length nested loops, e.g., sparse matrix-vector multiply, have been problematic because of the loop-carried dependence associated with the pipelined floating-point units. While hardware description language (HDL)-based kernels have shown moderate success in addressing this problem, the use of a high-level language (HLL)-based approach to accelerate such applications has been rather elusive. If HPRCs are to become a part of mainstream military and scientific computing, we should emphasize the use of HLL-based programming, whenever possible, rather than HDL-based hardware design. The primary reason is the increased programmer productivity associated with HLLs when compared with HDLs. For example, the floating-point addition statement z = x+y, a single line in an HLL, corresponds to hundreds of lines of HDL. In this paper, we describe the design and implementation of a sparse matrix Jacobi processor to solve systems of linear equations, Ax=b. The parallelized, deeply pipelined, IEEE-754-compliant 32-bit floating-point sparse matrix Jacobi iterative solver runs on a contemporary HPRC. The FPGA-based components are implemented using only an HLL (the C programming language) and the Carte HLL-to-HDL compiler. An HLL-based streaming accumulator allows for the implementation of fully pipelined loops and results in a 2.5-fold wall clock runtime speedup when compared with an equivalent software-only implementation.
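The loop-carried dependence mentioned above arises because each addition in a running sum needs the result of the previous one, which a multi-cycle pipelined adder cannot deliver in time. One common workaround, shown here purely as an illustration, is to interleave several independent partial sums. This is not the paper's streaming accumulator, whose design the abstract does not detail; the `latency` parameter and the function name are assumptions.

```python
def interleaved_sum(values, latency=4):
    """Accumulate into `latency` independent partial sums, then combine them."""
    partial = [0.0] * latency
    for i, v in enumerate(values):
        # Slot i % latency is touched only once every `latency` steps, so a
        # pipelined adder with that many stages would never wait on itself.
        partial[i % latency] += v
    total = 0.0
    for p in partial:                # final reduction of the partial sums
        total += p
    return total


data = [float(i) for i in range(100)]
print(interleaved_sum(data) == sum(data))
```

Because floating-point addition is not associative, reordering the additions this way can change the rounding of the result in general; the integer-valued data above avoids that.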
International Conference on Electronics, Circuits, and Systems | 2007
Khalid H. Abed; Shailesh B. Nerurkar; Stephen Colaco
In this paper, we deal with the design and practical implementation of a decimation filter used for high performance audio applications. We implemented the decimation filter using the canonic signed digit (CSD) representation. The decimation filter was simulated using Matlab, and its complete architecture was realized using DSP Blockset and Simulink. The filter was implemented using Mentor Graphics ModelSim and Calibre tools in FPGA technology. The resulting architecture is hardware efficient and consumes less power than conventional decimation filters. Compared with the comb-FIR-FIR-FIR architecture, the designed decimation filter architecture saves 69% of the hardware and reduces power dissipation by 28%.
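As background, the canonic signed digit representation writes a number with digits from {-1, 0, +1} such that no two adjacent digits are nonzero, which tends to reduce the number of nonzero digits and therefore the number of adders in a multiplierless filter. The sketch below converts a non-negative integer to CSD; it is a generic textbook conversion with hypothetical function names, not code from the paper.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while n != 0:
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)          # +1 if n % 4 == 1, -1 if n % 4 == 3
        digits.append(d)
        n = (n - d) // 2             # remove the digit and shift right
    return digits


def from_csd(digits):
    """Rebuild the integer from least-significant-first CSD digits."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


print(to_csd(7))                     # 7 = 8 - 1
print(from_csd(to_csd(7)))
```

For 7, plain binary needs three nonzero digits (111), whereas the CSD form needs two (+1 at weight 8 and -1 at weight 1).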
Journal of Computers | 2013
Gerald R. Morris; Khalid H. Abed
Contemporary field programmable gate arrays (FPGAs) combine the fine-grained design capability of the traditional lookup table with the speed of medium-scale and large-scale logic components such as RAM blocks or DSP blocks to provide for significant computational capability from a single FPGA. High performance reconfigurable computers, which typically use FPGAs as computational elements, have been commercially used to accelerate computational kernels. However, the deep pipelines and extensive parallelism needed for FPGAs to compete with GHz-scale general purpose processors make mapping of floating-point kernels a challenging research area. In this paper, we describe some of the progress that has been made towards solving some of these mapping challenges.
SoutheastCon | 2011
Antoinette R. Anderson; Gerald R. Morris; Khalid H. Abed
As Reconfigurable Computing (RC) closes its sixth decade, significant improvements have been made to make this technology a competitor for application-specific integrated circuits (ASICs). Because a field programmable gate array (FPGA) operates at a significantly lower speed than a general purpose processor (GPP), the developer must exploit every avenue possible to attain a speedup on a heterogeneous computer. Achieving a significant speedup is what makes the RC application development process worthwhile. The developer may reap the benefits of having better computational power at a lower cost than using a traditional ASIC. This occurs primarily through efforts to pipeline and parallelize processes on an FPGA. In addition to the traditional “three Ps,” this paper highlights another speedup avenue via true multilevel parallelism. In particular, it further demonstrates this concept by using a threaded programming model that allows for the GPP and the FPGA to run simultaneously. This method is realized through a threaded dot product on a heterogeneous computer.
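The idea of splitting a dot product between two concurrently running workers can be sketched as follows. Both workers here are ordinary software threads standing in for the GPP share and the FPGA share; the split point, the function names, and the use of Python's thread pool are assumptions made for illustration and do not reflect the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(a, b, start, stop):
    """Dot product of a[start:stop] with b[start:stop]."""
    return sum(a[i] * b[i] for i in range(start, stop))


def threaded_dot(a, b, split):
    """Compute a . b with two workers, each handling one side of `split`."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(partial_dot, a, b, 0, split)        # stand-in for one device
        second = pool.submit(partial_dot, a, b, split, len(a))  # stand-in for the other
        return first.result() + second.result()


print(threaded_dot([1, 2, 3, 4], [5, 6, 7, 8], split=2))
```

On a real heterogeneous machine the split point would presumably be chosen so that both devices finish at about the same time.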
Journal of Circuits, Systems, and Computers | 2009
Shailesh B. Nerurkar; Khalid H. Abed
This paper presents a design of a novel cascaded third-order feed-forward delta-sigma analog-to-digital converter (ADC). This ADC is realized using fully differential switched capacitor architecture and produces a 12-bit resolution at a data output rate (DOR) of 2.5 MS/s for RF wireless applications. The delta-sigma modulator consists of a second-order single-bit feed-forward modulator cascaded with a multi-bit first-order modulator. The cascaded feed-forward third-order (2-1) ADC is simulated using Matlab and Simulink. The delta-sigma modulator was designed using Cadence Virtuoso in TSMC 0.18 μm CMOS technology. The power consumption of the designed modulator is 12.74 mW, and the resolution is 11.85 bits for an over-sampling ratio of M = 32. The figure of merit is 1.38 pJ at a sample rate of 80 MS/s. The proposed delta-sigma modulator is compared with other state-of-the-art low-pass delta-sigma modulators in terms of speed, power, and DOR, and it has one of the lowest power consumptions among them.
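The reported figure of merit can be checked against the other numbers in the abstract. The sketch below assumes the commonly used definition FoM = P / (2^ENOB × f), with f taken as the data output rate (the 80 MS/s sample rate divided by the over-sampling ratio of 32). The abstract does not state which definition the authors used, so this is an assumption; the function name is hypothetical.

```python
def figure_of_merit(power_w, resolution_bits, rate_hz):
    """Energy per conversion step in joules: P / (2**bits * rate)."""
    return power_w / (2 ** resolution_bits * rate_hz)


sample_rate_hz = 80e6
oversampling_ratio = 32
output_rate_hz = sample_rate_hz / oversampling_ratio    # 2.5 MS/s

fom = figure_of_merit(12.74e-3, 11.85, output_rate_hz)
print(f"{fom * 1e12:.2f} pJ per conversion step")
```

Under these assumptions the result rounds to 1.38 pJ, which agrees with the value quoted in the abstract.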