Khaled E. Ahmed
Alexandria University
Publication
Featured research published by Khaled E. Ahmed.
reconfigurable computing and fpgas | 2014
Khaled E. Ahmed; Mohammed M. Farag
Intra-chip communication is a major bottleneck in modern multiprocessor system-on-chip (MPSoC) designs. The bus topology is the most common on-chip interconnect technology, and bus contention is one of the major issues in bus-based MPSoC designs. Code division multiple access (CDMA) has been proposed as a bus sharing strategy to overcome the bus contention problem. In CDMA, only a limited number of orthogonal spreading codes can share the medium due to the Multiple Access Interference (MAI) problem. In wireless communications, overloaded CDMA has been considered to increase the system capacity by adding extra non-orthogonal spreading codes with specific characteristics. We propose a novel CDMA bus architecture leveraging the overloaded CDMA concepts to increase the maximum number of cores sharing the same CDMA bus in MPSoC by 25% at a marginal cost. The overloaded CDMA bus architecture is illustrated, resource- and speed-efficient decoding circuits are presented, and a prototype system is implemented and validated on a Virtex-7 FPGA VC707 evaluation kit. The overloaded and ordinary CDMA bus architectures are compared in terms of resource usage, power consumption, bus operating clock frequency, and bandwidth. Evaluation results show an improvement of approximately 25% in resource utilization and power consumption per unit (IP core) and in the bus bandwidth, while preserving the access delay of the ordinary CDMA bus.
international conference on electronics, circuits, and systems | 2015
Khaled E. Ahmed; Mohammed M. Farag
In this paper, we present a novel design of a dynamically configurable hardware accelerator for the new NIST SHA-3 standard, namely the Keccak hashing function. The SHA-3 accelerator is composed of a static datapath based on two different folded architectures of the Keccak function and controlled by a programmable Finite State Machine (FSM) that can be dynamically configured at run-time to hash a message of arbitrary size and digest length. The proposed hardware architectures enable implementing all functions and modes of operation supported by the Keccak SHA-3 hashing standard. Two prototypes of the accelerator are developed and validated: one on a Xilinx Virtex-6 FPGA kit as a stand-alone system, and one on a ZedBoard kit featuring a Zynq-7000 SoC FPGA, where the SHA-3 accelerator is implemented in the programmable logic and interfaced to an ARM Cortex-A9 processor. Hardware implementation is followed by a hardware/software co-design of a SHA-3 SoC running the keyed-Hash Message Authentication Code (HMAC) and Pseudo-Random Number Generator (PRNG) security applications. The ARM processor runs the application software and offloads SHA-3 computations to the hardware accelerator. The implementation results illustrate the performance enhancement of the SHA-3 SoC over pure software implementations in addition to the unprecedented flexibility offered by the proposed accelerators.
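For readers unfamiliar with the SHA-3 family, the following is a minimal pure-software sketch of the kinds of operations the abstract mentions: fixed-length SHA-3 digests, a caller-chosen digest length, and HMAC keyed with SHA-3. It uses Python's standard hashlib and hmac modules and is not the accelerator or the SoC software described above; the message and key values are hypothetical.

```python
import hashlib
import hmac

message = b"example message"   # hypothetical input, not from the paper
key = b"example key"           # hypothetical HMAC key, not from the paper

# Fixed-length SHA-3 digests (the standard defines 224, 256, 384 and 512 bits).
digest_256 = hashlib.sha3_256(message).hexdigest()
digest_512 = hashlib.sha3_512(message).hexdigest()

# SHAKE128 is an extendable-output function from the same family:
# the caller chooses the digest length (here 40 bytes).
xof_40_bytes = hashlib.shake_128(message).hexdigest(40)

# HMAC using SHA3-256 as the underlying hash, computed entirely in software.
tag = hmac.new(key, message, hashlib.sha3_256).hexdigest()

# Each hex character encodes 4 bits.
print(len(digest_256) * 4, len(digest_512) * 4, len(xof_40_bytes) * 4)
print(tag)
```

A software baseline of this kind is the sort of reference against which a hardware-offloaded implementation could be compared.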
high performance interconnects | 2015
Khaled E. Ahmed; Mohammed M. Farag
On-chip interconnect is a major building block and a main performance bottleneck in modern complex System-on-Chips (SoCs). The bus topology and its derivatives are the most deployed communication architectures in contemporary SoCs. Space switching, exemplified by crossbars and multiplexers, and time sharing are the key enablers of various bus architectures. The crossbar has quadratic complexity, while resource sharing significantly degrades the overall system performance. In this work, we motivate using Code Division Multiple Access (CDMA) as a bus sharing strategy, which offers many advantages over other topologies. Our work seeks to complement the conventional CDMA bus features by applying overloaded CDMA practices to increase the bus utilization efficiency. We propose the Difference-Overloaded CDMA Interconnect (D-OCI) bus that leverages the balancing property of the Walsh codes to increase the number of interconnected elements by 50%. Two implementations of the D-OCI bus optimized for both speed and resource utilization are presented. The bus operation is validated on a Xilinx Artix-7 AC701 FPGA kit and the bus performance is evaluated and compared to other existing bus topologies. We also present the synthesis results for the UMC-0.13 μm design kit to give an idea of the maximum achievable bus frequency on ASIC platforms. Moreover, we advance a proof-of-concept HLS implementation of the D-OCI bus on a Xilinx Zynq-7000 SoC and compare its performance, latency, and resource utilization to the ARM AXI bus. The performance evaluation demonstrates the superiority of the D-OCI bus.
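The balancing property of Walsh codes mentioned in this abstract can be checked numerically. The sketch below builds Walsh-Hadamard codes with the Sylvester construction and verifies that distinct codes are orthogonal and that every code except the all-ones code contains equal numbers of +1 and -1 chips. The function names and the code length N = 8 are assumptions made for illustration; how D-OCI exploits this property is not shown here.

```python
def walsh_codes(n):
    """Return n Walsh-Hadamard codes of length n (entries +1/-1), in Hadamard order.

    Uses the Sylvester construction H_2m = [[H_m, H_m], [H_m, -H_m]];
    n must be a power of two.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def dot(a, b):
    """Inner product of two equal-length chip sequences."""
    return sum(x * y for x, y in zip(a, b))


N = 8  # illustrative code length
codes = walsh_codes(N)

# Orthogonality: a code correlates to N with itself and to 0 with any other code.
orthogonal = all(
    dot(codes[i], codes[j]) == (N if i == j else 0)
    for i in range(N)
    for j in range(N)
)

# Balance: every code except the all-ones code (row 0) sums to zero.
balanced = all(sum(code) == 0 for code in codes[1:])

print(orthogonal, balanced)
```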
international conference on electronics, circuits, and systems | 2015
Khaled E. Ahmed; Mohammed M. Farag
On-chip interconnects are the performance bottleneck in modern System-on-Chips (SoCs). Bus topologies and Networks-on-Chip (NoCs) are the main approaches used to implement on-chip communication. The interconnect fabric enables resource sharing by Time and/or Space Division Multiple Access (T/SDMA) techniques. Code Division Multiple Access (CDMA) has been proposed to enable resource sharing in on-chip interconnects where each data bit is spread by a unique orthogonal spreading code of length N. Unlike T/SDMA, in wireless CDMA, the communication channel capacity can be increased by overcoming the Multiple Access Interference (MAI) problem. In response, we present two overloaded CDMA interconnect (OCI) bus architectures, namely TDMA-OCI (T-OCI) and Parallel-OCI (P-OCI), to increase the classical CDMA interconnect capacity. We implement and validate the T-OCI and P-OCI bus topologies on the Xilinx Artix-7 AC701 kit. We compare the basic SDMA, TDMA, and CDMA buses and evaluate the OCI buses in terms of the resource utilization and bus bandwidth. The results show that the T-OCI bus achieves 100% higher bus capacity and 31% less resource utilization compared to the conventional CDMA bus topology. The P-OCI bus provides N times higher bus bandwidth compared to the T-OCI bus at the expense of increased resource utilization.
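As a rough illustration of the classical (non-overloaded) scheme this abstract starts from, the sketch below spreads one data bit per sender with a distinct orthogonal Walsh code of length N, sums the chips on a shared bus, and recovers each bit by correlation. It uses a bipolar (+1/-1) arithmetic model borrowed from wireless CDMA; the function names, N = 8, and the example bits are assumptions, and the sketch does not model the T-OCI or P-OCI architectures or any overloading beyond N codes.

```python
def walsh_codes(n):
    """Return n Walsh-Hadamard codes of length n (entries +1/-1); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def spread(bits, codes):
    """Spread each sender's bit with its own code and sum the chips on the shared bus."""
    symbols = [1 if b else -1 for b in bits]   # map bit 1 -> +1, bit 0 -> -1
    length = len(codes[0])
    return [sum(d * code[k] for d, code in zip(symbols, codes)) for k in range(length)]


def despread(bus, code):
    """Correlate the bus chips with one code; the sign of the result gives the bit."""
    correlation = sum(s * c for s, c in zip(bus, code))
    return 1 if correlation > 0 else 0


N = 8                                  # illustrative spreading-code length
codes = walsh_codes(N)
bits = [1, 0, 0, 1, 1, 0, 1, 0]        # hypothetical data: one bit per sender
bus = spread(bits, codes)              # N summed chips, one per bus cycle
decoded = [despread(bus, code) for code in codes]

print(bus)
print(decoded == bits)
```

Because the codes are mutually orthogonal, each correlation equals +N or -N and all N senders are decoded without interference; assigning more than N codes breaks this guarantee, which is the capacity limit the overloaded schemes address.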
international workshop on signal processing advances in wireless communications | 2016
Mostafa Medra; Khaled E. Ahmed; Timothy N. Davidson
Ordered successive interference cancellation (OSIC) is a well-established detection strategy for multiple-input multiple-output (MIMO) systems. It offers good performance and has an efficient hardware implementation. Typically, the order in which the symbols are detected is based on the channel matrix and the noise statistics. In this paper, a modified ordering scheme that makes use of the information in the received signal is developed, and it is shown that this ordering provides a lower bit error rate. The proposed algorithm can be implemented in hardware as efficiently as the conventional OSIC algorithms and the results from a preliminary VLSI layout are included.
application-specific systems, architectures, and processors | 2016
Ahmed S. Eissa; Mahmoud A. Elmohr; Mostafa A. Saleh; Khaled E. Ahmed; Mohammed M. Farag
The Secure Hash Algorithm 3 (SHA-3) is a cryptographic hash function widely used in most security applications. The execution of the SHA-3 function is computationally intensive on lightweight embedded RISC processors. In this work, we advance a SHA-3 Instruction Set Extension (ISE) to improve its performance on a 32-bit MIPS processor. Two ISE approaches are proposed, namely native datapath and coprocessor-based ISEs. The ISE is developed with the aid of Codasip Studio, and the extended processor is implemented and benchmarked on a Xilinx Virtex-6-XC6VLX75t FPGA. The benchmarking results exhibit a 21% and 43% increase in the execution speed of the SHA-3 algorithm on the MIPS processor at the expense of 9% and 26% resource overheads for the native datapath and coprocessor-based ISEs, respectively.
topical conference on antennas and propagation in wireless communications | 2016
Khaled E. Ahmed; Kareem M. Attiah; Ahmed S. Eltrass
In this work, a multiple-target tracking problem for a Wi-Fi through-wall system is formulated, and a new Direction Of Arrival (DOA) angle estimation technique is investigated to solve the tracking problem in the presence of clutter. For the first time, the DOA estimation from objects behind walls is investigated utilizing the MUltiple SIgnal Classification (MUSIC) algorithm compensated by the Extended Kalman Particle Filtering (EKPF) technique. Simulation results show that the stand-alone MUSIC algorithm fails to identify two distinct objects having close DOAs and fails to track targets when they are moving close to each other. The results also reveal that the EKPF algorithm in conjunction with the MIMO nulling technique correctly identifies close and overshadowed moving objects and improves the tracking success rate.
reconfigurable computing and fpgas | 2016
Khaled E. Ahmed; Mohamed R. M. Rizk; Mohammed M. Farag
Networks on Chip (NoCs) have replaced on-chip buses as the paramount communication strategy in large-scale Systems-on-Chips (SoCs). Code Division Multiple Access (CDMA) has been proposed as an interconnect fabric that can achieve high throughput and fixed transfer latency due to the CDMA transmission concurrency. Overloaded CDMA Interconnect (OCI) is an architectural evolution of the conventional CDMA interconnects that can double their bandwidth at marginal cost. Employing OCI in CDMA-based NoCs has the potential of providing higher bandwidth at low power and area overheads compared to other NoC architectures. Furthermore, fixed latency and predictable performance achieved by the inherent CDMA concurrency can reduce the effort and overhead required to implement QoS. In this work, we advance the Overloaded CDMA interconnect for Network on Chip (OCNoC) dynamic central router. The OCNoC router leverages the overloaded CDMA concept to reduce the overall packet transfer latency and improve the network throughput at a negligible area overhead. Dynamic code assignment is adopted to reduce the decoding complexity and transfer latency and maximize the crossbar utilization. Two OCNoC solutions are advanced: serial and parallel CDMA encoding schemes. The OCNoC central routers are implemented and validated on a Virtex-7 VC709 FPGA kit. Evaluation results show a throughput enhancement of up to 142% with a 1.7% variation in packet latencies. Synthesized using a 65 nm ASIC standard cell library, the presented ASIC OCNoC router requires 61% less area per processing element at 81.5% saving in energy dissipation compared to conventional CDMA-based NoCs.
international conference on microelectronics | 2016
Mahmoud A. Elmohr; Mostafa A. Saleh; Ahmed S. Eissa; Khaled E. Ahmed; Mohammed M. Farag
Secure Hash Algorithm 3 (SHA-3), based on the Keccak algorithm, is the new standard cryptographic hash function announced by the National Institute of Standards and Technology (NIST). Hash functions are a ubiquitous computing tool that is commonly used in security, authentication, and many other applications. The calculation of SHA-3 is very computationally intensive, limiting its applicability on RISC processors used in modern embedded systems and Systems-on-Chips (SoCs). In this work, we study the SHA-3 computation bottlenecks on a 32-bit RISC processor and introduce two Application Specific Instruction Set Processor (ASIP) architectures to speed up SHA-3 computation on the 32-bit MIPS processor. The two ASIP architectures, namely native datapath and coprocessor-based ASIPs, are developed with the aid of Codasip Studio, then implemented and evaluated on a Xilinx Virtex-6 FPGA. Compared to the reference SHA-3 execution on MIPS, the evaluation results show a 25% and 61.4% speedup for the native and coprocessor-based ASIPs at the expense of 8.6% and 25.8% resource overheads, respectively.
international conference on microelectronics | 2016
Khaled E. Ahmed; Mohamed R. M. Rizk; Mohammed M. Farag
Code Division Multiple Access (CDMA) is proposed as the physical layer enabler of Network-On-Chip (NoC) interconnects for its prominent features such as fixed latency, guaranteed service, and reduced system complexity. CDMA interconnects were adopted by the NoC community in the form in which CDMA originated in wireless communications, where each bit in a CDMA-encoded data word is transmitted on a separate channel to avoid interference. However, the wireless interference problem can be efficiently mitigated in on-chip interconnects, eliminating the need for replicating the CDMA channel. Moreover, wireless channels are sequential by nature, which is not the case in on-chip interconnects, where parallel buses are the default communication means. Since CDMA was adopted by the NoC community, the same wireless CDMA scheme has been maintained, where each data bit is encoded in a separate CDMA channel and the encoding/decoding logic is replicated for data packets. In this work, we present a novel CDMA encoding/decoding scheme called Aggregated CDMA (ACDMA) for NoC interconnects in which all packet bits are encoded in a single CDMA channel, consequently eliminating the area and energy overheads resulting from replicating the channel encoding/decoding logic. The ACDMA NoC crossbar is synthesized on a 45-nm standard-cell process. Compared to the conventional CDMA NoC crossbars, the presented method achieves 60.5% less area, 55% less power consumption, and a 124% higher throughput-per-area ratio.