Publication


Featured research published by Mostafa I. Soliman.


Journal of Parallel and Distributed Computing | 2011

FPGA implementation and performance evaluation of a high throughput crypto coprocessor

Mostafa I. Soliman; Ghada Y. Abozaid

This paper describes the FPGA implementation of FastCrypto, which extends a general-purpose processor with a crypto coprocessor for encrypting/decrypting data. Moreover, it studies the trade-offs between FastCrypto performance and design parameters, including the number of stages per round, the number of parallel Advanced Encryption Standard (AES) pipelines, and the size of the queues. In addition, it shows the effect of memory latency on FastCrypto performance. FastCrypto is implemented in VHDL on a Xilinx Virtex-5 FPGA. A throughput of 222 Gb/s at 444 MHz can be achieved on four parallel AES pipelines. To reduce power consumption, the frequency of the four parallel AES pipelines is reduced to 100 MHz while the other components run at 400 MHz. In this case, our results show a FastCrypto performance of 61.725 bits per clock cycle (b/cc) when a 128-bit single-port L2 cache memory is used. However, increasing the memory bus width to 256 bits or using a 128-bit dual-port memory improves the performance to 112.5 b/cc (45 Gb/s at 400 MHz), which represents 88% of the ideal performance (128 b/cc).
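As a quick sanity check on the reported figures, assuming throughput = bits per clock cycle × clock frequency, the short C++ sketch below reproduces the 45 Gb/s and 88% numbers from the 112.5 b/cc result; the constants come from the abstract, the rest is illustrative.

// Sanity check of the reported FastCrypto figures, assuming
// throughput (Gb/s) = bits per clock cycle (b/cc) * frequency (GHz).
#include <cstdio>

int main() {
    const double freq_ghz     = 0.400;  // non-AES components run at 400 MHz
    const double achieved_bcc = 112.5;  // reported bits per clock cycle
    const double ideal_bcc    = 128.0;  // one 128-bit AES block per cycle

    double gbps       = achieved_bcc * freq_ghz;   // 112.5 * 0.4 = 45 Gb/s
    double efficiency = achieved_bcc / ideal_bcc;  // ~0.879, i.e. ~88%

    std::printf("throughput: %.1f Gb/s, efficiency: %.0f%%\n",
                gbps, 100.0 * efficiency);
    return 0;
}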


Journal of Parallel and Distributed Computing | 2008

A highly efficient implementation of a backpropagation learning algorithm using matrix ISA

Mostafa I. Soliman; Samir A. Mohamed

BackPropagation (BP) is the most famous learning algorithm for Artificial Neural Networks (ANNs). BP has received intensive research effort to exploit its parallelism in order to reduce the training time for complex problems. A modified version of BP based on matrix-matrix multiplication was proposed for parallel processing. In this paper, we present the implementation of Matrix BackPropagation (MBP) using scalar, vector, and matrix Instruction Set Architectures (ISAs). In addition, we show that the performance of MBP improves when switching from the scalar ISA to the vector ISA, and improves further when switching from the vector ISA to the matrix ISA. On a practical application, speech recognition, the speedup of training a neural network using the unrolled scalar ISA over the plain scalar ISA is 1.83. On eight parallel lanes, the speedups of using the vector, unrolled vector, and matrix ISAs are 10.33, 11.88, and 15.36, respectively, where the maximum theoretical speedup is 16. These results show that the matrix ISA gives performance close to optimal, because it reuses loaded data, decreases loop overhead, and overlaps memory operations with arithmetic operations.
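The key idea behind MBP, recasting per-pattern weight updates as matrix-matrix products over a batch of training patterns, can be sketched as follows. This is a minimal illustration under that assumption, not the paper's MBP code; the naive GEMM stands in for the scalar/vector/matrix ISA implementations the paper compares.

// Batched backpropagation expressed as matrix-matrix products.
#include <vector>

// C(m x n) += A(m x k) * B(k x n), row-major, naive triple loop.
void gemm(int m, int n, int k, const std::vector<double>& A,
          const std::vector<double>& B, std::vector<double>& C) {
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)        // this loop order reuses A(i,p)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

// One weight update over a batch: dW = X^T * Delta, then W -= lr * dW,
// where X (batch x in) holds layer inputs and Delta (batch x out) errors.
void update_weights(int batch, int in, int out, double lr,
                    const std::vector<double>& X,
                    const std::vector<double>& Delta,
                    std::vector<double>& W /* in x out */) {
    std::vector<double> XT(in * batch);
    for (int i = 0; i < batch; ++i)        // explicit transpose of X
        for (int j = 0; j < in; ++j)
            XT[j * batch + i] = X[i * in + j];
    std::vector<double> dW(in * out, 0.0);
    gemm(in, out, batch, XT, Delta, dW);   // dW = X^T * Delta
    for (int i = 0; i < in * out; ++i)
        W[i] -= lr * dW[i];
}

Expressing the update as a GEMM is what lets a matrix ISA reuse loaded data and overlap memory accesses with arithmetic, which is where the near-optimal speedups above come from.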


high performance computing and communications | 2007

A block JRS algorithm for highly parallel computation of SVDs

Mostafa I. Soliman; Sanguthevar Rajasekaran; Reda A. Ammar

This paper presents a new algorithm for computing the singular value decomposition (SVD) on multilevel memory hierarchy architectures. The algorithm is based on one-sided JRS iteration, which enables all Jacobi rotations of a sweep to be computed in parallel. One key point of the proposed block JRS algorithm is reusing the data loaded into cache memory by performing computations on matrix blocks (b rows) instead of on strips of vectors as in JRS iteration algorithms. Another key point is that, on a reasonably large number of processors, the number of sweeps is less than that of the one-sided JRS iteration algorithm and closer to that of the cyclic Jacobi method, even though not all rotations in a block are independent. A relaxation technique makes it possible to calculate and apply all independent rotations per block at the same time. On blocks of size b×n, block JRS performs O(b²n) floating-point operations on O(bn) elements, reusing the data loaded into cache memory by a factor of b. Moreover, on P parallel processors, (2P-1) block-computation steps are needed per sweep.
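For reference, one-sided Jacobi/JRS-style methods orthogonalize pairs of matrix columns with plane rotations; below is a minimal, unblocked sketch of a single rotation (storage layout and names are chosen here for illustration; the paper's contribution is grouping many such rotations into cache-resident blocks).

// One step of one-sided Jacobi: rotate columns i and j of the m x n
// matrix A (stored column by column) so that they become orthogonal.
#include <cmath>
#include <vector>

void rotate_columns(std::vector<double>& A, int m, int i, int j) {
    double aii = 0, ajj = 0, aij = 0;
    for (int r = 0; r < m; ++r) {          // 2x2 Gram entries of (ai, aj)
        aii += A[i * m + r] * A[i * m + r];
        ajj += A[j * m + r] * A[j * m + r];
        aij += A[i * m + r] * A[j * m + r];
    }
    if (std::abs(aij) < 1e-15) return;     // already orthogonal
    double tau = (ajj - aii) / (2.0 * aij);
    double t = (tau >= 0 ? 1.0 : -1.0) /
               (std::abs(tau) + std::sqrt(1.0 + tau * tau));
    double c = 1.0 / std::sqrt(1.0 + t * t), s = c * t;
    for (int r = 0; r < m; ++r) {          // apply the plane rotation
        double x = A[i * m + r], y = A[j * m + r];
        A[i * m + r] = c * x - s * y;
        A[j * m + r] = s * x + c * y;
    }
}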


international conference on computer engineering and systems | 2007

Mat-core: A matrix core extension for general-purpose processors

Mostafa I. Soliman

This paper proposes a new processor architecture to exploit the increasing number of transistors per integrated circuit and improve the performance of data-parallel applications on general-purpose processors. The proposed processor (called Mat-Core) is based on a multi-level ISA that explicitly communicates data parallelism to the processor in a compact way, instead of dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. Since the fundamental data structures of data-parallel applications are scalars, vectors, and matrices, Mat-Core extends a scalar core (for executing scalar instructions) with a matrix unit (for executing vector/matrix instructions). Like vector microarchitectures, the extended matrix unit is organized in parallel lanes; each lane contains a pipeline of each functional unit and a slice of the register file. However, the Mat-Core processor can effectively process not only vector but also matrix data on the parallel lanes. The growing transistor budget can further be exploited to scale Mat-Core by providing more cores in a physical package; on such a multi-core processor, performance would be improved by executing threads in parallel using multithreading techniques.
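As a conceptual sketch of the lane organization (not Mat-Core's actual microarchitecture), the elements of a vector register can be sliced across the parallel lanes, with each lane operating on the elements whose index is congruent to its lane id:

// With P lanes, lane l owns elements l, l+P, l+2P, ...; all lanes execute
// the same operation on disjoint slices (shown here as sequential loops).
#include <vector>

void vadd_on_lanes(const std::vector<float>& a, const std::vector<float>& b,
                   std::vector<float>& c, int lanes) {
    const int n = static_cast<int>(a.size());
    for (int l = 0; l < lanes; ++l)        // conceptually concurrent lanes
        for (int i = l; i < n; i += lanes)
            c[i] = a[i] + b[i];            // each lane's pipeline slice
}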


international computer engineering conference | 2011

Design and FPGA implementation of a simplified matrix processor

Mostafa I. Soliman; Elsayed A. Elsayed

Data-parallel kernels dominate the computational workload in a wide variety of demanding applications. Since the fundamental data structures of such applications are scalars, vectors, and matrices, this paper proposes a simple matrix processor (SMP) for executing scalar/vector/matrix instructions. Instead of using accelerators to improve the performance of data-parallel applications, SMP uses a multi-level ISA to express parallelism to common hardware. SMP extends the well-known 5-stage pipeline with a matrix register file and a matrix control unit in the decode stage. Scalar/vector/matrix instructions are fetched from the instruction cache, decoded, and executed on the same execution datapath, as sketched below. On a Xilinx Virtex-5 FPGA targeting the xc5vlx50-3ff1153 device, SMP requires 4,138 slices, with 5,853 slice flip-flops and 12,840 4-input LUTs: 12,540 for logic and 300 for RAMs. Moreover, the FPGA implementation of SMP operates at 108 MHz. Our results show speedups of 2.84, 3.82, 3.88, and 7.43 over scalar execution on SAXPY, vector addition, vector scaling, and matrix-matrix multiplication, respectively.
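A toy C++ sketch of the "same execution datapath" idea: scalar, vector, and matrix adds differ only in how many element operations are sequenced through one shared execute function (the names below are illustrative, not SMP's RTL).

#include <cstddef>

float execute_add(float a, float b) { return a + b; }  // shared datapath

void scalar_add(float& d, float a, float b) { d = execute_add(a, b); }

void vector_add(float* d, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) d[i] = execute_add(a[i], b[i]);
}

void matrix_add(float* d, const float* a, const float* b,
                std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < rows * cols; ++i)
        d[i] = execute_add(a[i], b[i]);
}

In SMP itself, this sequencing is performed by the matrix control unit in the decode stage rather than by software loops.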


International Conference on Design & Technology of Integrated Systems in Nanoscale Era | 2009

SystemC implementation of mat-core: A matrix core extension for general-purpose processors

Mostafa I. Soliman; Abdulmajid F. Al-Junaid

Technological advances in IC manufacturing provide the capability to integrate more and more functionality into a single chip; today's processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before being partitioned into hardware and software components for final implementation. This paper explains in detail the implementation of a matrix processor called Mat-Core in SystemC (a system-level modeling language). Mat-Core is a research processor that aims to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix core. Like vector architectures, the extended matrix core is organized in parallel lanes. In addition to vector-scalar and vector-vector instructions, the matrix core can execute matrix-vector and matrix-matrix instructions. Furthermore, to control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique.
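For readers unfamiliar with the modeling style, a minimal SystemC module, a toy combinational adder, looks like the following; this only illustrates SystemC itself and is not part of the Mat-Core model.

// A toy SystemC module: a combinational integer adder.
#include <systemc.h>

SC_MODULE(Adder) {
    sc_in<int>  a, b;
    sc_out<int> sum;
    void compute() { sum.write(a.read() + b.read()); }
    SC_CTOR(Adder) {
        SC_METHOD(compute);        // re-evaluate whenever an input changes
        sensitive << a << b;
    }
};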


international conference on computer engineering and systems | 2011

Shared cryptography accelerator for multicores to maximize resource utilization

Mostafa I. Soliman; Ghada Y. Abozaid

This paper proposes a single crypto unit shared among multiple cores to accelerate the execution of cryptography applications and to make efficient use of on-chip resources. The shared accelerator is based on the AES algorithm, with parallel AES pipelines used to encrypt/decrypt data at high throughput. For simplicity, the host processor contains four cores; each core consists of a simple five-stage, single-issue pipeline. Each core fetches an instruction from its instruction cache and sends it in order to the decode stage. Crypto instructions are pushed into a crypto instruction queue (CIQ) during the decode stage, while scalar instructions complete the remaining cycles of execution on the scalar pipeline stages. The shared crypto unit holds one CIQ per core, and crypto instructions are read from the CIQs in round-robin fashion for execution on the parallel AES pipelines. On a Xilinx Virtex-5 FPGA, our results show a maximum throughput of 45 Gigabits per second (Gb/s) at 400 MHz.
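The round-robin draining of the per-core CIQs can be sketched in software as follows; CryptoInstr and the queue layout are placeholders for illustration, not the paper's hardware design.

#include <queue>
#include <vector>

struct CryptoInstr { /* opcode, operands, ... */ };

// Pick the next crypto instruction, scanning the per-core CIQs in
// round-robin order starting after the last core served; returns false
// when every queue is empty.
bool next_instr(std::vector<std::queue<CryptoInstr>>& ciq,
                int& last_served, CryptoInstr& out) {
    const int n = static_cast<int>(ciq.size());
    for (int k = 1; k <= n; ++k) {
        int c = (last_served + k) % n;
        if (!ciq[c].empty()) {
            out = ciq[c].front();
            ciq[c].pop();
            last_served = c;
            return true;
        }
    }
    return false;
}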


international computer engineering conference | 2011

Efficient implementation of QR decomposition on intel multi-core processors

Mostafa I. Soliman

This paper shows how to make the QR decomposition algorithm run faster on Intel multi-core processors by exploiting explicit parallelism and the memory hierarchy. Streaming SIMD Extensions (SSE) and multithreaded computation on multiple cores are used to exploit data-level parallelism (DLP) and thread-level parallelism (TLP), respectively. In addition, the memory hierarchy is exploited by performing the QR computation on blocks of data, reducing the impact of memory latency by reusing loaded data in cache memories. On a Core 2 Duo E7500 with two cores (2 physical / 2 logical processors), a Core i5 M520 with two cores supporting Hyper-Threading technology (2 physical / 4 logical processors), and a Xeon E5410 with four cores (4 physical / 4 logical processors), the average speedups of the multithreaded SIMD implementation of block QR decomposition, on matrices from 1000×1000 up to 3000×3000 in steps of 100, are about 6.6, 9.6, and 11.3 over unparallelized execution, respectively. On a reasonably large matrix of size 2000×2000 (4000×4000), our experimental results show that the use of Intel streaming SIMD extensions, multithreading, SIMD multithreading, matrix blocking, blocking SIMD, blocking multithreading, and blocking SIMD multithreading speeds up QR decomposition on the Core 2 Duo E7500 by factors of about 2.1 (2.1), 1.8 (1.8), 2.2 (2.2), 1.7 (1.7), 5.6 (5.6), 2.7 (2.6), and 6.6 (6.3); on the Core i5 M520 by factors of about 3.7 (3.6), 2.2 (2.6), 3.8 (4), 1.9 (1.9), 7.9 (7.8), 2.9 (3), and 9.6 (10.7); and on the Xeon E5410 by factors of about 2.6 (2.3), 3.2 (2.8), 4.7 (3), 1.5 (1.5), 5.4 (4.9), 5 (5.1), and 12.1 (7), respectively.
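On the DLP side, Streaming SIMD Extensions process four packed single-precision floats per instruction; a minimal example of such an SSE loop is shown below (a generic axpy-style kernel for illustration, not the paper's blocked QR code).

// Four single-precision operations per SSE instruction: y += alpha * x.
#include <xmmintrin.h>

void axpy_sse(int n, float alpha, const float* x, float* y) {
    __m128 va = _mm_set1_ps(alpha);                // broadcast alpha
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);           // load 4 floats
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));   // y += alpha * x
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                             // scalar remainder
        y[i] += alpha * x[i];
}

Blocking and multithreading then layer cache reuse and TLP on top of this per-instruction DLP, which is why the combined variants above show the largest speedups.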


international computer engineering conference | 2011

LcVc: Low-complexity vector-core for executing scalar/vector instructions

Mostafa I. Soliman

This paper proposes a low-complexity vector core called LcVc for executing both scalar and vector instructions on the same execution datapath. A unified register file in the decode stage stores both scalar operands and vector elements. The execution stage accepts a new set of operands each cycle and produces a new result. Rather than issuing a vector instruction (a 1-D operation) as a whole, its element operations are issued sequentially with the existing scalar issue hardware. All register loads and stores take place from the data cache in the memory access stage at a rate of one element per clock cycle. The hardware required to support the enhanced vector capability is insignificant (a few incrementers and multiplexers), which reduces the area per core and increases the number of cores available in a given chip area. Three key features distinguish the LcVc architecture: a unified ISA for scalar and vector processing, low cost, and simplicity of organization. The use of LcVc approximately doubles the performance of executing vector/matrix kernels such as vector addition, vector scaling, SAXPY, dot product, matrix-vector multiplication, and matrix-matrix multiplication.
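A toy model of this issue scheme: a vector add is not dispatched as one wide operation; instead an incrementer steps the element index each cycle while the scalar datapath is reused (names and structure below are illustrative only).

#include <array>

struct UnifiedRegFile {             // holds scalar operands and vector elements
    std::array<float, 256> r{};
};

// Issue VADD dst, src1, src2 with vector length vl: one element per
// "cycle", sequenced by the existing scalar issue hardware.
void issue_vadd(UnifiedRegFile& rf, int dst, int s1, int s2, int vl) {
    for (int e = 0; e < vl; ++e)    // incrementers step the register indices
        rf.r[dst + e] = rf.r[s1 + e] + rf.r[s2 + e];  // scalar ALU reused
}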


international conference on computer engineering and systems | 2010

Performance evaluation of a high throughput crypto coprocessor using VHDL

Mostafa I. Soliman; Ghada Y. Abozaid

FastCrypto is a general-purpose processor extended with a crypto coprocessor for encrypting/decrypting data at high throughput. This paper studies the trade-offs between the proposed FastCrypto performance and its design parameters, including the number of stages per round, the number of parallel AES pipelines, and the size of the queues. In addition, it shows the effect of memory latency on FastCrypto performance. A throughput of 222 Gigabits per second (Gb/s) at 444 MHz can be achieved on four parallel AES pipelines. To reduce power consumption, the frequency of the parallel AES pipelines is reduced to 100 MHz while the other components run at 400 MHz. Our results show a FastCrypto performance of 61.725 bits per clock cycle (b/cc) when a 128-bit single-port L2 cache memory is used. However, increasing the memory bus width to 256 bits or using a 128-bit dual-port memory improves the performance to 112.5 b/cc (45 Gb/s at 400 MHz), which represents 88% of the ideal performance (128 b/cc).

Collaboration


Dive into Mostafa I. Soliman's collaborations.

Top Co-Authors

Reda A. Ammar

University of Connecticut
