Ali A. El-Moursy
University of Sharjah
Publications
Featured research published by Ali A. El-Moursy.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2015
Sayed Taha Muhammad; Rabab Ezz-Eldin; Magdy A. El-Moursy; Ali A. El-Moursy; Amr M. Refaat
A large amount of leakage power can be saved by increasing the number of idle virtual channels (VCs) in a network-on-chip (NoC). A low-leakage-power switch is proposed to reduce the power dissipation of the NoC. The proposed NoC switch employs power-supply gating to reduce power dissipation. Two power reduction techniques are exploited to design the proposed switch. An adaptive virtual channel technique is proposed as an efficient way to reduce the active area using a hierarchical multiplexing tree. Moreover, power gating (PG) reduces the average leakage power consumption of the proposed switch. The proposed techniques save up to 97% of the switch leakage power. In addition, the dynamic power is reduced by 40%. The traffic-based virtual channel activation (TVA) algorithm is used to determine the traffic status and send adaptation signals to the PG units to activate/deactivate the VCs. The TVA algorithm optimally utilizes VCs by deactivating idle VCs to guarantee high leakage power saving with high throughput. TVA is an efficient and flexible algorithm that defines a set of parameters to achieve minimum degradation in NoC throughput with maximum reduction in leakage power. The whole-network average leakage power is reduced by up to 80% for a 2-D-mesh NoC with throughput degradation of only 1%. For a 2-D-torus NoC, a power saving of up to 84% is achieved with <2% degradation in throughput. The implementation overhead of TVA is negligible.
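The adaptation loop the abstract describes (monitor traffic, signal the PG units, power VC groups on or off) can be sketched as follows. The thresholds, group size, and starting state here are illustrative assumptions, not the paper's actual TVA parameters.

```python
# Minimal sketch of a traffic-based VC activation policy in the spirit
# of TVA. Thresholds and group size are illustrative, not the paper's.

class VCGroupController:
    def __init__(self, total_vcs=8, group_size=2,
                 activate_threshold=0.75, deactivate_threshold=0.25):
        self.total_vcs = total_vcs
        self.group_size = group_size      # VCs gated together as one group
        self.active_vcs = group_size      # start with one group powered on
        self.hi = activate_threshold      # occupancy above this -> power up
        self.lo = deactivate_threshold    # occupancy below this -> power down

    def adapt(self, occupied_vcs):
        """Emit an adaptation decision to the power-gating units based on
        how many of the currently active VCs hold traffic."""
        occupancy = occupied_vcs / self.active_vcs
        if occupancy > self.hi and self.active_vcs < self.total_vcs:
            self.active_vcs += self.group_size    # activate another VC group
        elif occupancy < self.lo and self.active_vcs > self.group_size:
            self.active_vcs -= self.group_size    # gate off an idle group
        return self.active_vcs
```

Under heavy traffic the controller grows the active VC pool toward `total_vcs`; when the port goes idle, it gates groups back off, which is where the leakage saving comes from.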
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Ali A. El-Moursy; Fadi N. Sibai
With the increase in the number of cores integrated in a single-chip microprocessor, the design of an efficient shared last-level cache (LLC) becomes more critical to microprocessor performance. In this paper, the authors propose a v-set cache design for the LLC of multi-core microprocessors. The proposed design copes with variation in the set access pattern to reduce conflict misses while still taking advantage of accessing multiple cache blocks simultaneously for fast cache search of the set-associative LLC. On a four-core microprocessor, the newly proposed LLC design achieves maximum speedups of 20% and 10%, and average speedups of 8% and 6%, compared to the conventional n-way set-associative cache and the v-way cache, respectively.
International Conference on Energy Aware Computing | 2015
Sayed Taha Muhammad; Magdy A. El-Moursy; Ali A. El-Moursy; Amr M. Refaat
Low leakage power with maintained high NoC throughput is achieved. The Traffic-based Virtual channel Activation (TVA) algorithm is presented to determine the traffic load status at the NoC switch ports. Consequently, adaptation signals are sent to activate or deactivate switch-port VC groups. The algorithm is optimized to minimize power dissipation for a target throughput. The TVA algorithm optimally utilizes VCs by deactivating idle VC groups to guarantee high leakage power saving without affecting NoC throughput. Network average leakage power is reduced for different topologies (such as 2D-mesh and 2D-torus).
Journal of Circuits, Systems, and Computers | 2014
Ali A. El-Moursy; Fadi N. Sibai
Developments in VLSI design allow multiple to many cores to be integrated on a single microprocessor chip. This increase in core count per chip makes it more critical to design an efficient memory subsystem, especially the shared last-level cache (LLC). Efficient utilization of the LLC is a dominant factor in achieving the best microprocessor throughput. The conventional set-associative cache cannot cope with the new access pattern of cache blocks in multi-core processors. In this paper, the authors propose a new LLC design for multi-core processors. The proposed v-set cache design allows adaptive and dynamic utilization of the cache blocks. Unlike recently proposed designs such as the v-way cache, the v-set cache design limits the serial access of cache blocks. We thoroughly study the proposed design, including area and power consumption as well as performance and throughput. On an eight-core microprocessor, the proposed v-set cache design achieves maximum speedups of 25% and 12% and average speedups of 16% and 6% compared to the conventional n-way and v-way cache designs, respectively. The area overhead of v-set does not exceed 7% compared to the n-way cache.
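For context, the conventional n-way set-associative baseline that v-set is compared against can be modeled in a few lines: when more than `ways` distinct blocks map to the same set they thrash with conflict misses, which is exactly the access-pattern variation the v-set design targets. Sizes and the LRU policy below are illustrative; the v-set mechanism itself is not modeled here.

```python
# Toy n-way set-associative cache: the conventional baseline design.
# A set that receives more than `ways` hot blocks suffers conflict misses.

class SetAssociativeCache:
    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # each set holds up to `ways` tags, most-recently-used first
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        index = addr % self.num_sets     # set-index bits
        tag = addr // self.num_sets      # remaining address bits
        s = self.sets[index]
        if tag in s:                     # hardware searches all ways in parallel
            s.remove(tag)
            s.insert(0, tag)             # move to MRU position
            self.hits += 1
            return True
        if len(s) == self.ways:          # set full: evict the LRU way
            s.pop()
        s.insert(0, tag)
        self.misses += 1
        return False
```

Three blocks cycling through one two-way set never hit, even while other sets sit empty; the abstract's point is that a fixed per-set capacity cannot absorb such skewed access patterns.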
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Walid El-Reedy; Ali A. El-Moursy; Hossam A. H. Fahmy
In modern computer systems, long memory latency is one of the main bottlenecks micro-architects face in leveraging system performance, especially for memory-intensive applications. This emphasises the importance of memory access scheduling for efficiently utilizing memory bandwidth. Moreover, in recent microprocessors, multithreaded and multicore designs have become the default choice, resulting in more contention for memory. Hence, the effect of memory access scheduling schemes is more critical to the overall performance boost. Although memory access scheduling techniques have recently been proposed for performance improvement, most of them have overlooked fairness among the running applications. Achieving both high throughput and fairness simultaneously is challenging. In this paper, we focus on the basics of memory request scheduling: how to assign priorities to threads, which request should be served first, and how to achieve fairness among the running applications on multicore microprocessors. We propose two new memory access scheduling techniques, FLRMR and FIQMR. Compared to recently proposed techniques, FLRMR achieves an average speedup of 8.64% relative to the LREQ algorithm, and FIQMR achieves an average speedup of 11.34% relative to the IQ-based algorithm. FLRMR outperforms the best of the other techniques by 8.1% on 8-core workloads. Moreover, FLRMR improves fairness over LREQ by 77.2% on average.
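The exact FLRMR and FIQMR policies are defined in the paper; as a purely hypothetical illustration of the fairness concern the abstract raises, one generic fairness-aware policy is to serve the oldest request of the thread that has been served least so far, so no thread can monopolize the memory controller.

```python
# Hypothetical fairness-aware memory request scheduler (NOT the paper's
# FLRMR/FIQMR policies): least-served thread first, FIFO within a thread.

from collections import defaultdict, deque

class FairMemoryScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # per-thread FIFO of pending requests
        self.served = defaultdict(int)     # requests served per thread so far

    def enqueue(self, thread_id, request):
        self.queues[thread_id].append(request)

    def schedule(self):
        """Pick the next request to send to DRAM, or None if all queues
        are empty. Threads with the lowest served count get priority."""
        pending = [t for t, q in self.queues.items() if q]
        if not pending:
            return None
        thread = min(pending, key=lambda t: self.served[t])
        self.served[thread] += 1
        return self.queues[thread].popleft()
```

A throughput-only scheduler would instead favor requests that hit an open DRAM row, which is where the tension between throughput and fairness that the paper addresses comes from.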
Concurrency and Computation: Practice and Experience | 2014
Ali A. El-Moursy; Hanan Elazhary; Akmal A. Younis
Scientific applications represent a dominant sector of compute-intensive applications. Using massively parallel processing systems increases the feasibility of automating such applications because of the cooperation among multiple processors to perform the designated task. This paper proposes a parallel hidden Markov model (HMM) algorithm for 3D magnetic resonance image brain segmentation using two approaches. In the first approach, a hierarchical/multilevel parallel technique is used to achieve high performance for the running algorithm. This approach can speed up the computation by up to 7.8× compared with a serial run. The second approach is orthogonal to the first and helps obtain a minimum error for 3D magnetic resonance image brain segmentation using multiple processes with different randomization paths for cooperative fast minimum-error convergence. This approach achieves a minimum error level for HMM training not achievable by serial HMM training on a single node. The two approaches are then combined to achieve both high accuracy and high performance simultaneously. For 768 processing nodes of a Blue Gene system, the combined approach, which uses both methods cooperatively, can achieve high-accuracy HMM parameters with 98% of the error level and a 2.6× speedup compared with the pure accuracy-oriented approach alone.
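The second approach above amounts to running the training from many random initializations in parallel and keeping the lowest-error result. A toy single-machine analogue is sketched below; the `train_from_seed` function is a stand-in for one full HMM training run (the real work is iterative HMM training on 3D MRI data spread across Blue Gene nodes).

```python
# Toy analogue of multi-start training with different randomization
# paths. train_from_seed is a stand-in for one HMM training run.

import random

def train_from_seed(seed):
    """Stand-in for one training run: returns (error, params)."""
    rng = random.Random(seed)
    params = rng.uniform(-1.0, 1.0)
    error = (params - 0.3) ** 2     # pretend 0.3 is the ideal parameter
    return error, params

def multi_start_train(seeds):
    """Cooperative minimum-error convergence: keep the best result
    over all randomization paths (each path would be its own process)."""
    return min(train_from_seed(s) for s in seeds)

best_error, best_params = multi_start_train(range(16))
```

More paths can only lower (never raise) the best error found, which is why the multi-path run reaches error levels a single serial path does not.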
Concurrency and Computation: Practice and Experience | 2011
Ali A. El-Moursy; Fadi N. Sibai
Two image processing applications, edge detection and image resizing, are studied in this paper on two HPC platforms, namely the Cell BE and the Blue Gene/L machines. We focus on the performance scalability of the studied applications. Our results show that the scale of the problem to be solved strongly affects the fitness of the platform. If the data set fits into the Cell core, the fast on-chip inter-core communication of a multi-core system pays back its high-technology design. On the other hand, the overhead of distant communication in the massively parallel Blue Gene/L machine only shows its benefits for huge data set sizes that would otherwise mandate multiple round-trip data communications between a core's local memory and main memory.
Proceedings of the 1st international forum on Next-generation multicore/manycore technologies | 2008
Ali A. El-Moursy; Ahmed El-Mahdy; Hisham El-Shishiny
3D transpose is an important operation in many large-scale scientific applications such as seismic and medical imaging. This paper proposes a novel algorithm for a fast in-place 3D transpose operation. The algorithm exploits Single Instruction Multiple Data (SIMD) multicore architectures with a software-managed memory hierarchy. Such architectural features are present in next-generation processors such as the Cell Broadband Engine (Cell BE) processor. The algorithm performs transposition at two levels of granularity: at a coarse level, where logical transposition is done by merely transposing the address map at each access; and at a fine-grained level, where physical transposition is done by actual element displacement/swapping. This mix combines the benefit of fast on-chip bandwidth through large transfer sizes while still allowing fine-grained SIMD operations. The transfer rate is further enhanced by batch-transposing spatially joined data along a major axis. Results on the Cell BE processor show substantial utilisation of on-chip communication bandwidth and negligible processing time.
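The coarse-grained "logical transposition" idea, transposing the address map rather than moving elements, can be illustrated in a few lines. This sketch models only the index remapping; the fine-grained SIMD element swapping and DMA batching on the Cell BE are not represented.

```python
# Logical 3D transpose: permute the (i, j, k) -> flat-address map so
# reads through the view see transposed data without moving any element.

def make_view(flat, dims, perm=(0, 1, 2)):
    """Return an indexer over row-major `flat` (shape `dims`) whose
    axes are permuted by `perm`; perm=(2, 1, 0) is a full transpose."""
    def index(i, j, k):
        coords = (i, j, k)
        orig = [0, 0, 0]
        for axis, p in enumerate(perm):
            orig[p] = coords[axis]      # undo the axis permutation
        x, y, z = orig
        return flat[(x * dims[1] + y) * dims[2] + z]
    return index
```

Every access pays a small address-arithmetic cost instead of an upfront O(n) data movement, which is why the paper pairs this with physical transposition at the block level to recover large contiguous transfers.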
Journal of Parallel and Distributed Computing | 2017
Sayed Taha Muhammad; Magdy A. El-Moursy; Ali A. El-Moursy; Hesham F. A. Hamed
Synchronous NoCs suffer from performance degradation due to clock skew, which is more pronounced with process variation (PV). Although asynchronous NoCs suffer from handshaking overhead, their immunity to PV is better than that of synchronous networks, which favors them in terms of throughput. Architecture-level analysis aims to determine the ability of different NoC communication schemes to mitigate the impact of PV. The proposed analysis depends on a redeveloped simulator, a unique PV-aware simulator for both synchronous and asynchronous NoCs. Architecture-level simulation shows that clock skew causes significant performance degradation in synchronous networks. Clock skew represents 27% and 32% of the delay variation for 45 nm and 32 nm technologies, respectively. Using real traffic, architecture-level analysis shows considerable throughput reduction for synchronous NoCs under PV conditions. Throughput degradation of synchronous NoCs increases rapidly as technology scales down. A 64-core synchronous NoC loses 30% of its nominal throughput for 45 nm technology and 41% for 32 nm with PV. On the other hand, a 64-core asynchronous network's throughput degradation is 12% and 13.6% for 45 nm and 32 nm technologies, respectively. For different NoC dimensions and using different workloads, the throughput reduction of the synchronous design is more than double that of the asynchronous design. The asynchronous scheme is preferable as technology scales. At the system level, interconnect PV delay is no longer ignorable (9% of total delay), and clock skew represents 32% of total NoC delay for synchronous NoCs. Throughput reduction for the asynchronous NoC is less than half that of the synchronous NoC, and the asynchronous design can better adapt to PV than the synchronous design. The performance-variation trend with technology scaling at the system level is similar to that of the circuit-level analysis.
Journal of Circuits, Systems, and Computers | 2015
Ali A. El-Moursy; Wael S. Afifi; Fadi N. Sibai; Salwa Nassar
STRIKE is an algorithm that predicts protein–protein interactions (PPIs); it determines that proteins interact if they contain similar substrings of amino acids. Unlike other methods for PPI prediction, STRIKE achieves a reasonable improvement over existing PPI prediction methods. Despite its high accuracy as a PPI prediction method, STRIKE consumes a large execution time and is hence considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing (HPC) systems. Using a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 h on 16 computing nodes, and down to about 2 h on 128 parallel nodes. Communication overheads between nodes are thoroughly studied.
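The core idea, flagging an interaction when two amino acid sequences share similar substrings, can be sketched with an exact k-mer match. This is a deliberate simplification: STRIKE's actual notion of "similar" substrings and its scoring are more elaborate, and the paper's contribution is parallelizing that computation over protein pairs across cluster nodes.

```python
# Simplified stand-in for STRIKE's substring test: do two amino acid
# sequences share any exact substring of length k? (STRIKE's real
# similarity measure is richer than exact matching.)

def shares_kmer(seq_a, seq_b, k=4):
    """True if the sequences share any length-k substring."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    return any(seq_b[j:j + k] in kmers_a
               for j in range(len(seq_b) - k + 1))
```

Run over all protein pairs this is quadratic in the proteome size, which is why the serial run takes about a week and why distributing the pair comparisons across nodes pays off.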