Fadi N. Sibai
College of Information Technology
Publications
Featured research published by Fadi N. Sibai.
International Multi-Conference on Systems, Signals and Devices | 2008
Hashir Karim Kidwai; Fadi N. Sibai; Tamer Rabie
The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor engines (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth, and compute power. In this paper, we describe a 2D graphics algorithm for image resizing which we parallelized and developed on the Cell BE. We report the performance measured on one Cell blade with varying numbers of synergistic processor engines enabled. These results were compared to those obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that the Cell processor can outperform modern RISC processors by 20x on SIMD compute-intensive applications such as image resizing.
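A minimal sketch of the data-parallel split such a port implies: output rows are divided into independent bands, one per worker, mirroring a one-band-per-SPE assignment. The image dimensions, worker count, and nearest-neighbor kernel are illustrative assumptions, not the authors' actual Cell kernel.

    /* Sketch: row-partitioned nearest-neighbor image resize. */
    #include <pthread.h>
    #include <stdio.h>

    #define SW 1024      /* source width  (assumed) */
    #define SH 768       /* source height (assumed) */
    #define DW 2048      /* destination width  */
    #define DH 1536      /* destination height */
    #define NWORKERS 8   /* one worker per SPE */

    static unsigned char src[SH][SW], dst[DH][DW];

    typedef struct { int row0, row1; } slice_t;

    static void *resize_slice(void *arg)
    {
        slice_t *s = (slice_t *)arg;
        for (int y = s->row0; y < s->row1; y++) {
            int sy = y * SH / DH;                 /* nearest source row */
            for (int x = 0; x < DW; x++)
                dst[y][x] = src[sy][x * SW / DW]; /* nearest source column */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        slice_t   sl[NWORKERS];
        int rows = DH / NWORKERS;

        for (int i = 0; i < NWORKERS; i++) {      /* independent row bands */
            sl[i].row0 = i * rows;
            sl[i].row1 = (i == NWORKERS - 1) ? DH : (i + 1) * rows;
            pthread_create(&tid[i], NULL, resize_slice, &sl[i]);
        }
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        printf("resized %dx%d -> %dx%d with %d workers\n", SW, SH, DW, DH, NWORKERS);
        return 0;
    }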
Microprocessors and Microsystems | 2009
Abu Asaduzzaman; Fadi N. Sibai; Manira Rani
In order to satisfy the need for ever-increasing computer processing power, there have been significant changes in the design of modern computing systems. Major chip vendors are deploying multicore or manycore processors in their product lines. Multicore architectures offer a tremendous amount of processing power. At the same time, they pose challenges for embedded systems, which operate with limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements of different embedded systems. Normally, a level-1 cache (CL1) memory is dedicated to each core. However, the level-2 cache (CL2) can be shared (as in the Intel Xeon and IBM Cell) or distributed (as in the AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared vs. distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use the VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics.
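A first-order sketch of how a shared CL2's contention can be contrasted with a distributed (per-core) CL2 in a mean-delay model. All hit rates, latencies, and the linear contention term are illustrative assumptions, not the paper's VisualSim/Heptane parameters.

    /* Sketch: mean memory delay per core, shared vs. distributed L2. */
    #include <stdio.h>

    static double mean_delay(int cores, int shared_cl2)
    {
        double l1_hit = 0.90, l2_hit = 0.95;      /* assumed hit rates */
        double t_l1 = 1, t_l2 = 10, t_mem = 100;  /* cycles (assumed)  */
        /* crude contention term: a shared L2 serializes requests across cores */
        double contention = shared_cl2 ? 0.5 * (cores - 1) : 0.0;
        double l2_time = t_l2 + contention;
        return t_l1 + (1 - l1_hit) * (l2_time + (1 - l2_hit) * t_mem);
    }

    int main(void)
    {
        for (int n = 1; n <= 8; n *= 2)
            printf("%d cores: distributed %.2f, shared %.2f cycles\n",
                   n, mean_delay(n, 0), mean_delay(n, 1));
        return 0;
    }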
Microprocessors and Microsystems | 2008
Fadi N. Sibai
The benefits and deficiencies of shared and private caches have been identified by researchers. The performance impact of privatizing or sharing caches on homogeneous multi-core architectures is less well understood. This paper investigates the performance impact of cache sharing on a homogeneous same-ISA 16-core processor with private first-level (L1) caches by considering 3 cache models which vary the sharing property of the second-level (L2) and third-level (L3) cache banks. It is observed that across many scenarios, cache privatization's average memory access time improved as the L1 cache miss rate increased and/or the cross-partition interconnect latencies increased. Under a uniform memory address distribution, and when the L3 cache miss rate is close to 0, privatizing both L2s and L3s performs best among the 3 cache models. Furthermore, we mathematically demonstrate that when the interconnect's bridge latency is below 264 cycles, privatizing L2 caches beats privatizing both L2 and L3 caches, while the reverse is true for large bridge latencies representing high-traffic and heavy workload applications. For large interconnect delays, the private L2 and L3 model is best. For low to moderate interconnect latencies, and when the L3 miss rate is not close to 0, sharing both L2 and L3 banks among all cores performs best, followed by privatizing L2s, while privatizing both L2s and L3s ranks last. Under worst-case address distributions, the benefits of cache privatization generally increase, and with large bridge latencies, privatizing L2 and L3 banks outperforms the other cache models. This reveals that as application workloads become heavier with time, resulting in large cache miss rates and long bridge and interconnect delays, privatizing L2 and L3 caches may prove beneficial. Under less stressful workloads, sharing both L2 and L3 caches has the upper hand. This study confirms the desirability of making the cache memory's sharing degree configurable based on the running workload.
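A sketch of the kind of crossover analysis described above: average memory access time (AMAT) for a private-L2/shared-L3 model versus a fully private L2+L3 model as the bridge latency grows. The miss rates and latencies below are assumed for illustration, so the crossover here will not land at the paper's 264 cycles.

    /* Sketch: AMAT vs. bridge latency for two cache-sharing models. */
    #include <stdio.h>

    /* Model A: private L2, shared L3 reached over the bridge on L2 misses. */
    static double amat_private_l2(double bridge)
    {
        double m1 = 0.10, m2 = 0.30, m3 = 0.05;   /* assumed miss rates */
        double t1 = 2, t2 = 12, t3 = 40, tmem = 300;
        return t1 + m1 * (t2 + m2 * (bridge + t3 + m3 * tmem));
    }

    /* Model B: private L2 and L3; bridge crossed only on L3 misses.
     * The smaller private L3 is assumed to miss more often. */
    static double amat_private_l2_l3(double bridge)
    {
        double m1 = 0.10, m2 = 0.30, m3 = 0.15;
        double t1 = 2, t2 = 12, t3 = 40, tmem = 300;
        return t1 + m1 * (t2 + m2 * (t3 + m3 * (bridge + tmem)));
    }

    int main(void)
    {
        for (double b = 0; b <= 600; b += 100)    /* sweep bridge latency */
            printf("bridge %3.0f cyc: privL2 %.2f, privL2+L3 %.2f\n",
                   b, amat_private_l2(b), amat_private_l2_l3(b));
        return 0;
    }

With these numbers, the private-L2 model wins at small bridge latencies and loses once the bridge term (paid on every L2 miss) dominates, which is the shape of the trade-off the paper quantifies.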
International Conference on Parallel Processing | 2009
Fadi N. Sibai
Networks-on-chip in many-core embedded systems consume large portions of the chip's area, cost, delay, and power. In real-time embedded systems, meeting the real-time targets is critical. Therefore, networks-on-chip must provide a communication infrastructure with worst-case delays low enough to meet the time deadlines. This requirement directly translates into scalable networks with low diameters. Furthermore, with a large number of cores, cost, area, and power become prime issues. One way to achieve these goals is by sharing system resources such as switches and employing circuit switching. We explore 4 on-chip interconnection networks (OCINs) in 64-core systems with switches shared by cores in core clusters and estimate their worst-case latencies with Peh and Dally's router delay model and published wire delays. For these 4 OCINs, we also derive their diameters, average delays, switch degrees, and total link costs and compare them to the standard 2D mesh OCIN. Results indicate that switch sharing by core clusters is effective in reducing the worst-case and average communication delays, and the total number of links and switches.
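A back-of-envelope sketch of why switch sharing lowers hop counts: a 64-core flat 8x8 mesh versus a 4x4 mesh of switches each shared by a 4-core cluster. The per-hop delay is a placeholder constant, not Peh and Dally's router delay model.

    /* Sketch: hop-count comparison, flat mesh vs. clustered switches. */
    #include <stdio.h>

    /* diameter of a k x k mesh under XY routing */
    static int mesh_diam(int k) { return 2 * (k - 1); }

    /* mean Manhattan distance: twice the mean 1D distance (k^2-1)/(3k) */
    static double mesh_mean(int k) { return 2.0 * (k * k - 1) / (3.0 * k); }

    int main(void)
    {
        int hop_delay = 3;              /* assumed cycles per router hop */
        int k_flat = 8, k_clust = 4;    /* 64 cores; 4 cores per shared switch */

        int    d_flat  = mesh_diam(k_flat);
        double a_flat  = mesh_mean(k_flat);
        /* clustered: +2 hops for the core<->switch injection/ejection links */
        int    d_clust = mesh_diam(k_clust) + 2;
        double a_clust = mesh_mean(k_clust) + 2;

        printf("flat 8x8 mesh  : diameter %2d hops (%2d cyc), mean %.2f hops\n",
               d_flat, d_flat * hop_delay, a_flat);
        printf("4-core clusters: diameter %2d hops (%2d cyc), mean %.2f hops\n",
               d_clust, d_clust * hop_delay, a_clust);
        return 0;
    }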
International Symposium on Parallel and Distributed Processing and Applications | 2008
Fadi N. Sibai
This paper makes the case for the Hyper-Ring (HR) as the interconnect or NoC for many-cores. While other prominent candidates for many-core interconnects such as the torus and mesh have superior bisection bandwidth to the HR, their cost, number of links, and chip area are much higher than the HR's. The worst-case latency or maximum hop count is relatively inferior on the mesh, while that of the HR is comparable to that of the torus at about 100 core nodes. Moreover, the HR more closely resembles LANs and WANs, which have dedicated gateway or router nodes. We present a many-core floorplan where adjacent pairs of cores share an L2 cache memory and where the cores are interconnected according to the HR topology. We also address L2 cache partitioning and coherence in HR-connected many-core processors and how such a processor can tolerate faulty links and nodes. The HR-connected many-core processor is naturally tolerant of core and link faults owing to its dual directional rings. To keep the same amount of operational hardware and the same performance levels in the presence of faults, we present a circuit for bypassing faulty cores and replacing them with spare cores.
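A sketch of the bypass-and-spare idea in the final sentence: logical ring positions are remapped past faulty nodes onto healthy ones, with spares at the end absorbing the shift. The fault vector and sizes are hypothetical, and the paper's actual mechanism is a hardware circuit rather than software.

    /* Sketch: remap logical ring positions around faulty cores onto spares. */
    #include <stdio.h>

    #define CORES  8   /* logical ring positions */
    #define SPARES 2   /* spare cores appended to the ring */

    int main(void)
    {
        /* 1 marks a faulty physical node; here core 2 has failed */
        int faulty[CORES + SPARES] = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0};
        int map[CORES], phys = 0;

        for (int logical = 0; logical < CORES; logical++) {
            while (faulty[phys]) phys++;   /* bypass the faulty node */
            map[logical] = phys++;         /* later positions shift toward spares */
        }
        for (int i = 0; i < CORES; i++)
            printf("logical core %d -> physical node %d\n", i, map[i]);
        return 0;
    }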
International Conference on Innovations in Information Technology | 2006
Fadi N. Sibai
One quick and fairly accurate way to assess the performance of a computer system and its components is to run the popular PCMark®05 (Niemela, 2005) benchmark. We analyze the performance scaling of the PCMark®05 CPU test suite with higher CPU frequencies and numbers of logical processors and/or CPU cores, and dissect this suite and characterize its workload in an attempt to assess performance scaling and to extract performance data useful for guiding processor design enhancements. The results indicate that this benchmark is suitable for distinguishing the performance of multi-core processors with up to 4 hardware threads. On multithreaded tests, score gains due to doubling the CPUs exceeded score gains due to increasing the frequency of one CPU core by 1 GHz.
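A back-of-envelope Amdahl's-law sketch of the two scaling knobs compared above, doubling the core count versus adding 1 GHz to a single core; the parallel fraction and base frequency are assumptions, not measured PCMark®05 values.

    /* Sketch: core doubling vs. +1 GHz under Amdahl's law. */
    #include <stdio.h>

    int main(void)
    {
        double p  = 0.90;  /* assumed parallel fraction of a multithreaded test */
        double f0 = 2.0;   /* assumed base frequency, GHz */

        double dual_core = 1.0 / ((1 - p) + p / 2.0);  /* Amdahl, 2 cores  */
        double plus_1ghz = (f0 + 1.0) / f0;            /* ideal freq scaling */

        printf("2x cores: %.2fx speedup\n", dual_core); /* 1.82x here */
        printf("+1 GHz  : %.2fx speedup\n", plus_1ghz); /* 1.50x here */
        return 0;
    }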
Transactions on Computational Science | 2009
Fadi N. Sibai; Hashir Karim Kidwai
With the oil barrel price presently crippling the world economy, developing fast oil reservoir simulators is as important as ever. This article describes the parallelization and development of a 2-phase oil-water reservoir simulator on the state-of-the-art IBM Cell computer. The interdependent linear algebraic equations of the reservoir simulator are presented, as well as the pipelined time step parallelization approach adopted on the Cell. The performance results reveal that, given the largely interdependent nature of the oil reservoir model equations, which highly limits parallelism, speedups of 6x or higher could be obtained. This speedup is significant as it cuts oil simulation runs from weeks to days, allowing more simulation runs with various well placements on the same hardware, resulting in better reservoir management and possibly higher oil production. The results also demonstrate that the oil reservoir simulator application is characterized by higher speedups with increasing grid size. However, the speedup was shown to decrease with an increasing number of time steps as the main memory transfer overhead becomes an important factor. Proper choice of compiler optimization flags helped boost the performance by a factor of 2x. Our parallelization approach is economically feasible due to the affordable cost of the widely available Cell-based PlayStation 3.
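A toy sketch of the dependence structure that enables pipelined time stepping: each cell's next-step value needs only neighboring cells from the previous step, so successive steps can trail one another in a staggered pipeline. The upwind-style saturation update below is illustrative, not the simulator's actual equations.

    /* Sketch: 1D water-saturation grid with pipeline-friendly dependences. */
    #include <stdio.h>

    #define N     16   /* grid cells */
    #define STEPS  4   /* time steps */

    int main(void)
    {
        double sw[STEPS + 1][N];   /* water saturation at each step */
        for (int i = 0; i < N; i++) sw[0][i] = (i == 0) ? 1.0 : 0.2;

        /* once cells 0..i of step t are done, step t+1 can already start
         * on cell i-1 -- this overlap is the pipelining opportunity */
        for (int t = 0; t < STEPS; t++)
            for (int i = 0; i < N; i++) {
                double up = (i > 0) ? sw[t][i - 1] : sw[t][i];
                sw[t + 1][i] = sw[t][i] + 0.3 * (up - sw[t][i]); /* toy upwind */
            }

        for (int i = 0; i < N; i++) printf("%.2f ", sw[STEPS][i]);
        printf("\n");
        return 0;
    }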
International Journal of High Performance Systems Architecture | 2008
Fadi N. Sibai
Homogeneous multi-core architectures come in a variety of floorplans. A large portion of the chip area is occupied by the second-level cache memories and the interconnect. Usually, the floorplan is dictated by the interconnect type and the second-level cache memory size, and by whether large resources on the chip are shared or private. We consider floorplans for a 16-core architecture consisting of identical cores and estimate their areas based on two different processor cores. Then, we focus on an area-efficient and performance-scaling four-partition crossbar-interconnected architecture with third-level cache memory and conduct performance modelling and evaluation of the multi-core architecture's memory system. With a database workload and a first-level cache miss rate under 20%, the Average Memory Access Time (AMAT) is estimated to be under 20 processor cycles. With higher memory contention resulting in longer bridge queue wait times, the various cache miss rates have a more pronounced effect on the chip multiprocessor's AMAT, necessitating design measures such as proper cache sizing. When the size of shared resources becomes too large, recent work on sharing and partitioning large resources in CMP architectures becomes crucial.
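A sketch of the three-level AMAT calculation behind the "under 20 cycles" estimate. Only the sub-20% L1 miss rate comes from the abstract; the latencies and lower-level miss rates are assumptions.

    /* Sketch: three-level AMAT swept over the L1 miss rate. */
    #include <stdio.h>

    int main(void)
    {
        double t1 = 2, t2 = 12, t3 = 40, tmem = 200;  /* cycles, assumed */
        double m2 = 0.20, m3 = 0.10;                  /* assumed miss rates */

        for (int pct = 5; pct <= 20; pct += 5) {
            double m1 = pct / 100.0;
            /* AMAT = t1 + m1*(t2 + m2*(t3 + m3*tmem)) */
            double amat = t1 + m1 * (t2 + m2 * (t3 + m3 * tmem));
            printf("L1 miss %2d%% -> AMAT %.2f cycles\n", pct, amat);
        }
        return 0;
    }

With these placeholder parameters the AMAT stays in single digits even at a 20% L1 miss rate; adding a bridge-queue wait term to the L2/L3 latencies reproduces the contention effect the abstract describes.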
High Performance Computing and Communications | 2009
Fadi N. Sibai; Hashir Karim Kidwai
Oil reservoir simulation helps in extracting oil and in optimal well placement. This paper presents the parallelization, development, performance analysis, and profiling of a 2-phase oil-water reservoir simulator on a heterogeneous multi-core STI Cell computer. Despite the largely interdependent nature of the oil reservoir model equations, we obtained speedups of 6x with 1D reservoir data. We boosted the performance by 2x with the right selection of compiler options. We also ran our application on the SystemSim simulator to collect performance events. Profiling and trace analysis of the application were conducted with VAMPIR, a performance and trace analysis tool for HPC applications. The results from both tools are also presented.
International Conference on Innovations in Information Technology | 2008
Hashir Karim Kidwai; Fadi N. Sibai; Tamer Rabie
The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor engines (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth, and compute power. In this paper, we discuss the parallelization, implementation, and performance of the edge detection image processing application on the IBM Cell BE. We report the edge detection performance measured on a computer with one Cell processor and with varying numbers of synergistic processor engines enabled. These results were compared to the results obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that edge detection performs 10 times faster on the Cell BE than on modern RISC processors.
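A minimal sketch of the band-parallel split a Cell port of edge detection suggests: each worker filters an independent band of rows while reading the shared input image. The Sobel kernel is a standard edge detector; the image size, band count, and serial driver loop are illustrative assumptions, not the authors' implementation.

    /* Sketch: Sobel edge detection over independent row bands. */
    #include <stdlib.h>
    #include <stdio.h>

    #define W 640
    #define H 480
    #define BANDS 8   /* one band per SPE in a Cell port */

    static unsigned char img[H][W], out[H][W];

    static void sobel_band(int y0, int y1)
    {
        for (int y = y0; y < y1; y++) {
            if (y == 0 || y == H - 1) continue;   /* skip image border */
            for (int x = 1; x < W - 1; x++) {
                int gx = -img[y-1][x-1] + img[y-1][x+1]
                         - 2*img[y][x-1] + 2*img[y][x+1]
                         - img[y+1][x-1] + img[y+1][x+1];
                int gy = -img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1]
                         + img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1];
                int mag = abs(gx) + abs(gy);      /* cheap |G| estimate */
                out[y][x] = mag > 255 ? 255 : (unsigned char)mag;
            }
        }
    }

    int main(void)
    {
        int rows = H / BANDS;
        for (int b = 0; b < BANDS; b++)  /* bands only read shared input, so
                                            they can run fully in parallel */
            sobel_band(b * rows, (b == BANDS - 1) ? H : (b + 1) * rows);
        printf("edges computed over %d bands\n", BANDS);
        return 0;
    }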