Yasser Y. Hanafy
Virginia Tech
Publications
Featured research published by Yasser Y. Hanafy.
Computing in Science and Engineering | 2009
Amr M. Bayoumi; Michael Chu; Yasser Y. Hanafy; Patricia Harrell; Gamal Refai-Ahmed
This continuing exploration of GPU technology examines ATI Stream technology and its use in scientific and engineering applications.
Proceedings of the 1st International Forum on Next-Generation Multicore/Manycore Technologies | 2008
Amr M. Bayoumi; Yasser Y. Hanafy
Device model evaluation is one of the most time-consuming tasks in analog simulators such as SPICE. Graphics processing unit (GPU) architectures allow massive utilization of vector data on SIMD hardware. In this paper, we present the formulation of double-precision device model equations into a form compatible with stream computing. We show data on isolating typical bottlenecks, especially the communication and kernel-call overheads. Our results indicate speedups of up to 20X when counting overheads, and up to 50X when using techniques to overcome these overheads. In particular, we show that our techniques remain valid for small device counts, a well-known problem for accelerated parallel computing with communication overheads.
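A minimal sketch of the stream-friendly formulation style the paper describes: a simple diode equation (not the paper's actual device models) evaluated in double precision over a whole vector of device voltages in one batched call, which is what amortizes the kernel-call and communication overheads discussed above. All names and constants are illustrative.

import numpy as np

def diode_model_vectorized(v, i_s=1e-14, n=1.0, v_t=0.02585):
    # Evaluate I = Is*(exp(V/(n*Vt)) - 1) and its conductance dI/dV for a
    # whole vector of device voltages in one SIMD-friendly pass.
    x = np.exp(v / (n * v_t))
    current = i_s * (x - 1.0)          # device currents (A), double precision
    conductance = i_s * x / (n * v_t)  # companion-model conductances (S)
    return current, conductance

# Evaluate one million devices in a single batched call; batching is what
# amortizes kernel-launch and host<->device transfer overhead on a GPU.
voltages = np.random.uniform(0.0, 0.7, size=1_000_000)
i, g = diode_model_vectorized(voltages)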
Design Automation Conference | 2015
Ahmed E. Helal; Amr M. Bayoumi; Yasser Y. Hanafy
This paper discusses the development of a parallel SPICE circuit simulator using the direct method on a cloud-based heterogeneous cluster, which includes multiple HPC compute nodes with multiple sockets, multicore CPUs, and GPUs. A simple model is derived to optimally partition the circuit between the compute nodes. The parallel simulator is divided into four major kernels: Partition Device Model Evaluation (PME), Partition Matrix Factorization (PMF), Interconnection Matrix Evaluation (IME), and Interconnection Matrix Factorization (IMF). Another model is derived to assign each of the kernels to the most suitable execution platform of the Amazon EC2 heterogeneous cloud. The partitioning approach using heterogeneous resources achieves an order-of-magnitude speedup over optimized multithreaded implementations of SPICE that use the state-of-the-art KLU and NICSLU packages for the matrix solution.
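A hypothetical sketch of the kernel-to-platform assignment idea: given an estimated cost for each of the four kernels on each available cloud platform, pick the cheapest platform per kernel. The kernel names follow the abstract; the platform names and cost numbers are invented for illustration and are not the paper's model.

# Estimated per-iteration runtime of each simulator kernel on each platform.
# Kernel names follow the abstract; platforms and numbers are invented.
estimated_runtime_ms = {
    "PME": {"gpu_node": 4.0,  "cpu_node": 35.0},
    "PMF": {"gpu_node": 20.0, "cpu_node": 12.0},
    "IME": {"gpu_node": 2.5,  "cpu_node": 6.0},
    "IMF": {"gpu_node": 15.0, "cpu_node": 9.0},
}

# Assign every kernel to its cheapest platform.
assignment = {kernel: min(costs, key=costs.get)
              for kernel, costs in estimated_runtime_ms.items()}
print(assignment)  # {'PME': 'gpu_node', 'PMF': 'cpu_node', 'IME': 'gpu_node', 'IMF': 'cpu_node'}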
IEEE International Symposium on Workload Characterization | 2017
Ahmed E. Helal; Wu-chun Feng; Changhee Jung; Yasser Y. Hanafy
Porting sequential applications to heterogeneous HPC systems requires extensive software and hardware expertise to estimate the potential speedup and to efficiently use the available compute resources in such systems. To streamline this daunting process, researchers have proposed several “black-box” performance prediction approaches that rely on the performance of a training set of parallel applications. However, due to the lack of a diverse set of applications along with their optimized parallel implementations for each architecture type, the speedup predicted by these approaches is not the speedup upper bound, and, even worse, it can be misleading if the reference parallel implementations are not equally optimized for every target architecture. This paper presents AutoMatch, an automated framework for matching compute kernels to heterogeneous HPC architectures. AutoMatch uses hybrid (static and dynamic) analysis to find the best dependency-preserving parallel schedule of a given sequential code. The resulting operation schedule serves as a basis for constructing a cost function of the optimized parallel execution of the sequential code on heterogeneous HPC nodes. Since such a cost function informs the user and runtime system about the relative execution cost across the different hardware devices within HPC nodes, AutoMatch enables efficient runtime workload distribution that simultaneously utilizes all the available devices in a performance-proportional way. For a set of open-source HPC applications with different characteristics, AutoMatch turns out to be very effective, identifying the speedup upper bound of sequential applications and how close a parallel implementation is to the best parallel performance across five different HPC architectures. Furthermore, AutoMatch's workload distribution scheme achieves approximately 90% of the performance of a profiling-driven oracle.
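An illustrative sketch of performance-proportional workload distribution, assuming a relative per-device cost of one work unit (the kind of information a cost function like AutoMatch's provides): each device receives a share of the iteration space inversely proportional to its cost. The device names and cost values are made up.

def split_work(total_iterations, cost_per_unit):
    # Relative throughput is the inverse of the per-unit cost; each device
    # gets a share of the iteration space proportional to its throughput.
    rates = {dev: 1.0 / c for dev, c in cost_per_unit.items()}
    total_rate = sum(rates.values())
    return {dev: int(round(total_iterations * r / total_rate))
            for dev, r in rates.items()}

# Hypothetical costs: one work unit takes 4x longer on the CPU than the GPU.
print(split_work(1_000_000, {"cpu": 4.0, "gpu": 1.0}))  # {'cpu': 200000, 'gpu': 800000}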
International Midwest Symposium on Circuits and Systems | 2013
Mohamed W. Hassan; Ahmed A. Abouel Farag; Yasser Y. Hanafy
Statically scheduled scientific computing problems represent a large class of problems that require an intensive amount of computation. The common characteristics of this class of problems can be used to optimize an architecture whose utilization exceeds 90% of peak performance. The proposed architecture is an array of reconfigurable NISC (No Instruction Set Computer) processing elements (PEs) connected by a reconfigurable NoC (Network on Chip). An optimized datapath for a group of problems is suggested. The control of each PE is reconfigurable to customize it for each application, as is the NoC. The architecture is simulated as a tile of 64 PEs running the LU decomposition algorithm on a dense matrix, and the results show a performance of 177 GFLOPS, which outperforms the NVIDIA 6800 and 7800 GPU implementations and an OpenMP parallel multicore solution on an Intel Core 2 Quad CPU with four processor cores.
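For reference, a minimal NumPy version of the dense LU factorization (Doolittle, no pivoting) that the PE array executes; this shows only the mathematical kernel, not the NISC/NoC implementation.

import numpy as np

def lu_decompose(a):
    # In-place Doolittle factorization without pivoting: after the loop the
    # strict lower triangle holds L (unit diagonal) and the upper triangle U.
    a = a.astype(np.float64)
    n = a.shape[0]
    for k in range(n - 1):
        a[k+1:, k] /= a[k, k]                              # column of L
        a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])  # trailing update
    return np.tril(a, -1) + np.eye(n), np.triu(a)

A = np.random.rand(64, 64) + 64 * np.eye(64)  # diagonally dominant, safe without pivoting
L, U = lu_decompose(A)
assert np.allclose(L @ U, A)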
High Performance Distributed Computing | 2018
Ahmed E. Helal; Changhee Jung; Wu-chun Feng; Yasser Y. Hanafy
To deliver scalable performance to large-scale scientific and data-analytic applications, HPC cluster architectures adopt the distributed-memory model. The performance and scalability of parallel applications on such systems are limited by the communication cost across compute nodes. Therefore, projecting the minimum communication cost and maximum scalability of user applications plays a critical role in assessing the benefits of porting these applications to HPC clusters as well as in developing efficient distributed-memory implementations. Unfortunately, this task is extremely challenging for end users, as it requires comprehensive knowledge of the target application and hardware architecture and demands significant effort and time for manual system analysis. To streamline the process of porting user applications to HPC clusters, this paper presents CommAnalyzer, an automated framework for estimating the communication cost of a distributed-memory implementation from sequential code. CommAnalyzer uses novel dynamic program analyses and graph algorithms to capture the inherent flow of program values (information) in the sequential code and to estimate the communication incurred when this code is ported to HPC clusters. CommAnalyzer therefore makes it possible to project the efficiency/scalability upper bound (i.e., roofline) of an effective distributed-memory implementation before even developing one. Experiments with real-world regular and irregular HPC applications demonstrate the utility of CommAnalyzer in estimating the minimum communication of sequential applications on HPC clusters. In addition, the optimized MPI+X implementations achieve more than 92% of the efficiency upper bound across the different workloads.
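A toy sketch of the value-flow intuition behind this approach: if an edge records that a value produced by one statement is consumed by another, and statements are mapped to cluster nodes, then the edges crossing node boundaries approximate the values that must be communicated. The graph, partition, and counting scheme here are illustrative, not CommAnalyzer's actual analyses.

# Each edge records that the value produced by statement u is consumed by
# statement v; the partition maps statements to cluster nodes. Edges that
# cross nodes approximate the values that must be communicated.
value_flow_edges = [("s0", "s1"), ("s0", "s2"), ("s1", "s3"), ("s2", "s3")]
partition = {"s0": 0, "s1": 0, "s2": 1, "s3": 1}

cross_node_values = sum(1 for u, v in value_flow_edges if partition[u] != partition[v])
print(cross_node_values)  # 2 values cross the node boundary in this toy example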
Field-Programmable Custom Computing Machines | 2015
Mohamed W. Hassan; Ahmed E. Helal; Yasser Y. Hanafy
Sparse LU solvers are common in many scientific problems. The hardware utilization of previous implementations on massively parallel platforms (including multicores, GPUs, and FPGAs) never exceeded the 20% mark, due to the highly irregular computation and memory access patterns of the algorithm. Reconfigurable fabrics, with their spatial execution model, can expose the maximum inherent parallelism in the problem and achieve the highest hardware utilization. However, dynamic dataflow implementations suffer from large overheads and scalability issues. In this paper, we propose a static, synchronous dataflow model that maximizes the utilization of FPGA-based architectures. The synchronous dataflow graph is mapped to a mesh of deeply pipelined PEs to perform the factorization. This motivates a customized data structure format that reduces memory accesses, indexing overhead, and pipelining hazards. The hardware model is synthesized on a Virtex-7 FPGA, and the results show hardware utilization exceeding 60%, which translates to more than 100 GFLOPS.
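As a point of reference for the data-layout idea, a plain compressed-sparse-row (CSR) packing is sketched below; the paper's customized format is tuned to its pipelined PE mesh and is not reproduced here.

import numpy as np

def to_csr(dense):
    # Pack nonzeros row by row into (values, column indices, row pointers).
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 5.0]])
vals, cols, ptrs = to_csr(A)  # vals=[4,1,3,2,5], cols=[0,2,1,0,2], ptrs=[0,2,3,5]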
International Conference on Computer Science and Information Technology | 2013
Mahmoud Eljammaly; Yasser Y. Hanafy; Abdelmoniem Wahdan; Amr M. Bayoumi
Recent advances in FPGA technology permit the hardware implementation of selected software functions to enhance program performance. Most prior work has been concerned only with integer operations; little effort has addressed floating-point operations. In this paper, we propose a dataflow implementation of LU decomposition on FPGA, together with a modified Kernighan-Lin-based task partitioning and assignment algorithm. The algorithm shows acceptable improvement over existing techniques.
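A heavily simplified, hypothetical sketch in the spirit of Kernighan-Lin partitioning: greedily swap task pairs across the two partitions whenever a swap reduces the number of cut task-graph edges. The paper's modified algorithm, including its assignment step, is not reproduced here.

def cut_size(edges, part):
    # Number of task-graph edges whose endpoints sit in different partitions.
    return sum(1 for u, v in edges if part[u] != part[v])

def refine(edges, part):
    # Greedy pass: swap any cross-partition task pair that reduces the cut,
    # and repeat until no single swap helps.
    improved = True
    while improved:
        improved = False
        tasks = list(part)
        for a in tasks:
            for b in tasks:
                if part[a] == part[b]:
                    continue
                before = cut_size(edges, part)
                part[a], part[b] = part[b], part[a]      # tentative swap
                if cut_size(edges, part) < before:
                    improved = True                      # keep the swap
                else:
                    part[a], part[b] = part[b], part[a]  # undo it
    return part

edges = [("t0", "t1"), ("t1", "t2"), ("t2", "t3"), ("t3", "t0")]
print(refine(edges, {"t0": 0, "t1": 1, "t2": 0, "t3": 1}))  # cut drops from 4 to 2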
National Radio Science Conference | 2011
Amr M. Bayoumi; Yasser Y. Hanafy
As CMOS evolves from 32nm down to 16nm technologies, several technological changes suggest we can more efficiently use linear RC approximations to model the input stages of NMOS and PMOS devices. This reduces the non-linearity of the input stages of analog circuits, such as RF amplifiers and buffers, with respect to input signal voltage levels, thus allowing better matching networks. This linearization is also critical when using fast-SPICE to model the digital parts of mixed-signal RFICs. For physical gate lengths of 32-16nm, the smaller gate area results in a more pronounced role for the overlap capacitance over the source/drain regions, which is independent of voltage. Metal gates have replaced polysilicon, eliminating polysilicon depletion and making the effective gate capacitance less voltage-dependent in inversion. Metal gates also have low resistivity, which reduces the distributed gate resistance effect and makes non-quasi-static characteristics easier to model and more uniform along the channel width. Finally, replacing the thin gate oxide with high dielectric constant (high-k) dielectrics drastically reduces the gate-leakage direct-tunneling current, which is modeled as a parallel conductance with an exponential dependence on the applied gate voltage. In this paper, recently reported device technology features are used to update BSIM4 predictive technology models (PTM). The dependence of the NMOS and PMOS input equivalent circuits on applied bias for 32-16nm gate lengths is simulated using the SPICE circuit simulator.
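A back-of-the-envelope sketch of a linear RC input approximation for a MOSFET gate, the kind of model the paper argues becomes more accurate at these gate lengths. All parameter values are illustrative assumptions, not BSIM4/PTM data.

def gate_input_rc(width_um, length_nm=22.0, cox_fF_per_um2=35.0,
                  cov_fF_per_um=0.4, r_sheet_ohm_sq=5.0):
    # Linear (bias-independent) input model: intrinsic channel capacitance
    # plus source/drain overlap capacitance, driven through the distributed
    # gate resistance of a single-finger gate (~ Rsheet * W / L / 3).
    length_um = length_nm * 1e-3
    c_in = cox_fF_per_um2 * width_um * length_um + 2 * cov_fF_per_um * width_um
    r_gate = r_sheet_ohm_sq * width_um / length_um / 3.0
    return r_gate, c_in

r, c = gate_input_rc(width_um=1.0)
print(f"R_gate ~ {r:.0f} ohm, C_in ~ {c:.2f} fF")  # roughly 76 ohm and 1.57 fF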
Archive | 2017
Ahmed E. Helal; Changhee Jung; Wu-chun Feng; Yasser Y. Hanafy