Ahmad Faraj | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ahmad Faraj is active.

Explore More

Publication

Featured researches published by Ahmad Faraj.

international conference on supercomputing | 2005

Automatic generation and tuning of MPI collective communication routines

Ahmad Faraj; Xin Yuan

In order for collective communication routines to achieve high performance on different platforms, they must be able to adapt to the system architecture and use different algorithms for different situations. Current Message Passing Interface (MPI) implementations, such as MPICH and LAM/MPI, are not fully adaptable to the system architecture and are not able to achieve high performance on many platforms. In this paper, we present a system that produces efficient MPI collective communication routines. By automatically generating topology specific routines and using an empirical approach to select the best implementations, our system adapts to a given platform and constructs routines that are customized for the platform. The experimental results show that the tuned routines consistently achieve high performance on clusters with different network topologies.

international conference on supercomputing | 2006

STAR-MPI: self tuned adaptive routines for MPI collective operations

Ahmad Faraj; Xin Yuan; David K. Lowenthal

Message Passing Interface (MPI) collective communication routines are widely used in parallel applications. In order for a collective communication routine to achieve high performance for different applications on different platforms, it must be adaptable to both the system architecture and the application workload. Current MPI implementations do not support such software adaptability and are not able to achieve high performance on many platforms. In this paper, we present STAR-MPI (Self Tuned Adaptive Routines for MPI collective operations), a set of MPI collective communication routines that are capable of adapting to system architecture and application workload. For each operation, STAR-MPI maintains a set of communication algorithms that can potentially be efficient at different situations. As an application executes, a STAR-MPI routine applies the Automatic Empirical Optimization of Software (AEOS) technique at run time to dynamically select the best performing algorithm for the application on the platform. We describe the techniques used in STAR-MPI, analyze STAR-MPI overheads, and evaluate the performance of STAR-MPI with applications and benchmarks. The results of our study indicate that STAR-MPI is robust and efficient. It is able to and efficient algorithms with reasonable overheads, and it out-performs traditional MPI implementations to a large degree in many cases.

International Journal of Parallel Programming | 2008

A study of process arrival patterns for MPI collective operations

Ahmad Faraj; Pitch Patarasuk; Xin Yuan

Process arrival pattern, which denotes the timing when different processes arrive at an MPI collective operation, can have a significant impact on the performance of the operation. In this work, we characterize the process arrival patterns in a set of MPI programs on two common cluster platforms, use a micro-benchmark to study the process arrival patterns in MPI programs with balanced loads, and investigate the impacts of different process arrival patterns on collective algorithms. Our results show that (1) the differences between the times when different processes arrive at a collective operation are usually sufficiently large to affect the performance; (2) application developers in general cannot effectively control the process arrival patterns in their MPI programs in the cluster environment: balancing loads at the application level does not balance the process arrival patterns; and (3) the performance of collective communication algorithms is sensitive to process arrival patterns. These results indicate that process arrival pattern is an important factor that must be taken into consideration in developing and optimizing MPI collective routines. We propose a scheme that achieves high performance with different process arrival patterns, and demonstrate that by explicitly considering process arrival pattern, more efficient MPI collective routines than the current ones can be obtained.

international conference on supercomputing | 2009

MPI collective communications on the blue gene/p supercomputer: algorithms and optimizations

Ahmad Faraj; Sameer Kumar; Brian E. Smith; Amith R. Mamidala; John A. Gunnels; Philip Heidelberger

The IBM Blue Gene/P (BG/P) system is a massively parallel supercomputer succeeding BG/L, and it comes with many machine design enhancements and new architectural features at the hardware and software levels. This paper presents techniques leveraging such features to deliver high performance MPI collective communication primitives. In particular, we exploit BG/P rich set of network hardware in exploring three classes of collective algorithms: global algorithms on global interrupt and collective networks for MPI COMM WORLD; rectangular algorithms for rectangular communicators on the torus network; and binomial algorithms for irregular communicators over the torus point-to-point network. We also utilize various forms of data movements including the direct memory access (DMA) engine, collective network, and shared memory, to implement synchronous and asynchronous algorithms of different objectives and performance characteristics. Our performance study on BG/P hardware with up to 16K nodes demonstrates the efficiency and scalability of the algorithms and optimizations.

IEEE Transactions on Parallel and Distributed Systems | 2007

A Message Scheduling Scheme for All-to-All Personalized Communication on Ethernet Switched Clusters

Ahmad Faraj; Xin Yuan; Pitch Patarasuk

We develop a message scheduling scheme for efficiently realizing all-to-all personalized communication (AAPC) on Ethernet switched clusters with one or more switches. To avoid network contention and achieve high performance, the message scheduling scheme partitions AAPC into phases such that 1) there is no network contention within each phase and 2) the number of phases is minimum. Thus, realizing AAPC with the contention-free phases computed by the message scheduling algorithm can potentially achieve the minimum communication completion time. In practice, phased AAPC schemes must introduce synchronizations to separate messages in different phases. We investigate various synchronization mechanisms and various methods for incorporating synchronizations into the AAPC phases. Experimental results show that the message scheduling-based AAPC implementations with proper synchronization consistently achieve high performance on clusters with many different network topologies when the message size is large

international parallel and distributed processing symposium | 2006

Pipelined broadcast on Ethernet switched clusters

Pitch Patarasuk; Ahmad Faraj; Xin Yuan

We consider unicast-based pipelined broadcast schemes for clusters connected by multiple Ethernet switches. By splitting a large broadcast message into segments and broadcasting the segments in a pipelined fashion, pipelined broadcast may achieve very high performance. We develop algorithms for computing various contention-free broadcast trees on Ethernet switched clusters that are suitable for pipelined broadcast, and evaluate the schemes through experimentation. The conclusions drawn from our theoretical and experimental study include the following. First, pipelined broadcast can be more effective than other common broadcast schemes including the ones used in the latest versions of MPICH and LAM/MPI when the message size is sufficiently large. Second, contention-free broadcast trees are essential for pipelined broadcast to achieve high performance. Finally, while it is difficult to determine the optimal message segment size for pipelined broadcast, finding one size that gives good performance is relatively easy

international conference on cluster computing | 2005

Bandwidth Efficient All-to-All Broadcast on Switched Clusters

Ahmad Faraj; Pitch Patarasuk; Xin Yuan

Clusters of workstations employ flexible topologies: regular, irregular, and hierarchical topologies have been used in such systems. The flexibility poses challenges for developing efficient collective communication algorithms since the network topology can potentially have a strong impact on the communication performance. In this paper, we consider the all-to-all broadcast operation on clusters with cut-through and store-and-forward switches. We show that near-optimal all-to-all broadcast on a cluster with any topology can be achieved by only using the links in a spanning tree of the topology when the message size is sufficiently large. The result implies that increasing network connectivity beyond the minimum tree connectivity does not improve the performance of the all-to-all broadcast operation when the most efficient topology specific algorithm is used. All-to-all broadcast algorithms that achieve near-optimal performance are developed for clusters with cut-through and clusters with store-and-forward switches. We evaluate the algorithms through experiments and simulations. The empirical results confirm our theoretical finding.

international parallel and distributed processing symposium | 2005

Message scheduling for all-to-all personalized communication on ethernet switched clusters

Ahmad Faraj; Xin Yuan

We develop a message scheduling scheme that can theoretically achieve the maximum throughput for all-to-all personalized communication (AAPC) on any given Ethernet switched cluster. Based on the scheduling scheme, we implement an automatic routine generator that takes the topology information as input and produces a customized MPI/spl I.bar/Alltoall routine, a routine in the Message Passing Interface (MPI) standard that realizes AAPC. Experimental results show that the automatically generated routine consistently out-performs other MPLAlltoall algorithms, including those in LAM/MPI and MPICH, on Ethernet switched clusters with different network topologies when the message size is sufficiently large. This demonstrates the superiority of the proposed AAPC algorithm in exploiting network band-widths.

international conference on parallel processing | 2005

An empirical approach for efficient all-to-all personalized communication on Ethernet switched clusters

Ahmad Faraj; Xin Yuan

All-to-all personalized communication (AAPC) is one of the most commonly used communication patterns in parallel applications. Developing an efficient AAPC routine is difficult since many system parameters can affect the performance of an AAPC algorithm. In this paper, we investigate an empirical approach for automatically generating efficient AAPC routines for Ethernet switched clusters. This approach applies when the application execution environment is decided, and it allows efficient customized AAPC routines to be created. Experimental results show that the empirical approach generates routines that consistently achieve high performance on clusters with different network topologies. In many cases, the automatically generated routines out-perform conventional AAPC implementations to a large degree.

Journal of Parallel and Distributed Computing | 2008

Techniques for pipelined broadcast on ethernet switched clusters

Pitch Patarasuk; Xin Yuan; Ahmad Faraj

By splitting a large broadcast message into segments and broadcasting the segments in a pipelined fashion, pipelined broadcast can achieve high performance in many systems. In this paper, we investigate techniques for efficient pipelined broadcast on clusters connected by multiple Ethernet switches. Specifically, we develop algorithms for computing various contention-free broadcast trees that are suitable for pipelined broadcast on Ethernet switched clusters, extend the parametrized LogP model for predicting appropriate segment sizes for pipelined broadcast, show that the segment sizes computed based on the model yield high performance, and evaluate various pipelined broadcast schemes through experimentation on Ethernet switched clusters with various topologies. The results demonstrate that our techniques are practical and efficient for contemporary fast Ethernet and Giga-bit Ethernet clusters.

Explore More