Aamir Shafi
National University of Sciences and Technology
Publications
Featured research published by Aamir Shafi.
International Conference on Cluster Computing | 2006
Mark Baker; Bryan Carpenter; Aamir Shafi
MPJ Express is a thread-safe Java messaging library that provides a full implementation of the mpiJava 1.2 API specification. This specification defines MPI-like bindings for the Java language. We have implemented two communication devices as part of our library: the first, called niodev, is based on the Java New I/O package, and the second, called mxdev, is based on the Myrinet eXpress library. MPJ Express comes with an experimental runtime, which allows portable bootstrapping of Java Virtual Machines across a cluster or network of computers. In this paper we describe the implementation of MPJ Express and present a performance comparison against various other C and Java messaging systems. A beta version of MPJ Express was released in September 2005.
Journal of Parallel and Distributed Computing | 2009
Aamir Shafi; Bryan Carpenter; Mark Baker
Since its introduction in 1993, the Message Passing Interface (MPI) has become a de facto standard for writing High Performance Computing (HPC) applications on clusters and Massively Parallel Processors (MPPs). The recent emergence of multi-core processor systems presents a new challenge for established parallel programming paradigms, including those based on MPI. This paper presents a new Java messaging system called MPJ Express. Using this system, we exploit multiple levels of parallelism (messaging and threading) to improve application performance on multi-core processors. We refer to our approach as nested parallelism. This MPI-like Java library can support nested parallelism by using Java or Java OpenMP (JOMP) threads within an MPJ Express process. The practicality of this approach is assessed by porting to Java a massively parallel structure formation code from cosmology called Gadget-2. We introduce nested parallelism in the Java version of the simulation code and report good speed-ups. To the best of our knowledge, this is the first time this kind of hybrid parallelism has been demonstrated in a high performance Java application.
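The threading level of this nested-parallelism scheme can be sketched in plain Java. The class and method names below are illustrative, not taken from the MPJ Express or Gadget-2 sources: several Java threads inside what would be a single MPJ Express process sum disjoint slices of an array and combine their partial results, leaving the messaging level out of the picture.

```java
import java.util.concurrent.atomic.AtomicLong;

public class NestedParallelismSketch {
    // Threading level only: inside one (hypothetical) MPJ Express process,
    // nThreads Java threads sum disjoint slices of data and combine results.
    static long parallelSum(long[] data, int nThreads) throws InterruptedException {
        AtomicLong total = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        int chunk = data.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? data.length : lo + chunk;
            workers[t] = new Thread(() -> {
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                total.addAndGet(sum);  // combine this thread's partial result
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data, 4));  // 100000*99999/2 = 4999950000
    }
}
```

In the hybrid scheme described by the paper, the messaging level would sit above this: each MPJ Express process would exchange such partial results with its peers.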
International Conference on Communications | 2013
Syed I. A. Shah; Jannet Faiz; Maham Farooq; Aamir Shafi; Syed Akbar Mehdi
With the recent interest in Software Defined Networking, many OpenFlow controllers have been released for research and commercial use. However, little public knowledge exists about the architectural choices that allow one controller to outperform another in production environments. In this paper, we aim to identify key performance bottlenecks and good architectural choices for designing OpenFlow-based SDN controllers. With this aim in mind, we evaluate the performance of four prominent open-source OpenFlow controllers: NOX [1], Beacon [2], Maestro [3] and Floodlight [4]. Since these controllers support multi-threading, we deploy them on shared memory multicore machines and benchmark their key architectural components under different metrics, including thread scalability, switch scalability and latency, in a custom cluster testbed. Our results lead to important architectural guidelines that can be used to improve the scalability of existing controllers or to design new ones. We follow these guidelines to implement an OpenFlow controller which outperforms existing controllers on assorted scalability metrics.
Principles and Practice of Programming in Java | 2010
Aamir Shafi; Jawad Manzoor; Kamran Hameed; Bryan Carpenter; Mark Baker
With the transition to multicore processors almost complete, the parallel processing community is seeking efficient ways to port legacy message passing applications to shared memory and multicore processors. MPJ Express is our reference implementation of Message Passing Interface (MPI)-like bindings for the Java language. Starting with the current release, the MPJ Express software can be configured in two modes: the multicore mode and the cluster mode. In the multicore mode, parallel Java applications execute on shared memory or multicore processors. In the cluster mode, Java applications parallelized using MPJ Express can be executed on distributed memory platforms like compute clusters and clouds. The multicore device has been implemented using Java threads in order to satisfy the two main design goals of portability and performance. We also discuss the challenges of integrating the multicore device into the MPJ Express software. This turned out to be a challenging task because in the multicore mode the parallel application executes in a single JVM, whereas in the cluster mode the parallel user application executes in multiple JVMs. Due to these inherent architectural differences between the two modes, the MPJ Express runtime is modified to ensure correct semantics of the parallel program. Towards the end, we compare the performance of MPJ Express (multicore mode) with other C and Java message passing libraries, including mpiJava, MPJ/Ibis, MPICH2, and MPJ Express (cluster mode), on shared memory and multicore processors. We found that MPJ Express performs significantly better in the multicore mode than in the cluster mode. The MPJ Express software also outperforms other Java messaging libraries, including mpiJava and MPJ/Ibis, when used in the multicore mode on shared memory or multicore processors.
We also demonstrate the effectiveness of the MPJ Express multicore device in Gadget-2, which is a massively parallel astrophysics N-body simulation code.
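The two configuration modes are typically selected at launch time. The commands below are a sketch based on the MPJ Express launcher (`mpjrun.sh`) and its `-np`/`-dev` flags; the application name `HelloWorld` is a placeholder, and the exact flags should be checked against the documentation of the installed MPJ Express version.

```shell
# Multicore mode: 4 ranks run as Java threads inside a single JVM.
mpjrun.sh -np 4 -dev multicore HelloWorld

# Cluster mode: 4 ranks run as separate JVMs communicating over the NIO device.
mpjrun.sh -np 4 -dev niodev HelloWorld
```

The same compiled application runs in either mode, which is the "no modification" design goal the paper emphasizes.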
International Conference on Computational Science | 2006
Mark Baker; Bryan Carpenter; Aamir Shafi
One of the most challenging aspects of designing a Java messaging system for HPC is the intermediate buffering layer. The lower and higher levels of the messaging software use this buffering layer to write and read messages. The Java New I/O package adds the concept of direct buffers, which, coupled with a memory management algorithm, opens the possibility of efficiently implementing this buffering layer. In this paper, we present our buffering strategy, which is developed to support efficient communications and derived datatypes in MPJ Express, our implementation of the Java MPI API. We evaluate the performance of our buffering layer and demonstrate the usefulness of direct byte buffers.
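The core idea of such a buffering layer can be sketched with `java.nio` direct buffers. This is not the MPJ Express buffering code; it is a minimal illustration, with hypothetical names, of packing a message into a direct `ByteBuffer` on the write side and unpacking it on the read side.

```java
import java.nio.ByteBuffer;

public class BufferingSketch {
    // Pack int values into a direct buffer (as a sending layer might),
    // then read them back (as the receiving layer would).
    static int[] roundTrip(int[] values) {
        // Direct buffers live outside the Java heap, so the OS / native
        // communication layer can read them without an extra copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * Integer.BYTES);
        for (int v : values) buf.putInt(v);   // write side packs the message
        buf.flip();                           // switch from writing to reading
        int[] out = new int[values.length];
        for (int i = 0; i < out.length; i++) out[i] = buf.getInt();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(roundTrip(new int[]{1, 2, 3, 4})));
        // prints [1, 2, 3, 4]
    }
}
```

A real implementation would additionally pool and reuse these buffers via a memory management algorithm, since allocating direct buffers is comparatively expensive.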
International Parallel and Distributed Processing Symposium | 2009
Aamir Shafi; Jawad Manzoor
The need to increase performance while conserving energy led to the emergence of multi-core processors. These processors offer a feasible way to improve the performance of software applications by increasing the number of cores, instead of relying on the increased clock speed of a single core. The uptake of multi-core processors by hardware vendors presents a variety of challenges to the software community. In this context, it is important that messaging libraries based on the Message Passing Interface (MPI) standard support efficient inter-core communication. The processing cores of today's commercial multi-core processors typically share the main memory, so it is vital to develop communication devices that exploit this. MPJ Express is our implementation of MPI-like Java bindings. The software has mainly supported communication with two devices: the first based on Java New I/O (NIO) and the second based on Myrinet. In this paper, we present two shared memory implementations that provide efficient communication on multi-core and SMP clusters. The first implementation is pure Java and uses Java threads to exploit multiple cores; each Java thread represents an MPI-level OS process, and communication between these threads is achieved using shared data structures. The second implementation is based on the System V (SysV) IPC API. Our goal is to achieve better communication performance than the existing devices based on the Transmission Control Protocol (TCP) and Myrinet on SMP and multi-core platforms. Another design goal is that existing parallel applications must not be modified, relieving application developers of the extra effort of porting their applications to such modern clusters. We have benchmarked our implementations and report that the threads-based device performs best on an Intel quad-core Xeon cluster.
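The threads-as-ranks idea, where each Java thread plays the role of one MPI process and messages travel through shared data structures rather than across address spaces, can be sketched as follows. The class and helper names are hypothetical, and a `BlockingQueue` per rank stands in for whatever shared structures the actual device uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SharedMemorySketch {
    // Each thread plays the role of one MPI rank; rank r's "inbox" is queues.get(r).
    static final List<BlockingQueue<int[]>> queues = new ArrayList<>();

    static void init(int ranks) {
        queues.clear();
        for (int i = 0; i < ranks; i++) queues.add(new LinkedBlockingQueue<>());
    }

    static void send(int dest, int[] msg) throws InterruptedException {
        queues.get(dest).put(msg);        // hand over a reference; no inter-process copy
    }

    static int[] recv(int self) throws InterruptedException {
        return queues.get(self).take();   // blocks until a message arrives
    }

    public static void main(String[] args) throws InterruptedException {
        init(2);
        Thread rank1 = new Thread(() -> {
            try {
                send(0, new int[]{42});   // "rank 1" sends to "rank 0"
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        rank1.start();
        int[] got = recv(0);              // "rank 0" receives
        rank1.join();
        System.out.println(got[0]);       // prints 42
    }
}
```

Because both ranks share one address space, delivery here is a queue hand-off; a real device must still define copy-in/copy-out semantics so that user buffers behave as MPI requires.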
ACM Symposium on Parallel Algorithms and Architectures | 2012
I-Ting Angelina Lee; Aamir Shafi; Charles E. Leiserson
Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach in which a hash table is used to map reducer objects to their local views. The overhead of the hash table is costly: roughly 12x that of a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system, which uses TLMM to implement a reducer mechanism that supports a reducer lookup using only two memory accesses and a predictable branch, roughly a 3x overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4x faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.
Lecture Notes in Computer Science | 2006
Mark Baker; Bryan Carpenter; Aamir Shafi
Gadget-2 is a massively parallel structure formation code for cosmological simulations. In this paper, we present a Java version of Gadget-2. We evaluated the performance of the Java version by running a colliding-galaxies simulation and found that it achieves around 70% of the performance of the original C Gadget-2.
Concurrency and Computation: Practice and Experience | 2011
Guillermo L. Taboada; Juan Touriño; Ramón Doallo; Aamir Shafi; Mark Baker; Bryan Carpenter
Since its release, the Java programming language has attracted considerable attention from the high-performance computing (HPC) community because of its portability, high programming productivity, and built-in multithreading and networking support. As a consequence, several initiatives have been taken to develop a high-performance Java message-passing library to program distributed memory architectures, such as clusters. The performance of Java message-passing applications relies heavily on the communications performance. Thus, the design and implementation of low-level communication devices that support message-passing libraries is an important research issue in Java for HPC. MPJ Express is our Java message-passing implementation for developing high-performance parallel Java applications. Its public release currently contains three communication devices: the first is built using the Java New Input/Output (NIO) package for TCP/IP; the second is specifically designed for the Myrinet Express library on Myrinet; and the third supports thread-based shared memory communications. Although these devices have been successfully deployed in many production environments, previous performance evaluations of MPJ Express suggest that the buffering layer, tightly coupled with these devices, incurs a certain degree of copying overhead, which represents one of the main performance penalties. This paper presents a more efficient Java message-passing communications device, based on Java Input/Output sockets, that avoids this buffering overhead. Moreover, this device implements several strategies, both in the communication protocol and in the HPC hardware support, that optimize Java message-passing communications.
To evaluate its benefits, this paper compares the performance of this device with other Java and native message-passing libraries on various high-speed networks, such as Gigabit Ethernet, Scalable Coherent Interface, Myrinet, and InfiniBand, as well as in a shared memory multicore scenario. The reported reduction in communication overhead encourages the upcoming incorporation of this device into MPJ Express (http://mpj-express.org).
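The key contrast with the buffering layer described above is that a socket-based device can write message elements straight onto the stream instead of staging them in an intermediate buffer first. The sketch below, with hypothetical names and a loopback connection standing in for the network, illustrates that streaming style using plain `java.io` sockets.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketDeviceSketch {
    // Send an int[] over a loopback socket by streaming elements directly,
    // then receive and reassemble it on the other side.
    static int[] loopbackRoundTrip(int[] msg) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {   // any free port
            Thread sender = new Thread(() -> {
                try (Socket s = new Socket("127.0.0.1", server.getLocalPort());
                     DataOutputStream out = new DataOutputStream(s.getOutputStream())) {
                    out.writeInt(msg.length);               // small header: element count
                    for (int v : msg) out.writeInt(v);      // element-by-element, no staging copy
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            sender.start();
            try (Socket s = server.accept();
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                int[] out = new int[in.readInt()];
                for (int i = 0; i < out.length; i++) out[i] = in.readInt();
                sender.join();
                return out;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(java.util.Arrays.toString(loopbackRoundTrip(new int[]{1, 2, 3})));
        // prints [1, 2, 3]
    }
}
```

The actual device in the paper layers protocol optimizations and hardware support on top of this basic idea; the point of the sketch is only that nothing is packed into a separate buffer before hitting the stream.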
International Parallel and Distributed Processing Symposium | 2008
Aamir Shafi; Aftab Hussain; Jamil Raza
This paper presents and evaluates a parallel Java implementation of the Finite-Difference Time-Domain (FDTD) method, a widely used numerical technique in computational electrodynamics. The Java version is parallelized using MPJ Express, a thread-safe messaging library. MPJ Express provides a full implementation of the mpiJava 1.2 API specification, which defines an MPI-like binding for the Java language. This paper describes our experiences of implementing the Java version of the FDTD method. Towards the end of this paper, we evaluate and compare the performance of the Java version against its C counterpart on a 32-core Linux cluster of eight compute nodes.