Publication


Featured research published by Mohammad Zubair.


ACM Transactions on Mathematical Software | 2010

Cache-optimal algorithms for option pricing

John E. Savage; Mohammad Zubair

Today's computers have several levels of memory hierarchy. To obtain good performance on these processors, it is necessary to design algorithms that minimize I/O traffic to the slower memories in the hierarchy. In this article, we study the computation of option pricing using the binomial and trinomial models on processors with a multilevel memory hierarchy. We derive lower bounds on memory traffic between different levels of the hierarchy for these two models. We also develop algorithms for the binomial and trinomial models that have near-optimal memory traffic between levels. We have implemented these algorithms on an UltraSPARC IIIi processor with a four-level memory hierarchy and demonstrated that our algorithms outperform algorithms without cache blocking by a factor of up to 5 and operate at 70% of peak performance.
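
For reference, below is a minimal, non-blocked binomial pricing routine in Python of the kind whose memory traffic the paper optimizes; the cache-blocked variants and the lower-bound analysis are not reproduced here, and the CRR parameterization and example inputs are illustrative choices rather than the paper's setup.

import math

def binomial_european_call(S0, K, r, sigma, T, n):
    """Plain (non-blocked) CRR binomial pricing of a European call.

    The computation sweeps the lattice backward one level at a time,
    which is exactly the access pattern the cache-blocked algorithms reorganize.
    """
    dt = T / n
    u = math.exp(sigma * math.sqrt(dt))      # up factor
    d = 1.0 / u                              # down factor
    p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability
    disc = math.exp(-r * dt)                 # one-step discount factor

    # Option values at maturity (level n of the lattice).
    values = [max(S0 * (u ** j) * (d ** (n - j)) - K, 0.0) for j in range(n + 1)]

    # Backward induction: each level reads only the level above it.
    for level in range(n - 1, -1, -1):
        values = [disc * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(level + 1)]
    return values[0]

# Example: S0=100, K=100, r=5%, sigma=20%, T=1 year, 1000 steps.
print(binomial_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 1000))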


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

High Performance Implementation of an Econometrics and Financial Application on GPUs

Michael Creel; Mohammad Zubair

In this paper, we describe a GPU-based implementation of an estimator based on an indirect likelihood inference method. This method relies on simulations from a model and on nonparametric density or regression function computations. The estimation application arises in various domains, such as econometrics and finance, when the model is fully specified but too complex for estimation by maximum likelihood. We implemented the estimator on a machine with two 2.67 GHz Intel Xeon X5650 processors and four NVIDIA M2090 GPU devices. We optimized the GPU code by efficient use of the shared memory and registers available on the GPU devices. We compared the performance of the optimized GPU code with a C-based sequential version of the code executed on the host machine. We observed a speedup factor of up to 242 with four GPU devices.
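
The statistical core of such an estimator can be sketched serially as: draw parameters, simulate the model, and recover the parameter by nonparametric (kernel) regression of the draws on a summary statistic. The toy AR(1) model, the Gaussian kernel, and all function names below are illustrative assumptions, not the paper's application or its GPU kernels.

import numpy as np

rng = np.random.default_rng(0)

def simulate_model(theta, n=200):
    """Hypothetical toy model: AR(1) with coefficient theta (illustrative only)."""
    x = np.zeros(n)
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = theta * x[t - 1] + eps[t]
    return x

def summary_stat(x):
    """Summary statistic: lag-1 autocorrelation."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

def indirect_estimate(stat_obs, n_sim=2000, bandwidth=0.05):
    """Nadaraya-Watson kernel regression of theta on the simulated statistic,
    evaluated at the observed statistic."""
    thetas = rng.uniform(-0.9, 0.9, n_sim)                    # parameter draws
    stats = np.array([summary_stat(simulate_model(t)) for t in thetas])
    w = np.exp(-0.5 * ((stats - stat_obs) / bandwidth) ** 2)  # Gaussian kernel weights
    return np.sum(w * thetas) / np.sum(w)

x_obs = simulate_model(0.5)
print(indirect_estimate(summary_stat(x_obs)))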


International Conference on High Performance Computing and Simulation | 2010

An efficient multicore implementation of planted motif problem

Naga Shailaja Dasari; Desh Ranjan; Mohammad Zubair

In this paper we propose a parallel algorithm for the planted motif problem that arises in computational biology. A variety of algorithms have been proposed in the literature to solve this problem. The drawback of all these algorithms is that they were designed for serial computers and are not suitable for parallelization on current multicore architectures. We have implemented the proposed algorithm on a system with four quad-core Intel Xeon X5550 2.67 GHz processors, for a total of 16 cores. We compare our performance results with the best results reported in the literature and show that the performance of our algorithm scales linearly with the number of cores. We also solved the challenging (21, 8) instance on 16 cores in 6.9 hours.
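
For concreteness, a brute-force serial formulation of the (l, d) planted motif problem is sketched below: enumerate the d-neighborhoods of the first sequence's l-mers and keep candidates that occur in every sequence with at most d mismatches. This is only a baseline statement of the problem; the paper's parallel algorithm is not reproduced, and the helper names and toy instance are illustrative.

from itertools import combinations, product

ALPHABET = "ACGT"

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def neighbors(kmer, d):
    """All strings within Hamming distance at most d of kmer."""
    result = set()
    for positions in combinations(range(len(kmer)), d):
        for letters in product(ALPHABET, repeat=d):
            cand = list(kmer)
            for pos, ch in zip(positions, letters):
                cand[pos] = ch
            result.add("".join(cand))
    return result

def planted_motifs(sequences, l, d):
    """All (l, d) motifs: length-l strings occurring in every sequence
    with at most d mismatches. Candidates are the d-neighborhoods of the
    first sequence's l-mers."""
    candidates = set()
    first = sequences[0]
    for i in range(len(first) - l + 1):
        candidates |= neighbors(first[i:i + l], d)

    def occurs(motif, seq):
        return any(hamming(motif, seq[i:i + l]) <= d
                   for i in range(len(seq) - l + 1))

    return {m for m in candidates if all(occurs(m, s) for s in sequences[1:])}

# Toy instance: "TACGT" occurs in every sequence with at most one mismatch.
seqs = ["TTACGTAGG", "CCTACGTAT", "GGTACCTAC"]
print(planted_motifs(seqs, 5, 1))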


International Conference on Big Data | 2014

ParK: An efficient algorithm for k-core decomposition on multicore processors

Naga Shailaja Dasari; Desh Ranjan; Mohammad Zubair

The k-core of a graph is the largest induced subgraph with minimum degree k. The k-core decomposition problem is to find the core number of each vertex in a graph, which is the largest value of k for which the vertex belongs to a k-core. k-core decomposition has applications in many areas, including network analysis, computational biology and graph visualization. The primary reason for its wide use is the availability of an O(n + m) algorithm. The algorithm was proposed by Batagelj and Zaversnik and is considered the state-of-the-art algorithm for k-core decomposition. However, the algorithm is not suitable for parallelization, and to the best of our knowledge no algorithm has been proposed for k-core decomposition on multicore processors. Also, the algorithm has not been experimentally analyzed for large graphs. Since the working set size of the algorithm is large and its access pattern is highly random, it can be inefficient for large graphs. In this paper, we present an experimental analysis of the algorithm of Batagelj and Zaversnik and propose a new algorithm, ParK, that significantly reduces the working set size and minimizes random accesses. We provide an experimental analysis using graphs with up to 65 million vertices and 1.8 billion edges. We compare ParK with the state-of-the-art algorithm and show that it is up to 6 times faster. We also provide a parallel methodology and show that the algorithm is amenable to parallelization on multicore architectures. We ran our experiments on a four-socket Nehalem-EX system with 8 cores per socket and show that the algorithm achieves a speedup of up to 21 using 32 cores.
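
As background, the bucket-based peeling idea behind the Batagelj-Zaversnik baseline can be rendered compactly in Python as below; ParK itself (the reduced working set and the parallel schedule) is not reproduced, and the adjacency-dict graph representation is an illustrative choice.

def core_numbers(adj):
    """Core number of every vertex via bucket-based peeling
    (the O(n + m) idea of Batagelj and Zaversnik).

    adj: dict mapping vertex -> iterable of neighbours (undirected graph).
    Returns: dict mapping vertex -> core number.
    """
    degree = {v: len(adj[v]) for v in adj}
    max_deg = max(degree.values(), default=0)

    # Bucket vertices by current degree.
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)

    core = {}
    removed = set()
    for d in range(max_deg + 1):
        # Repeatedly peel vertices whose current degree is exactly d.
        while buckets[d]:
            v = buckets[d].pop()
            core[v] = d
            removed.add(v)
            for u in adj[v]:
                if u in removed or degree[u] <= d:
                    continue
                buckets[degree[u]].discard(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return core

# Small example: a triangle (0, 1, 2) with a pendant vertex 3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(core_numbers(graph))  # vertex 3 has core number 1; vertices 0, 1, 2 have core number 2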


Scientific Programming | 2009

Evaluating multicore algorithms on the unified memory model

John E. Savage; Mohammad Zubair

One of the challenges to achieving good performance on multicore architectures is the effective utilization of the underlying memory hierarchy. While this is an issue for single-core architectures, it is a critical problem for multicore chips. In this paper, we formulate the unified multicore model (UMM) to help understand the fundamental limits on cache performance on these architectures. The UMM seamlessly handles different types of multicore processors with varying degrees of cache sharing at different levels. We demonstrate that our model can be used to study a variety of multicore architectures on a variety of applications. In particular, we use it to analyze an option pricing problem using the trinomial model and develop an algorithm for it that has near-optimal memory traffic between cache levels. We have implemented the algorithm on a system with two quad-core Intel Xeon 5310 1.6 GHz processors (8 cores). It achieves a peak performance of 19.5 GFLOPS, which is 38% of the theoretical peak of the multicore system. We demonstrate that our algorithm outperforms compiler-optimized and auto-parallelized code by a factor of up to 7.5.
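
The data dependence that such cache analyses reason about is the three-point stencil of trinomial backward induction, sketched below in a plain, non-blocked form. The branch probabilities and discount factor are left as inputs, and neither the UMM analysis nor the paper's blocked schedule is reproduced; the toy usage values are made up.

def trinomial_backward_induction(terminal_values, pu, pm, pd, disc):
    """Backward induction on a recombining trinomial lattice.

    terminal_values: option values at the last time step (2n+1 entries for n steps).
    pu, pm, pd: up/middle/down branch probabilities (pu + pm + pd == 1).
    disc: one-step discount factor.

    Node j at time t depends on nodes j, j+1, j+2 at time t+1 -- the
    three-point stencil whose memory traffic the blocked algorithms optimize.
    """
    values = list(terminal_values)
    n = (len(values) - 1) // 2            # number of time steps
    for level in range(n - 1, -1, -1):
        values = [disc * (pd * values[j] + pm * values[j + 1] + pu * values[j + 2])
                  for j in range(2 * level + 1)]
    return values[0]

# Toy usage with made-up lattice parameters (not calibrated to any market model).
payoff = [max(s - 100.0, 0.0) for s in range(80, 121, 2)]   # 21 terminal nodes, n = 10
print(trinomial_backward_induction(payoff, 0.3, 0.4, 0.3, 0.999))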


Computing and Combinatorics Conference | 2011

Strong I/O lower bounds for binomial and FFT computation graphs

Desh Ranjan; John E. Savage; Mohammad Zubair

Processors in most modern computing devices have several levels of memory hierarchy. To obtain good performance on these processors it is necessary to design algorithms that minimize I/O traffic to the slower memories in the hierarchy. In this paper, we propose a new technique, the boundary flow technique, for deriving lower bounds on the memory traffic complexity of problems on two-level memory hierarchy architectures. The boundary flow technique relies on identifying sub-computation structures, corresponding to equal amounts of computation, with a minimum number of boundary vertices, which in turn is related to the vertex isoperimetric parameter of a computation graph. We demonstrate that this technique yields stronger lower bounds on memory traffic for memory hierarchy architectures for two well-known computation structures: binomial computation graphs and FFT computation graphs.
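
Schematically, boundary-size bounds turn into memory-traffic bounds through a phase argument of the following generic form; the constants and the sharper boundary flow derivation of the paper are not reproduced, and the notation below is illustrative.

% Schematic boundary-based memory-traffic lower bound (generic form only).
% Assume a fast memory of S words and a computation graph G = (V, E) in which
% every sub-computation that evaluates T vertices has at least B(T) boundary
% vertices, with B non-decreasing.
%
% Split any execution into consecutive phases containing exactly S I/O
% operations each. In one phase only c*S distinct boundary vertices can be
% served, for a small constant c (values already resident plus values moved
% across the hierarchy), so a phase can complete at most
%   T_{\max} = \max \{ T : B(T) \le c\,S \}
% vertices. The total memory traffic Q therefore satisfies
%   Q \;\ge\; S \left( \left\lfloor \frac{|V|}{T_{\max}} \right\rfloor - 1 \right).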


International Conference on Parallel Processing | 2013

An Efficient Deterministic Parallel Algorithm for Adaptive Multidimensional Numerical Integration on GPUs

Kamesh Arumugam; Alexander Godunov; Desh Ranjan; Balsa Terzic; Mohammad Zubair

Recent developments in Graphics Processing Units (GPUs) have enabled new possibilities for highly efficient parallel computing in science and engineering. Their massively parallel architecture makes GPUs very effective for algorithms where processing of large blocks of data can be executed in parallel. Multidimensional integration has important applications in areas such as computational physics, plasma physics, computational fluid dynamics, quantum chemistry, molecular dynamics and signal processing. The computationally intensive nature of multidimensional integration requires a high-performance implementation. In this study, we present an efficient deterministic parallel algorithm for adaptive multidimensional numerical integration on GPUs. Various optimization techniques are applied to maximize the utilization of the GPU. Our GPU-based implementation outperforms the best known sequential methods and achieves a speedup of up to 100. It also shows good scalability as the dimensionality increases.
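
A minimal serial sketch of adaptive subdivision, the idea the GPU algorithm parallelizes, is shown below: compare a coarse estimate over a hyperrectangle with the sum over its two halves and recurse where they disagree. The error rule, splitting strategy, and tolerances are illustrative; the paper's deterministic GPU algorithm is not reproduced.

import numpy as np

def adaptive_integrate(f, lo, hi, tol=1e-6, depth=0, max_depth=30):
    """Recursive adaptive integration of f over the hyperrectangle [lo, hi].

    Error is estimated by comparing a one-point midpoint estimate with the
    sum of midpoint estimates over the two halves of the widest dimension;
    regions whose estimates disagree are subdivided further.
    """
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    coarse = f((lo + hi) / 2.0) * np.prod(hi - lo)

    # Split along the widest dimension.
    dim = int(np.argmax(hi - lo))
    mid = (lo[dim] + hi[dim]) / 2.0
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[dim] = mid
    lo_right[dim] = mid
    fine = (f((lo + hi_left) / 2.0) * np.prod(hi_left - lo)
            + f((lo_right + hi) / 2.0) * np.prod(hi - lo_right))

    if abs(fine - coarse) < tol or depth >= max_depth:
        return fine
    return (adaptive_integrate(f, lo, hi_left, tol / 2.0, depth + 1, max_depth)
            + adaptive_integrate(f, lo_right, hi, tol / 2.0, depth + 1, max_depth))

# Example: integrate exp(-(x^2 + y^2)) over [0, 1] x [0, 1].
print(adaptive_integrate(lambda p: np.exp(-np.sum(p * p)), [0.0, 0.0], [1.0, 1.0]))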


International Journal of Foundations of Computer Science | 2012

Vertex Isoperimetric Parameter of a Computation Graph

Desh Ranjan; Mohammad Zubair

Let G = (V, E) be a computation graph, i.e., a directed graph representing a straight-line computation, and let S ⊂ V. We say a vertex v is an input vertex for S if there is an edge (v, u) such that v ∉ S and u ∈ S. We say a vertex u is an output vertex for S if there is an edge (u, v) such that u ∈ S and v ∉ S. A vertex is called a boundary vertex for a set S if it is either an input vertex or an output vertex for S. We consider the problem of determining the minimum boundary size of S over all sets of size M in an infinite directed grid. This problem is related to the vertex isoperimetric parameter of a graph, and is motivated by the need to derive a lower bound on memory traffic for a computation graph representing a financial application. We first extend the notion of the vertex isoperimetric parameter from undirected graphs to computation graphs, and then provide a complete solution for this problem for all M. In particular, we show that for a set S of M = 3k² + 3k + 1 vertices of an infinite directed grid, the boundary size must be at least 6k + 3, and this minimum is attained when the vertices in S are arranged in a regular hexagonal shape with side k + 1.
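
The definitions and the stated bound can be summarized in LaTeX-style notation as follows; the symbol δ(M, G) for the directed-grid isoperimetric parameter is an illustrative choice of notation, not the paper's.

% Definitions from the abstract, restated.
% For a computation graph G = (V, E) and S \subset V:
%   v is an input vertex of S  if there is an edge (v, u) with v \notin S, u \in S;
%   u is an output vertex of S if there is an edge (u, v) with u \in S, v \notin S;
%   \partial S = \{\text{input vertices of } S\} \cup \{\text{output vertices of } S\}.
% The vertex isoperimetric parameter at size M is
%   \delta(M, G) = \min_{S \subset V,\ |S| = M} |\partial S|.
% For the infinite directed grid and M = 3k^2 + 3k + 1, the paper shows
%   \delta(M, G) = 6k + 3,
% attained when S is arranged as a regular hexagon of side k + 1.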


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

A memory efficient algorithm for adaptive multidimensional integration with multiple GPUs

Kamesh Arumugam; Alexander Godunov; Desh Ranjan; Balsa Terzic; Mohammad Zubair

We present a memory-efficient algorithm, and its implementation, for solving multidimensional numerical integration on a cluster of compute nodes with multiple GPU devices per node. The effective use of shared memory is important for improving performance on GPUs because of the bandwidth limitations of global memory. The best known sequential algorithm for multidimensional numerical integration, CUHRE, uses a large dynamic heap data structure that is accessed frequently. Devising a GPU algorithm that caches part of this data structure in shared memory so as to minimize global memory accesses is a challenging task. The algorithm presented here addresses this problem. Furthermore, we propose a technique to scale this algorithm to multiple GPU devices. The algorithm was implemented on a cluster of Intel Xeon X5650 compute nodes with 4 Tesla M2090 GPU devices per node. We observed a speedup of up to 240 on a single GPU device, compared to a speedup of 70 when the memory optimization was not used. On a cluster of 6 nodes (24 GPU devices) we obtained a speedup of up to 3250. All speedups are with reference to the sequential implementation running on the compute node.
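
The heap-driven control structure referred to above can be sketched serially as below: keep a priority queue of regions ordered by estimated error and repeatedly split the worst region. The two-estimate error rule and all names are illustrative; the memory-efficient GPU caching scheme and the multi-GPU scaling technique of the paper are not reproduced.

import heapq
import numpy as np

def global_adaptive_integrate(f, lo, hi, tol=1e-6, max_regions=20000):
    """Globally adaptive integration driven by a heap of regions.

    Each region carries (estimate, error); the region with the largest error
    is repeatedly popped, split along its widest dimension, and re-inserted --
    the same kind of dynamic heap that CUHRE-style integrators maintain.
    """
    def evaluate(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        coarse = f((a + b) / 2.0) * np.prod(b - a)
        dim = int(np.argmax(b - a))
        mid = (a[dim] + b[dim]) / 2.0
        b_l, a_r = b.copy(), a.copy()
        b_l[dim], a_r[dim] = mid, mid
        fine = (f((a + b_l) / 2.0) * np.prod(b_l - a)
                + f((a_r + b) / 2.0) * np.prod(b - a_r))
        return fine, abs(fine - coarse), a, b

    est, err, a, b = evaluate(lo, hi)
    heap = [(-err, 0, est, tuple(a), tuple(b))]   # max-heap via negated error
    total_est, total_err, counter = est, err, 0
    while total_err > tol and len(heap) < max_regions:
        neg_err, _, est, a, b = heapq.heappop(heap)
        total_est -= est
        total_err += neg_err                      # remove this region's contribution
        a, b = np.array(a), np.array(b)
        dim = int(np.argmax(b - a))
        mid = (a[dim] + b[dim]) / 2.0
        b_l, a_r = b.copy(), a.copy()
        b_l[dim], a_r[dim] = mid, mid
        for lo_half, hi_half in ((a, b_l), (a_r, b)):
            e, er, lo_half, hi_half = evaluate(lo_half, hi_half)
            counter += 1
            heapq.heappush(heap, (-er, counter, e, tuple(lo_half), tuple(hi_half)))
            total_est += e
            total_err += er
    return total_est

# Example: integrate exp(-(x^2 + y^2 + z^2)) over the unit cube.
print(global_adaptive_integrate(lambda p: np.exp(-np.sum(p * p)),
                                [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))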


European Consortium for Mathematics in Industry | 2014

BLAS Extensions for Algebraic Pricing Methods

Claudio Albanese; Paolo Regondi; Mohammad Zubair

PDE pricing methods such as backward and forward induction are typically implemented as unconditionally marginally stable algorithms in double precision for individual transactions. In this paper, we reconsider this strategy and argue that optimal GPU implementations should be based on a quite different strategy involving higher level BLAS routines. We argue that it is advantageous to use conditionally strongly stable algorithms in single precision and to price concurrently sub-portfolios of similar transactions. To support these operator algebraic methods, we propose some BLAS extensions. CUDA implementations of our extensions turn out to be significantly faster than implementations based on standard cuBLAS. The key to the performance gain of our implementation is in the efficient utilization of the memory system of the new GPU architecture.
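
The level-3 BLAS formulation advocated here can be illustrated with a small numpy sketch: one propagation operator advances the value vectors of an entire sub-portfolio with a single single-precision matrix-matrix product per step. The operator, portfolio sizes, and function name are illustrative assumptions, not the paper's BLAS extensions or CUDA kernels.

import numpy as np

def price_subportfolio(transition, terminal_values, steps):
    """Backward induction for a batch of deals that share one propagation operator.

    transition:      (n_states, n_states) single-precision operator for one time step
    terminal_values: (n_states, n_deals) terminal value vectors, one column per deal
    steps:           number of backward-induction steps

    Instead of n_deals separate matrix-vector products per step (GEMV), the whole
    sub-portfolio is advanced with one matrix-matrix product per step (GEMM),
    which is the higher-level BLAS formulation the paper advocates.
    """
    values = terminal_values.astype(np.float32)
    op = transition.astype(np.float32)
    for _ in range(steps):
        values = op @ values          # one GEMM advances every deal at once
    return values

# Toy usage with a random sub-stochastic operator and 64 deals on 128 states.
rng = np.random.default_rng(0)
n_states, n_deals = 128, 64
op = rng.random((n_states, n_states), dtype=np.float32)
op /= op.sum(axis=1, keepdims=True) * 1.01          # rows sum to < 1 (crude discounting)
terminal = rng.random((n_states, n_deals), dtype=np.float32)
print(price_subportfolio(op, terminal, steps=50).shape)   # (128, 64)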

Collaboration


Dive into Mohammad Zubair's collaboration.

Top Co-Authors

Desh Ranjan, Old Dominion University
Balsa Terzic, Northern Illinois University
Michael Creel, Autonomous University of Barcelona