Ariful Azad

Lawrence Berkeley National Laboratory

Publication

Featured research published by Ariful Azad.

international parallel and distributed processing symposium | 2015

Parallel Triangle Counting and Enumeration Using Matrix Algebra

Ariful Azad; Aydin Buluç; John R. Gilbert

Triangle counting and enumeration are important kernels that are used to characterize graphs. They are also used to compute important statistics such as clustering coefficients. We provide a simple exact algorithm that is based on operations on sparse adjacency matrices. By parallelizing the individual sparse matrix operations, we achieve a parallel algorithm for triangle counting. The algorithm is generalizable to triangle enumeration by modifying the semiring that underlies the matrix algebra. We present a new primitive, masked matrix multiplication, that can be beneficial especially for the enumeration case. We provide results from an initial implementation for the counting case along with various optimizations for communication reduction and load balance.
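As a rough illustration of the matrix-algebraic idea in this abstract, the sketch below counts triangles by splitting the adjacency matrix A into its strictly lower and upper triangular parts L and U, forming the product L·U only at positions where A is nonzero (the "mask"), and halving the sum. It is a serial, pure-Python sketch under assumed data structures (an edge list and neighbor sets); the function name `count_triangles` is hypothetical and this is not the authors' parallel implementation.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build a symmetric adjacency structure A, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # lower[i] holds the neighbors k of i with k < i.
    # This is row i of L and, by symmetry, column i of U.
    lower = {v: {k for k in nbrs if k < v} for v, nbrs in adj.items()}

    total = 0
    for i, nbrs in adj.items():
        for j in nbrs:
            # Masked product: (L*U)[i][j] is evaluated only where A[i][j] != 0.
            total += len(lower[i] & lower[j])

    # Each triangle is seen once at (i, j) and once at (j, i).
    return total // 2


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))
```

Restricting the product to the nonzero positions of A avoids materializing wedge counts for vertex pairs that are not connected by an edge, which is the intuition behind a masked multiplication.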

international parallel and distributed processing symposium | 2012

Multithreaded Algorithms for Maximum Matching in Bipartite Graphs

Ariful Azad; Mahantesh Halappanavar; Sivasankaran Rajamanickam; Erik G. Boman; Arif M. Khan; Alex Pothen

We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm? We answer this question in the affirmative. We report efficient multithreaded implementations of three classes of algorithms based on their manner of searching for augmenting paths: breadth-first-search, depth-first-search, and a combination of both. The Karp-Sipser initialization algorithm is used to make the parallel algorithms practical. We report extensive results and insights using three shared-memory platforms (a 48-core AMD Opteron, a 32-core Intel Nehalem, and a 128-processor Cray XMT) on a representative set of real-world and synthetic graphs. To the best of our knowledge, this is the first study of augmentation-based parallel algorithms for bipartite cardinality matching that demonstrates good speedups on multithreaded shared memory multiprocessors.
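For readers unfamiliar with augmenting paths, the following is a minimal serial sketch of the depth-first-search variant. It assumes left vertices are numbered 0..n-1 with `adj[u]` listing the right-side neighbors of `u`; the name `max_bipartite_matching` and this representation are illustrative assumptions. It omits the Karp-Sipser initialization and all multithreading described in the abstract.

```python
def max_bipartite_matching(adj, n_right):
    """Return (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched."""
    match_right = [-1] * n_right

    def try_augment(u, visited):
        # Depth-first search for an augmenting path starting at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or the left vertex currently using v can be re-matched.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        # A fresh visited set per search keeps each search independent.
        if try_augment(u, set()):
            size += 1
    return size, match_right


size, match_right = max_bipartite_matching([[0, 1], [0], [1, 2]], 3)
print(size, match_right)
```

The order in which the outer loop visits left vertices is exactly the serial ordering that the abstract notes performance is sensitive to; a parallel version has to give up that fixed order.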

SIAM Journal on Scientific Computing | 2016

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Ariful Azad; Grey Ballard; Aydin Buluç; James Demmel; Laura Grigori; Oded Schwartz; Sivan Toledo; Samuel Williams

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
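To make the SpGEMM primitive itself concrete, here is a small serial sketch of a row-by-row sparse product using a hash-map accumulator. The dict-of-dicts storage and the name `spgemm` are assumptions chosen for brevity; this shows only the local computation, not the 3D distributed formulation or the multithreading that the paper is about.

```python
def spgemm(a_rows, b_rows):
    """Multiply sparse matrices stored as {row: {col: value}} dictionaries."""
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = {}  # sparse accumulator for row i of the output
        for k, a_ik in a_row.items():
            # Scale row k of B by A[i][k] and merge it into the accumulator.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            c_rows[i] = acc
    return c_rows


A = {0: {0: 1, 1: 2}, 1: {1: 3}}
B = {0: {1: 4}, 1: {0: 5, 1: 6}}
print(spgemm(A, B))
```

Only nonzero entries of A and the matching rows of B are ever touched, so the work depends on the number of nonzero products rather than on the matrix dimensions.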

BMC Bioinformatics | 2012

Matching phosphorylation response patterns of antigen-receptor-stimulated T cells via flow cytometry.

Ariful Azad; Saumyadipta Pyne; Alex Pothen

Background: When flow cytometric data on mixtures of cell populations are collected from samples under different experimental conditions, computational methods are needed (a) to classify the samples into similar groups, and (b) to characterize the changes within the corresponding populations due to the different conditions. Manual inspection has been used in the past to study such changes, but high-dimensional experiments necessitate developing new computational approaches to this problem. A robust solution to this problem is to construct distinct templates to summarize all samples from a class, and then to compare these templates to study the changes across classes or conditions.

Results: We designed a hierarchical algorithm, flowMatch, to first match the corresponding clusters across samples for producing robust meta-clusters, and to then construct a high-dimensional template as a collection of meta-clusters for each class of samples. We applied the algorithm on flow cytometry data obtained from human blood cells before and after stimulation with anti-CD3 monoclonal antibody, which is reported to change phosphorylation responses of memory and naive T cells. The flowMatch algorithm is able to construct representative templates from the samples before and after stimulation, and to match corresponding meta-clusters across templates. The templates of the pre-stimulation and post-stimulation data corresponding to memory and naive T cell populations clearly show, at the level of the meta-clusters, the overall phosphorylation shift due to the stimulation.

Conclusions: We concisely represent each class of samples by a template consisting of a collection of meta-clusters (representative abstract populations). Using flowMatch, the meta-clusters across samples can be matched to assess overall differences among the samples of various phenotypes or time-points.

workshop on algorithms in bioinformatics | 2010

Identifying rare cell populations in comparative flow cytometry

Ariful Azad; Johannes Langguth; Youhan Fang; Alan Qi; Alex Pothen

Multi-channel, high throughput experimental methodologies for flow cytometry are transforming clinical immunology and hematology, and require the development of algorithms to analyze the high-dimensional, large-scale data. We describe the development of two combinatorial algorithms to identify rare cell populations in data from mice with acute promyelocytic leukemia. The flow cytometry data is clustered, and then samples from the leukemic, pre-leukemic, and Wild Type mice are compared to identify clusters belonging to the diseased state. We describe three metrics on the clustered data that help in identifying rare populations. We formulate a generalized edge cover approach in a bipartite graph model to directly compare clusters in two samples to identify clusters belonging to one but not the other sample. For detecting rare populations common to many diseased samples but not to the Wild Type, we describe a clique-based branch and bound algorithm. We provide statistical justification of the significance of the rare populations.

international parallel and distributed processing symposium | 2016

Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Penporn Koanantakool; Ariful Azad; Aydin Buluç; Dmitriy Morozov; S. Oh; Leonid Oliker; Katherine A. Yelick

Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases where both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which, while suboptimal in dense-dense and sparse-sparse cases, outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.

parallel computing | 2014

On parallel push-relabel based algorithms for bipartite maximum matching

Johannes Langguth; Ariful Azad; Mahantesh Halappanavar; Fredrik Manne

We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing the maximum transversal of a matrix. Other applications can be found in many fields such as bioinformatics (Azad et al., 2010) [4], scheduling (Timmer and Jess, 1995) [27], and chemical structure analysis (John, 1995) [14]. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a test set comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for the parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.

international conference on bioinformatics | 2013

Classifying Immunophenotypes With Templates From Flow Cytometry

Ariful Azad; Arif M. Khan; Bartek Rajwa; Saumyadipta Pyne; Alex Pothen

We describe an algorithm to dynamically classify flow cytometry data samples into several classes based on their immunophenotypes. Flow cytometry data consists of fluorescence measurements of several proteins that characterize different cell types in blood or cultured cell lines. Each sample is initially clustered to identify the cell populations present in it. Using a combinatorial dissimilarity measure between cell populations in samples, we compute meta-clusters that correspond to the same cell population across samples. The collection of meta-clusters in a class of samples then describes a template for that class. We organize the samples into a template tree, and use it to classify new samples into existing classes or create a new class if needed. We dynamically update the templates and their statistical parameters as new samples are classified, so that the new information is reflected in the classes. We use our dynamic classification algorithm to classify T cells that on stimulation with an antibody show increased abundance of the proteins SLP-76 and ZAP-70. These proteins are involved in a platform that assembles signaling proteins in the immune response. We also use the algorithm to show that variation in an immune subsystem between individuals is a larger effect than variation in multiple samples from one individual.

international parallel and distributed processing symposium | 2012

Multithreaded Algorithms for Matching in Graphs with Application to Data Analysis in Flow Cytometry

Ariful Azad; Alex Pothen

We study parallel algorithms for computing matchings in graphs and apply them to solve the population registration problem from bio-imaging data. We have developed several classes of multithreaded algorithms for maximum cardinality matching and achieved good speedups on three shared memory machines on a representative set of large real-world and synthetic graphs. The parallel machines include processors that employ multithreading and caches (Intel Nehalem and AMD Opteron) and a machine that employs massive multithreading and a flat memory model (Cray XMT). The bio-imaging application involves registering different cell populations across samples using flow cytometry data. The population registration problem is solved by a generalized edge cover, computed from a weighted matching. We have used this approach to differentiate leukemic cells from healthy ones, and to identify phosphorylation shifts in T cells due to stimulation with an antibody. In current work, we are adapting the concept of consistency used in multiple sequence alignments to the population registration problem for large sample sets.

international parallel and distributed processing symposium | 2017

A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm

Ariful Azad; Aydin Buluç

We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns. Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types, including matrices representing scale-free and high-diameter graphs.
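The SpMSpV operation described above can be sketched serially as follows. The sketch assumes a column-based layout where `columns[j]` lists the `(row, value)` pairs of column j, and a sparse input vector stored as `{index: value}`; these structures and the name `spmspv` are illustrative assumptions. It shows only the operation and why its work can track the number of multiplications, not the authors' multithreaded algorithm.

```python
def spmspv(columns, x):
    """Compute y = A * x for a sparse matrix A and sparse vector x.

    columns: {col_index: [(row_index, value), ...]} (column-based storage)
    x:       {index: value} holding only the nonzeros of the input vector
    Returns y as {row_index: value}.
    """
    y = {}  # sparse accumulator for the output vector
    for j, x_j in x.items():
        # Only columns selected by a nonzero x[j] are visited, so the work is
        # proportional to the number of multiplications performed.
        for i, a_ij in columns.get(j, ()):
            y[i] = y.get(i, 0) + a_ij * x_j
    return y


columns = {0: [(0, 1.0), (2, 2.0)], 1: [(1, 3.0)], 2: [(0, 4.0)]}
x = {0: 1.0, 2: 0.5}
print(spmspv(columns, x))
```

Column 1 is never read here because x has no nonzero at index 1, which is the property that makes column-based storage a natural fit when the input vector is sparse.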
