
Abhinav Sarje

Lawrence Berkeley National Laboratory

Publication

Featured research published by Abhinav Sarje.

Journal of Applied Crystallography | 2013

HipGISAXS: a high-performance computing code for simulating grazing-incidence X-ray scattering data

Slim Chourou; Abhinav Sarje; Xiaoye S. Li; Elaine R. Chan; Alexander Hexemer

This article describes the development of a flexible grazing-incidence small-angle X-ray scattering (GISAXS) simulation code in the framework of the distorted wave Born approximation that effectively utilizes the parallel processing power provided by graphics processors and multicore processors. The code, entitled High-Performance GISAXS, computes the GISAXS image for any given superposition of user-defined custom shapes or morphologies in a material and for various grazing-incidence angles and sample orientations. These capabilities permit treatment of a wide range of possible sample structures, including multilayered polymer films and nanoparticles on top of or embedded in a substrate or polymer film layers. In cases where the material displays regions of significant refractive index contrast, an algorithm has been implemented to perform a slicing of the sample and compute the averaged refractive index profile to be used as the reference geometry of the unperturbed system. A number of case studies are presented, which demonstrate good agreement with the experimental data for a variety of polymer and hybrid polymer/nanoparticle composite materials. The parallelized simulation code is well suited for addressing the analysis efforts required by the increasing amounts of GISAXS data being produced by high-speed detectors and ultrafast light sources.

IEEE Transactions on Parallel and Distributed Systems | 2010

Parallel Information-Theory-Based Construction of Genome-Wide Gene Regulatory Networks

Jaroslaw Zola; Maneesha Aluru; Abhinav Sarje; Srinivas Aluru

Constructing genome-wide gene regulatory networks from large-scale gene expression data is an important problem in systems biology. While several techniques have been developed, none of them is parallel, and they do not scale to the whole genome level or incorporate the largest data sets, particularly with rigorous statistical techniques. In this paper, we present a parallel method integrating mutual information, data processing inequality, and statistical testing to detect significant dependencies between genes, and efficiently exploit parallelism inherent in such computations. We present a new method to carry out permutation testing for assessing statistical significance of interactions, while reducing its computational complexity by a factor of Θ(n^2), where n is the number of genes. Using both synthetic and known regulatory networks, we show that our method produces networks of quality similar to ARACNe, a widely used mutual-information-based method. We further explore the use of accelerators for gene network construction by presenting a parallelization on a cluster of IBM Cell blades. We exploit parallelization across multiple Cells, multiple cores within each Cell, and vector units within the cores to develop a high-performance implementation that effectively addresses the scaling problem. We report the first inference of a plant whole-genome network by constructing a 15,222 gene network of the plant Arabidopsis thaliana from 3,137 microarray experiments in 30 minutes on a 2,048-CPU IBM Blue Gene/L, and in 2 hours and 25 minutes on an 8-node Cell blade cluster.
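The combination of mutual information and the data processing inequality (DPI) described in this abstract can be illustrated with a small serial sketch. It is not the authors' parallel implementation: the function names, the equal-width binning, and the fixed `threshold` (standing in for the permutation test) are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, bins=3):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant profiles fall into bin 0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats from two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def dpi_prune(mi, genes):
    """Drop the weakest edge of every gene triple whose three edges all exist."""
    keep = set(mi)
    for a, b, c in combinations(genes, 3):
        edges = [frozenset(p) for p in ((a, b), (a, c), (b, c))]
        if all(e in mi for e in edges):
            keep.discard(min(edges, key=lambda e: mi[e]))
    return keep


def build_network(profiles, threshold=0.1, bins=3):
    """Return the set of gene pairs kept after thresholding and DPI pruning.

    `threshold` is a hypothetical stand-in for a statistical significance test.
    """
    genes = sorted(profiles)
    disc = {g: discretize(profiles[g], bins) for g in genes}
    mi = {}
    for a, b in combinations(genes, 2):
        value = mutual_information(disc[a], disc[b])
        if value > threshold:
            mi[frozenset((a, b))] = value
    return dpi_prune(mi, genes)
```

In a chain where one gene influences a second and the second influences a third, the indirect pair typically has the lowest mutual information of the three, so the DPI step removes it and leaves only the two direct edges.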

International Parallel and Distributed Processing Symposium | 2008

Parallel biological sequence alignments on the Cell Broadband Engine

Abhinav Sarje; Srinivas Aluru

Sequence alignment and its many variants are a fundamental tool in computational biology. There is considerable recent interest in using the Cell Broadband Engine, a heterogeneous multi-core chip that provides high performance, for biological applications. However, work so far has been limited to computing optimal alignment scores using quadratic space under the basic global/local alignment algorithm. In this paper, we present a comprehensive study of developing sequence alignment algorithms on the Cell, exploiting its thread and data level parallelism features. First, we develop a Cell implementation that computes optimal alignments and adopts Hirschberg's linear space technique. The former is essential as merely computing optimal alignment scores is not useful while the latter is needed to permit alignments of longer sequences. We then present Cell implementations of two advanced alignment techniques: spliced alignments and syntenic alignments. In a spliced alignment, consecutive non-overlapping portions of a sequence align with ordered non-overlapping, but non-consecutive portions of another sequence. Spliced alignments are useful in aligning mRNA sequences with corresponding genomic sequences to uncover gene structure. Syntenic alignments are used to discover conserved exons and other sequences between long genomic sequences from different organisms. We present experimental results for these three types of alignments on the Cell BE and report speedups of about 4 on six SPUs on the PlayStation 3, when compared to the respective best serial algorithms on the Cell BE and the Pentium 4 processor.
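Hirschberg's linear-space technique, which the abstract relies on to recover full alignments rather than only scores, can be sketched serially as follows. This is a textbook-style illustration, not the Cell implementation from the paper; the scoring values (`match`, `mismatch`, `gap`) and function names are assumptions, and the sequences are assumed not to contain the `-` character.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the global-alignment score table, kept in O(len(b)) space."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def hirschberg(a, b, match=1, mismatch=-1, gap=-1):
    """Return one optimal global alignment (top, bottom) using linear space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair a[0] with its best-scoring column of b, or with a gap.
        scores = [match if a == ch else mismatch for ch in b]
        j = max(range(len(b)), key=scores.__getitem__)
        if scores[j] >= 2 * gap:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    # Forward scores for the top half, backward scores for the bottom half.
    left = nw_last_row(a[:mid], b, match, mismatch, gap)
    right = nw_last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    top_l, bot_l = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    top_r, bot_r = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return top_l + top_r, bot_l + bot_r


def alignment_score(top, bottom, match=1, mismatch=-1, gap=-1):
    """Score a gapped alignment column by column."""
    return sum(
        gap if "-" in (x, y) else (match if x == y else mismatch)
        for x, y in zip(top, bottom)
    )
```

The recursion only ever stores two rows of the score table at a time, which is what allows longer sequences to fit in a small memory; the trade-off is roughly twice the computation of the quadratic-space method.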

IEEE Transactions on Parallel and Distributed Systems | 2009

Parallel Genomic Alignments on the Cell Broadband Engine

Abhinav Sarje; Srinivas Aluru

Genomic alignments, as a means to uncover evolutionary relationships among organisms, are a fundamental tool in computational biology. There is considerable recent interest in using the Cell Broadband Engine, a heterogeneous multicore chip that provides high performance, for biological applications. However, work in genomic alignments so far has been limited to computing optimal alignment scores using quadratic space for the basic global/local alignment problem. In this paper, we present a comprehensive study of developing alignment algorithms on the Cell, exploiting its thread and data level parallelism features. First, we develop a parallel implementation on the Cell that computes optimal alignments and adopts Hirschberg's linear space technique. The former is essential, as merely computing optimal alignment scores is not useful, while the latter is needed to permit alignments of longer sequences. We then present Cell implementations of two advanced alignment techniques: spliced alignments and syntenic alignments. Spliced alignments are useful in aligning mRNA sequences with corresponding genomic sequences to uncover the gene structure. Syntenic alignments are used to discover conserved exons and other sequences between long genomic sequences from different organisms. We present experimental results for these three types of alignments on 16 Synergistic Processing Elements of the IBM QS20 dual-Cell blade system.

IEEE International Conference on High Performance Computing Data and Analytics | 2016

Evaluating and optimizing the NERSC workload on Knights Landing

Taylor Barnes; Brandon Cook; Jack Deslippe; Douglas W. Doerfler; Brian Friesen; Yun He; Thorsten Kurth; Tuomas Koskela; Mathieu Lobet; Tareq M. Malas; Leonid Oliker; Andrey Ovsyannikov; Abhinav Sarje; Jean-Luc Vay; Henri Vincenti; Samuel Williams; Pierre Carrier; Nathan Wichmann; Marcus Wagner; Paul R. C. Kent; Christopher Kerr; John M. Dennis

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture differences between the Xeon-Phi and traditional Xeon processors. We summarize the status of the applications and describe the greater optimization strategy that has formed.

IEEE Transactions on Parallel and Distributed Systems | 2011

Accelerating Pairwise Computations on Cell Processors

Abhinav Sarje; Jaroslaw Zola; Srinivas Aluru

Direct computation of all pairwise distances or interactions is a fundamental problem that arises in many application areas including particle or atomistic simulations, fluid dynamics, computational electromagnetics, materials science, genomics and systems biology, and clustering and data mining. In this paper, we present methods for performing such pairwise computations efficiently in parallel on Cell processors. This problem is particularly challenging on the Cell processor due to the small Local Stores of the Synergistic Processing Elements, the main computational cores of the processor. We present techniques for different variants of this problem, including those with a large number of entities or when the dimensionality of the information per entity is large. We demonstrate our methods in the context of multiple applications drawn from fluid dynamics, materials science and systems biology, and present detailed experimental results. Our software library is open source and can be readily used by application scientists to accelerate pairwise computations using Cell accelerators.

Parallel Computing | 2013

All-pairs computations on many-core graphics processors

Abhinav Sarje; Srinivas Aluru

Developing high-performance applications on emerging multi- and many-core architectures requires efficient mapping techniques and architecture-specific tuning methodologies to realize performance closer to their peak compute capability and memory bandwidth. In this paper, we develop architecture-aware methods to accelerate all-pairs computations on many-core graphics processors. Pairwise computations occur frequently in numerous application areas in scientific computing. While they appear easy to parallelize due to the independence of computing each pairwise interaction from all others, techniques that address multi-layered memory hierarchies, map the computation within the restrictions imposed by the small, low-latency on-chip memories, and strike the right balance between concurrency, reuse, and memory traffic are crucial to obtaining high performance. We present a hierarchical decomposition scheme for GPUs based on decomposition of the output matrix and input data. We demonstrate that a careful tuning of the involved set of decomposition parameters is essential to achieve high efficiency on the GPUs. We also compare the performance of our strategies with an implementation on the STI Cell processor as well as multi-core CPU parallelizations using OpenMP and Intel Threading Building Blocks.
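The idea of decomposing the output matrix into blocks can be shown with a single-level, serial sketch. It only illustrates the tiling pattern under assumed names (`pair_distance`, `all_pairs_tiled`, `tile`) and an assumed Euclidean distance; it is not the hierarchical GPU scheme from the paper, where the tile sizes would be tuned to the on-chip memory.

```python
import math


def pair_distance(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def all_pairs_tiled(points, tile=4):
    """Fill the full distance matrix one (tile x tile) output block at a time."""
    n = len(points)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        rows = points[i0:i0 + tile]      # row inputs staged once per band of tiles
        for j0 in range(0, n, tile):
            cols = points[j0:j0 + tile]  # column inputs reused by every row in the tile
            for di, p in enumerate(rows):
                for dj, q in enumerate(cols):
                    out[i0 + di][j0 + dj] = pair_distance(p, q)
    return out
```

Each tile touches at most `2 * tile` input points while producing up to `tile * tile` outputs, so larger tiles increase reuse per loaded input, up to whatever the fast memory can hold.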

International Conference on Conceptual Structures | 2015

Parallel performance optimizations on unstructured mesh-based simulations

Abhinav Sarje; Sukhyun Song; Douglas W. Jacobsen; Kevin A. Huck; Jeffrey K. Hollingsworth; Allen D. Malony; Samuel Williams; Leonid Oliker

This paper addresses two key parallelization challenges in the unstructured mesh-based ocean modeling code, MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

IEEE International Conference on High Performance Computing Data and Analytics | 2012

Massively parallel X-ray scattering simulations

Abhinav Sarje; Xiaoye S. Li; Slim Chourou; Elaine R. Chan; Alexander Hexemer

Although present X-ray scattering techniques can provide tremendous information on the nano-structural properties of materials that are valuable in the design and fabrication of energy-relevant nano-devices, a primary challenge remains in the analyses of such data. In this paper we describe a high-performance, flexible, and scalable Grazing Incidence Small Angle X-ray Scattering simulation algorithm and codes that we have developed on multi-core/CPU and many-core/GPU clusters. We discuss in detail our implementation, optimization and performance on these platforms. Our results show speedups of ~125x on a Fermi-GPU and ~20x on a Cray-XE6 24-core node, compared to a sequential CPU code, with near linear scaling on multi-node clusters. To our knowledge, this is the first GISAXS simulation code that is flexible enough to compute scattered light intensities in all spatial directions, allowing full reconstruction of GISAXS patterns for any complex structures and with high resolutions while reducing simulation times from months to minutes.

International Conference on Parallel Processing | 2009

Constructing Gene Regulatory Networks on Clusters of Cell Processors

Jaroslaw Zola; Abhinav Sarje; Srinivas Aluru

Constructing genome-wide gene regulatory networks from a large number of gene expression profile measurements is an important problem in systems biology. While several techniques have been developed, none of them is parallel, and they lack the capability to scale to the whole-genome level or incorporate the largest data sets, particularly with rigorous statistical testing. To address this problem, we recently developed a parallel method for gene network reconstruction based on mutual information theory. In this paper, we extend this work to a cluster of Cell processors. We use parallelization across multiple Cells, multiple cores within each Cell, and vector units within the cores to develop a high-performance implementation that effectively addresses the scaling problem. We present experimental results comparing the Cell implementation with a standard uniprocessor implementation and an implementation on a conventional supercomputer. Finally, we report the construction of a large 15,203 gene network of the plant Arabidopsis thaliana from 2,996 microarray experiments on an 8-node Cell blade cluster in 2 hours and 24 minutes.
