
Publication


Featured research published by Olivia Choudhury.


Artificial Intelligence Review | 2012

Hyperspectral image classification incorporating bacterial foraging-optimized spectral weighting

Ankush Chakrabarty; Olivia Choudhury; Pallab Sarkar; Avishek Paul; Debarghya Sarkar

The present paper describes the development of a hyperspectral image classification scheme using support vector machines (SVM) with spectrally weighted kernels. The kernels are designed during the training phase of the SVM using optimal spectral weights estimated with the Bacterial Foraging Optimization (BFO) algorithm, a popular modern stochastic optimization algorithm. The optimized kernel functions are then used in the SVM paradigm for bi-classification of pixels in hyperspectral images. The effectiveness of the proposed approach is demonstrated by implementing it on three widely used benchmark hyperspectral data sets, two of which were taken over agricultural sites at Indian Pines, Indiana, and Salinas Valley, California, by the Airborne Visible Infra-Red Imaging Spectrometer (AVIRIS) at NASA’s Jet Propulsion Laboratory. The third dataset was acquired using the Reflective Optics System Imaging Spectrometer (ROSIS) over an urban scene at Pavia University, Italy, to demonstrate the efficacy of the proposed approach in an urban scenario as well as with agricultural data. Classification errors for One-Against-One (OAO) and classification accuracies for One-Against-All (OAA) schemes were computed and compared to other recently developed methods. Finally, the use of the BFO-based technique is recommended owing to its superior performance in comparison to other contemporary stochastic bio-inspired algorithms.
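The core idea of a spectrally weighted kernel can be sketched as an RBF kernel in which each spectral band is rescaled by a per-band weight before the distance is computed. In the paper the weights are estimated by BFO; here they are hand-picked toy values, and all names are illustrative rather than the authors' code.

```python
import math

def weighted_rbf_kernel(x, y, weights, gamma=1.0):
    """Spectrally weighted RBF kernel: exp(-gamma * sum_i (w_i * (x_i - y_i))^2)."""
    d2 = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-gamma * d2)

# Two toy 4-band "pixel" spectra; the weights emphasize the first two bands,
# so differences there dominate the kernel value.
x = [0.2, 0.5, 0.1, 0.9]
y = [0.3, 0.4, 0.1, 0.8]
w = [2.0, 2.0, 0.5, 0.5]
k = weighted_rbf_kernel(x, y, w)
```

A kernel of this form stays positive definite for any fixed nonnegative weights, so it can be dropped into a standard SVM training routine unchanged.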


cluster computing and the grid | 2014

Accelerating Comparative Genomics Workflows in a Distributed Environment with Optimized Data Partitioning

Olivia Choudhury; Nicholas L. Hazekamp; Douglas Thain; Scott J. Emrich

The advent of new sequencing technology has generated massive amounts of biological data at unprecedented rates. High-throughput bioinformatics tools are required to keep pace with this growth. Here, we implement a workflow-based model for parallelizing the data-intensive tasks of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore different approaches to partitioning data and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We identify the various challenges encountered while developing such an application and provide insight into addressing them. We report significant performance improvements, from 12 days to 4 hours, while running the BWA-GATK pipeline using 100 nodes to analyze high-coverage oak tree data.
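Granularity-based partitioning amounts to splitting the read set into fixed-size chunks that can be aligned in parallel and merged afterwards. A minimal sketch, with an illustrative chunk size and record layout rather than the paper's actual pipeline:

```python
def partition_reads(records, chunk_size):
    """Split a list of sequencing records into fixed-size chunks for parallel alignment."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

# Ten toy reads split at granularity 4 -> chunks of sizes 4, 4, 2,
# each of which could be dispatched to a separate alignment task.
reads = [f"read_{i}" for i in range(10)]
chunks = partition_reads(reads, 4)
```

The choice of chunk size is the "granularity" being tuned: too coarse limits concurrency, too fine inflates per-task dispatch and merge overhead.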


Molecular Biology Reports | 2013

An integrated pathway system modeling of Saccharomyces cerevisiae HOG pathway: a Petri net based approach.

Namrata Tomar; Olivia Choudhury; Ankush Chakrabarty; Rajat K. De

Biochemical networks comprise many diverse components and interactions among them. They include intracellular signaling, metabolic, and gene regulatory pathways, which are highly integrated and whose responses are elicited by extracellular actions. Previous modeling techniques mostly consider each pathway independently, without focusing on the interrelation of pathways that actually function as a single system. In this paper, we propose an approach to modeling an integrated pathway using an event-driven modeling tool, i.e., Petri nets (PNs). PNs have the ability to simulate the dynamics of the system with high levels of accuracy. The integrated set of signaling, regulatory, and metabolic reactions involved in Saccharomyces cerevisiae’s HOG pathway has been collected from the literature. The kinetic parameter values have been used for transition firings. The dynamics of the system have been simulated and the concentrations of major biological species over time have been observed. The phenotypic characteristics of the integrated system have been investigated under two conditions, viz., the absence and presence of osmotic pressure. The results have been validated favorably against existing experimental results. We have also compared our study with the idFBA study (Lee et al., PLoS Comput Biol 4:e1000–e1086, 2008) and pointed out the differences between the two. We have simulated and monitored concentrations of multiple biological entities over time and also incorporated feedback inhibition by Ptp2, which was not included in the idFBA study. To the best of our knowledge, our study is the first to model signaling, metabolic, and regulatory events in an integrated form through a PN model framework. This study is useful for computational simulation of system dynamics for integrated pathways, as there is growing evidence that malfunctioning of the interplay among these pathways is associated with disease.
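The event-driven semantics of a Petri net can be sketched with places holding token counts and transitions that fire only when every input place carries enough tokens. This is a generic PN sketch, not the authors' HOG-pathway model; the reaction and token counts are illustrative.

```python
def fire(marking, transition):
    """Fire a transition if enabled; return the new marking, or the old one if disabled."""
    inputs, outputs = transition
    if any(marking[p] < n for p, n in inputs.items()):
        return marking                        # not enabled: some input place lacks tokens
    m = dict(marking)
    for p, n in inputs.items():
        m[p] -= n                             # consume input tokens
    for p, n in outputs.items():
        m[p] = m.get(p, 0) + n                # produce output tokens
    return m

# Toy reaction A + B -> C expressed as a single transition.
marking = {"A": 1, "B": 1, "C": 0}
t = ({"A": 1, "B": 1}, {"C": 1})
marking = fire(marking, t)
```

Simulating the system then reduces to repeatedly firing enabled transitions; kinetic parameters, as in the paper, would govern which enabled transition fires and when.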


BMC Genomics | 2017

High-quality genetic mapping with ddRADseq in the non-model tree Quercus rubra

Arpita Konar; Olivia Choudhury; Rebecca Bullis; Lauren Fiedler; Jacqueline M. Kruser; Melissa T. Stephens; Oliver Gailing; Scott E. Schlarbaum; Mark V. Coggeshall; Margaret Staton; John E. Carlson; Scott J. Emrich; Jeanne Romero-Severson

Background: Restriction site associated DNA sequencing (RADseq) has the potential to be a broadly applicable, low-cost approach for high-quality genetic linkage mapping in forest trees lacking a reference genome. The statistical inference of linear order must be as accurate as possible for the correct ordering of sequence scaffolds and contigs to chromosomal locations. Accurate maps also facilitate the discovery of chromosome segments containing allelic variants conferring resistance to the biotic and abiotic stresses that threaten forest trees worldwide. We used ddRADseq for genetic mapping in the tree Quercus rubra, with an approach optimized to produce a high-quality map. Our study design also enabled us to model the results we would have obtained with less depth of coverage.

Results: Our sequencing design produced a high sequencing depth in the parents (248×) and a moderate sequencing depth (15×) in the progeny. The digital normalization method of generating a de novo reference and the SAMtools SNP variant caller yielded the most SNP calls (78,725). The major drivers of map inflation were multiple SNPs located within the same sequence (77% of SNPs called). The highest quality map was generated with a low level of missing data (5%) and a genome-wide threshold of 0.025 for deviation from Mendelian expectation. The final map included 849 SNP markers (1.8% of the 78,725 SNPs called). Downsampling the individual FASTQ files to model lower depth of coverage revealed that sequencing the progeny using 96 samples per lane would have yielded too few SNP markers to generate a map, even if we had sequenced the parents at depth 248×.

Conclusions: The ddRADseq technology produced enough high-quality SNP markers to make a moderately dense, high-quality map. The success of this project was due to high depth of coverage of the parents, moderate depth of coverage of the progeny, a good framework map, an optimized bioinformatics pipeline, and rigorous premapping filters. The ddRADseq approach is useful for the construction of high-quality genetic maps in organisms lacking a reference genome if the parents and progeny are sequenced at sufficient depth. Technical improvements in reduced representation sequencing (RRS) approaches are needed to reduce the amount of missing data.
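The premapping filter for deviation from Mendelian expectation can be sketched as a chi-square goodness-of-fit test of observed allele counts against a 1:1 segregation ratio; a significance threshold of 0.025 at one degree of freedom corresponds to a chi-square critical value of about 5.02. The counts below are synthetic, and the test is a generic sketch rather than the paper's pipeline.

```python
def passes_segregation_filter(count_a, count_b, crit=5.02):
    """Chi-square goodness-of-fit against 1:1 Mendelian segregation (df = 1).

    crit = 5.02 approximates the chi-square critical value at p = 0.025.
    """
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    return chi2 < crit

ok = passes_segregation_filter(48, 52)    # mild deviation: marker kept
bad = passes_segregation_filter(80, 20)   # strong deviation: marker filtered out
```

Markers failing such a filter are excluded before map construction, since distorted segregation tends to inflate map length and scramble marker order.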


international conference on e-science | 2015

Scaling Up Bioinformatics Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow

Nicholas L. Hazekamp; Joseph Sarro; Olivia Choudhury; Sandra Gesing; Scott J. Emrich; Douglas Thain

Logical workflow management systems provide a user-friendly portal through which data can be processed using a sequence of standard tools. These logical workflows are a natural way to express the high-level intent of the user, and to share the structure and the results with other users. However, logical workflows are not necessarily suited to expressing parallelism for very large runs. As the amount of data is scaled up, the run time of each node in the logical workflow may become extreme. We propose a technique of job expansion to solve this problem. When job expansion is applied to a logical workflow, each node in the workflow is itself expanded into a large performance workflow that may consist of hundreds to thousands of tasks that can be executed in parallel, thus enabling high concurrency and scalability. From the user's perspective, nothing has changed and the logical workflow remains in its original form. To demonstrate this technique, we have applied job expansion to a selection of bioinformatics applications running in the Galaxy workflow management system. Each job in the workflow is expanded into a highly parallel workflow executed using Makeflow, which is well suited to express high levels of parallelism. Work Queue is then utilized for execution because of its ability to quickly dispatch tasks and cache files for later reuse. After applying job expansion, we improve the execution time of BWA by 18X and of GATK by 402X, for a total speedup of 61.5X on the workflow. We also examine the system's behavior since its launch to analyze its effectiveness.
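Job expansion replaces one logical node with a performance workflow of many independent tasks. The sketch below expands a single job over data chunks into Make-style rules of the kind Makeflow consumes, followed by a merge rule; the file names and the `aligner` command are placeholders, not the actual Galaxy integration.

```python
def expand_job(chunks, tool="aligner"):
    """Expand one logical job into one Make-style rule per data chunk, plus a merge rule."""
    rules, outputs = [], []
    for i in range(chunks):
        out = f"part_{i}.out"
        # Each rule: "target: sources" then a tab-indented command, as in Make/Makeflow.
        rules.append(f"{out}: part_{i}.in\n\t{tool} part_{i}.in > {out}")
        outputs.append(out)
    joined = " ".join(outputs)
    rules.append(f"result.out: {joined}\n\tcat {joined} > result.out")
    return "\n\n".join(rules)

workflow = expand_job(3)
```

Because each chunk rule depends only on its own input file, a workflow engine can dispatch all chunk tasks concurrently and run the merge once they finish.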


international conference on bioinformatics | 2016

HAPI-Gen: Highly Accurate Phasing and Imputation of Genotype Data

Olivia Choudhury; Ankush Chakrabarty; Scott J. Emrich

High-throughput sequencing is continuously generating large volumes of genotype data. Haplotype phasing can be an effective means to deal with large numbers of variants in analyses such as genome-wide association studies (GWAS), but the quality of such phasing is degraded by missing genotypes. Most existing imputation tools rely on reference genotype panels and a linear order of markers in the form of a physical or genetic map. A large number of genomics projects, however, do not have access to these resources. In this paper, we introduce HAPI-Gen, a Highly Accurate Phasing and Imputation tool for Genotype data that does not require reference genotype panels or the global order of markers, thereby filling an important need for improving phasing in less studied organisms. Other major advantages of HAPI-Gen include low runtime, reduced memory consumption, and easy parallelization. We test HAPI-Gen on the malaria parasite Plasmodium falciparum and three plant datasets of grape, apple, and maize. For varying data sizes and proportions of missing genotypes, HAPI-Gen consistently performs better than the leading tools Beagle, IMPUTE2, and LinkImpute in terms of accuracy, runtime, and memory usage.
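The simplest form of reference-free imputation, sketched below, fills a missing call from the genotypes observed at the same marker in other samples. This is only the general idea of imputing from the data itself, not HAPI-Gen's algorithm; the genotype encoding is illustrative.

```python
def impute_column(genotypes):
    """Fill missing calls (None) with the most frequent observed genotype at this marker."""
    observed = [g for g in genotypes if g is not None]
    mode = max(set(observed), key=observed.count)
    return [mode if g is None else g for g in genotypes]

# One marker across six samples, with two missing calls.
col = ["A", "A", None, "T", "A", None]
filled = impute_column(col)
```

Real reference-free tools refine this by also exploiting correlations with nearby markers and similarity between samples, which is what makes them competitive with panel-based methods.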


international conference on cluster computing | 2015

Balancing Thread-Level and Task-Level Parallelism for Data-Intensive Workloads on Clusters and Clouds

Olivia Choudhury; Dinesh Rajan; Nicholas L. Hazekamp; Sandra Gesing; Douglas Thain; Scott J. Emrich

The runtime configuration of parallel and distributed applications remains a mysterious art. To tune an application on a particular system, the end-user must choose the number of machines, the number of cores per task, the data partitioning strategy, and so on, all of which result in a combinatorial explosion of choices. While one might try to exhaustively evaluate all choices in search of the optimal, the end user's goal is simply to run the application once with reasonable performance by avoiding terrible configurations. To address this problem, we present a hybrid technique based on regression models for tuning data-intensive bioinformatics applications: the sequential computational kernel is characterized empirically and then incorporated into an ab initio model of the distributed system. We demonstrate this technique on the commonly-used applications BWA, Bowtie2, and BLASR and validate the accuracy of our proposed models on clouds and clusters.
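A minimal version of such a regression model fits an Amdahl-style curve, t(n) = a + b/n, to a few measured (workers, runtime) points by ordinary least squares over 1/n, then predicts runtimes for untried configurations. The measurements below are synthetic, and the model is a sketch of the general approach rather than the paper's actual model.

```python
def fit_runtime_model(measurements):
    """Least-squares fit of t(n) = a + b/n from (workers, runtime) pairs."""
    xs = [1.0 / n for n, _ in measurements]
    ys = [t for _, t in measurements]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx                    # a: serial fraction, b: parallelizable work
    return a, b

# Synthetic measurements generated from t(n) = 10 + 600/n.
data = [(1, 610.0), (2, 310.0), (4, 160.0), (8, 85.0)]
a, b = fit_runtime_model(data)
predicted_16 = a + b / 16              # extrapolate to an untried 16-worker run
```

Even a crude fit like this is enough to rule out "terrible" configurations, which matches the stated goal of running once with reasonable performance.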


IEEE Transactions on Parallel and Distributed Systems | 2018

Combining Static and Dynamic Storage Management for Data Intensive Scientific Workflows

Nicholas L. Hazekamp; Nathaniel Kremer-Herman; Benjamín Tovar; Haiyan Meng; Olivia Choudhury; Scott J. Emrich; Douglas Thain

Workflow management systems are widely used to express and execute highly parallel applications. For data-intensive workflows, storage can be the constraining resource: The number of tasks running at once must be artificially limited to not overflow the space available in the filesystem. It is all too easy for a user to dispatch a workflow which consumes all available storage and disrupts all system users. To address these issues, we present a three-tiered approach to workflow storage management: (1) A static analysis algorithm which analyzes the storage needs of a workflow before execution, giving a realistic prediction of success or failure. (2) An online storage management algorithm which accounts for the storage needed by future tasks to avoid deadlock at runtime. (3) A task containment system which limits storage consumption of individual tasks, enabling the strong guarantees of the static analysis and dynamic management algorithms. We demonstrate the application of these techniques on three complex workflows.
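A toy version of the static-analysis idea: walk the tasks in execution order, materialize each task's outputs, delete files no later task reads, and record the peak footprint. The task list and file sizes are illustrative, and this one-task-at-a-time walk is a simplification of the paper's algorithm.

```python
def peak_storage(tasks, sizes):
    """tasks: list of (inputs, outputs); return peak storage if tasks run one at a time."""
    live, peak = set(), 0
    for i, (ins, outs) in enumerate(tasks):
        live |= set(outs)                                    # outputs materialize
        peak = max(peak, sum(sizes[f] for f in live))
        still_needed = {f for later in tasks[i + 1:] for f in later[0]}
        live = {f for f in live if f in still_needed}        # free files nobody reads later
    return peak

# A three-stage toy pipeline: fetch raw data, align it, call variants.
sizes = {"raw": 100, "aligned": 80, "calls": 5}
tasks = [([], ["raw"]), (["raw"], ["aligned"]), (["aligned"], ["calls"])]
peak = peak_storage(tasks, sizes)
```

Comparing this predicted peak against the space actually available is what lets a system refuse, before execution, a workflow that would overflow the filesystem.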


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2017

Highly Accurate and Efficient Data-Driven Methods For Genotype Imputation

Olivia Choudhury; Ankush Chakrabarty; Scott J. Emrich

High-throughput sequencing techniques have generated massive quantities of genotype data. Haplotype phasing has proven to be a useful and effective method for analyzing these data. However, the quality of phasing is undermined due to missing information. Imputation provides an effective means of improving the underlying genotype information. For model organisms, imputation can rely on an available reference genotype panel and a physical or genetic map. For non-model organisms, which often do not have a genotype panel, it is important to design an imputation technique that does not rely on reference data. Here, we present Accurate Data-Driven Imputation Technique (ADDIT), which is composed of two data-driven algorithms capable of handling data generated from model and non-model organisms. The non-model variant of ADDIT (referred to as ADDIT-NM) employs statistical inference methods to impute missing genotypes, whereas the model variant (referred to as ADDIT-M) leverages a supervised learning-based approach for imputation. We demonstrate that both variants of ADDIT are more accurate, faster, and require less memory than leading state-of-the-art imputation tools using model (human) and non-model (maize, apple, and grape) genotype data. Software Availability: The source code of ADDIT and test data sets are available at https://github.com/NDBL/ADDIT.
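One generic nearest-neighbor flavor of data-driven imputation, sketched below, predicts a sample's missing genotype from the sample that agrees with it at the most observed markers. This illustrates the general supervised, reference-free idea only; it is not ADDIT-M or ADDIT-NM, and the genotype matrix is synthetic.

```python
def impute_knn(samples, target_idx, marker_idx):
    """Impute samples[target_idx][marker_idx] from the most similar sample with an observed call."""
    target = samples[target_idx]

    def similarity(other):
        # Count markers where both samples have an observed, matching genotype.
        return sum(1 for a, b in zip(target, other) if a is not None and a == b)

    candidates = [s for i, s in enumerate(samples)
                  if i != target_idx and s[marker_idx] is not None]
    best = max(candidates, key=similarity)
    return best[marker_idx]

samples = [
    ["A", "T", "G", None],   # target: missing at marker 3
    ["A", "T", "G", "C"],    # agrees at all three observed markers
    ["T", "A", "C", "G"],    # agrees at none
]
call = impute_knn(samples, 0, 3)
```

The learning-based variant in the paper effectively generalizes this: the observed markers act as features and the missing genotype as the label to predict.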


cluster computing and the grid | 2014

Expanding Tasks of Logical Workflows Into Independent Workflows for Improved Scalability

Nicholas L. Hazekamp; Olivia Choudhury; Sandra Gesing; Scott J. Emrich; Douglas Thain

Workflow Management Systems, such as Galaxy and Taverna, provide a portal through which data can be processed using a sequence of different tools. This sequence allows for the creation of a logical workflow that describes the process. However, when the data workload becomes large enough, the time spent in each logical step increases, making it difficult to run the workflow quickly and efficiently. The proposed solution is to use task-level expansion. Task expansion takes each step of the logical workflow and expands it into a new self-contained workflow. These workflows allow for greater scalability and concurrency by creating more tasks. The resulting workflows can be used indistinguishably from the original tool, but perform more quickly and efficiently. The concept was applied to the BWA tool in Galaxy, and we observed a 7.36X speedup in runtime on our 32 GB dataset.

Collaboration


Dive into Olivia Choudhury's collaboration.

Top Co-Authors

Douglas Thain, University of Notre Dame
Sandra Gesing, University of Notre Dame
Arpita Konar, University of Notre Dame
Dinesh Rajan, University of Notre Dame
Haiyan Meng, University of Notre Dame