Publication


Featured research published by Raunaq Malhotra.


Journal of Computational Biology | 2013

Mutant-Bin: Unsupervised Haplotype Estimation of Viral Population Diversity Without Reference Genome

Shruthi Prabhakara; Raunaq Malhotra; Raj Acharya; Mary Poss

High genetic variability in viral populations plays an important role in disease progression, pathogenesis, and drug resistance. The last few years have seen significant progress in the development of methods for reconstruction of viral populations using data from next-generation sequencing technologies. These methods identify the differences between individual haplotypes by mapping the short reads to a reference genome. Much less has been published about resolving the population structure when a reference genome is lacking or is not well-defined, which severely limits the application of these new technologies to resolve virus population structure. We describe a computational framework, called Mutant-Bin, for clustering individual haplotypes in a viral population and determining their prevalence based on a set of deep sequencing reads. The main advantages of our method are that: (i) it enables determination of the population structure and haplotype frequencies when a reference genome is lacking; (ii) the method is unsupervised: the number of haplotypes does not have to be specified in advance; and (iii) it identifies the polymorphic sites that co-occur in a subset of haplotypes and the frequency with which they appear in the viral population. The method was evaluated on simulated reads with sequencing errors and 454 pyrosequencing reads from HIV samples. Our method clustered a high percentage of haplotypes with low false-positive rates, even at low genetic diversity.


Computation | 2014

Computational and Statistical Analyses of Insertional Polymorphic Endogenous Retroviruses in a Non-Model Organism

Le Bao; Daniel Elleder; Raunaq Malhotra; Michael DeGiorgio; Theodora Maravegias; Lindsay M. Horvath; Laura Carrel; Colin M. Gillin; Tomáš Hron; Helena Fábryová; David R. Hunter; Mary Poss

Endogenous retroviruses (ERVs) are a class of transposable elements found in all vertebrate genomes that contribute substantially to genomic functional and structural diversity. A host species acquires an ERV when an exogenous retrovirus infects a germ cell of an individual and becomes part of the genome inherited by viable progeny. ERVs that colonized ancestral lineages are fixed in contemporary species. However, in some extant species, ERV colonization is ongoing, which results in variation in ERV frequency in the population. To study the consequences of ERV colonization of a host genome, methods are needed to assign each ERV to a location in a species’ genome and determine which individuals have acquired each ERV by descent. Because well-annotated reference genomes are not available for all species, de novo clustering approaches provide an alternative to reference mapping that is insensitive to differences between query and reference and that is amenable to mobile element studies in both model and non-model organisms. However, there is substantial uncertainty in both identifying ERV genomic position and assigning each unique ERV integration site to individuals in a population. We present an analysis suitable for detecting ERV integration sites in species without the need for a reference genome. Our approach is based on improved de novo clustering methods and statistical models that take the uncertainty of assignment into account and yield a probability matrix of shared ERV integration sites among individuals. We demonstrate that polymorphic integrations of a recently identified endogenous retrovirus in deer reflect contemporary relationships among individuals and populations.


International Conference on Bioinformatics | 2015

A generalized lattice model for clustering metagenomic sequences

Manjari Mukhopadhyay; Raunaq Malhotra; Raj Acharya

Metagenomics involves the analysis of genomes of microorganisms sampled directly from their environment. Next Generation Sequencing (NGS) technologies allow a high-throughput sampling of small segments from genomes in the metagenome to generate a large number of reads. In order to study the properties and relationships of the microorganisms present, clustering of the sampled reads into groups of similar species is important. Clustering can be performed either by mapping the sampled reads to known sequencing databases, though this hinders the discovery of new species; or based on the inherent composition of the sampled reads. We propose a two-dimensional lattice-based probabilistic model for clustering metagenomic datasets. The probability of a species in the metagenome is defined as a lattice model of probabilistic distributions over short sized genomic sequences (or words). The two dimensions denote distributions for different sizes and groups of words respectively. The lattice structure allows for additional support for a node from its neighbors when the probabilistic support for the species in the current node is deemed insufficient. Unlike other popular clustering algorithms such as Scimm, our algorithm guarantees convergence. We test our algorithm on simulated metagenomic data containing bacterial species and observe more than 85% precision. We also evaluate our algorithm on an in vitro-simulated bacterial metagenome and show a better clustering even for short reads and varied abundance. The software and datasets can be downloaded from https://github.com/lattclus/lattice-metage.


BMC Bioinformatics | 2015

Error correction and statistical analyses for intra-host comparisons of feline immunodeficiency virus diversity from high-throughput sequencing data

Yang Liu; Francesca Chiaromonte; Howard A. Ross; Raunaq Malhotra; Daniel Elleder; Mary Poss

Background: Infection with feline immunodeficiency virus (FIV) causes an immunosuppressive disease whose consequences are less severe if cats are co-infected with an attenuated FIV strain (PLV). We use virus diversity measurements, which reflect replication ability and the virus response to various conditions, to test whether diversity of virulent FIV in lymphoid tissues is altered in the presence of PLV. Our data consisted of the 3′ half of the FIV genome from three tissues of animals infected with FIV alone, or with FIV and PLV, sequenced by 454 technology.
Results: Since rare variants dominate virus populations, we had to carefully distinguish sequence variation from errors due to experimental protocols and sequencing. We considered an exponential-normal convolution model used for background correction of microarray data, and modified it to formulate an error correction approach for minor allele frequencies derived from high-throughput sequencing. Similar to accounting for over-dispersion in counts, this accounts for error-inflated variability in frequencies, and quite effectively reproduces empirically observed distributions. After obtaining error-corrected minor allele frequencies, we applied ANalysis Of VAriance (ANOVA) based on a linear mixed model and found that conserved sites and transition frequencies in FIV genes differ among tissues of dual and single infected cats. Furthermore, analysis of minor allele frequencies at individual FIV genome sites revealed 242 sites significantly affected by infection status (dual vs. single) or infection status by tissue interaction. All together, our results demonstrated a decrease in FIV diversity in bone marrow in the presence of PLV. Importantly, these effects were weakened or undetectable when error correction was performed with other approaches (thresholding of minor allele frequencies; probabilistic clustering of reads). We also queried the data for cytidine deaminase activity on the viral genome, which causes an asymmetric increase in G to A substitutions, but found no evidence for this host defense strategy.
Conclusions: Our error correction approach for minor allele frequencies (more sensitive and computationally efficient than other algorithms) and our statistical treatment of variation (ANOVA) were critical for effective use of high-throughput sequencing data in understanding viral diversity. We found that co-infection with PLV shifts FIV diversity from bone marrow to lymph node and spleen.


Pattern Recognition in Bioinformatics | 2013

Estimating viral haplotypes in a population using k-mer counting

Raunaq Malhotra; Shruthi Prabhakara; Mary Poss; Raj Acharya

Viral haplotype estimation in a population is an important problem in virology. Viruses undergo a high number of mutations and recombinations during replication for their survival in host cells and exist as a population of closely related genetic variants. As a result, estimating the number of haplotypes and their relative frequencies in the population is a challenging task. The use of a sequenced reference genome has its limitations because of the high mutation rates in viruses. We propose a method for estimating viral haplotypes based only on the counts of k-mers present in the viral population, without using a reference genome. We compute k-mer pairs that are related to each other by one mutation, and compute a minimal set of viral haplotypes that explain the whole population based on these k-mer pairs. We compare our method to the software ShoRAH (which uses a reference genome) on a simulated dataset and obtain comparable results, even without using a reference genome.
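The first two steps named in this abstract, counting k-mers and finding k-mer pairs that differ by one mutation, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, "one mutation" is assumed here to mean a single substitution (Hamming distance 1), and the step that selects a minimal set of haplotypes is not shown.

```python
from collections import Counter
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def one_mutation_pairs(counts):
    """Return k-mer pairs that differ by exactly one substitution.

    Assumption: a 'mutation' is modelled as a single substitution.
    The all-pairs comparison is quadratic and only suits small examples.
    """
    kmers = sorted(counts)
    return [(a, b) for a, b in combinations(kmers, 2)
            if hamming_distance(a, b) == 1]


# Toy reads from two hypothetical haplotypes that differ at one site.
reads = ["ACGTAC", "ACGTTC"]
counts = count_kmers(reads, 4)
pairs = one_mutation_pairs(counts)
```

In this toy input, the k-mers shared by both reads are counted twice, and the k-mers that cover the variable site appear as one-mutation pairs.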


Pattern Recognition in Bioinformatics | 2012

Predicting V(D)J recombination using conditional random fields

Raunaq Malhotra; Shruthi Prabhakara; Raj Acharya

V(D)J gene segments undergo combinatorial recombination in T-cells and B-cells to provide humans and other vertebrates with the large number of antibodies required for immunity. Each such recombination further undergoes mutations in its DNA sequence so that it can recognize diverse antigens. Predicting the combination of gene segments that formed a particular antibody is an essential task for studying disease propagation and analysis. We propose a model based on conditional random fields (CRFs) for predicting the boundary positions between V-D-J gene segments. We train the CRFs by generating synthetic gene recombinations using all of the alleles of the V, D and J gene segments. The alleles corresponding to a read can be determined by mapping the segmented reads to the DNA sequences of the gene segments using software such as BLAST and usearch. We test our method on a simulated dataset as well as real data from the Stanford_S22 individual.


bioRxiv | 2018

A Computational Framework To Assess Genome-Wide Distribution Of Polymorphic Human Endogenous Retrovirus-K In Human Populations

Weiling Li; Lin Lin; Raunaq Malhotra; Lei Yang; Raj Acharya; Mary Poss

Human Endogenous Retrovirus type K (HERV-K) is the only HERV known to be insertionally polymorphic. It is possible that HERV-Ks contribute to human disease because people differ in both number and genomic location of these retroviruses. Indeed, viral transcripts, proteins, and antibodies against HERV-K are detected in cancers, auto-immune, and neurodegenerative diseases. However, attempts to link a polymorphic HERV-K with any disease have been frustrated in part because population frequency of HERV-K provirus at each site is lacking and it is challenging to identify closely related elements such as HERV-K from short read sequence data. We present an integrated and computationally robust approach that uses whole genome short read data to determine the occupation status at all sites reported to contain a HERV-K provirus. Our method estimates the proportion of fixed length genomic sequence (k-mers) from whole genome sequence data matching a reference set of k-mers unique to each HERV-K locus and applies mixture model-based clustering to account for low depth sequence data. Our analysis of 1000 Genomes Project Data (KGP) reveals numerous differences among the five KGP super-populations in the frequency of individual and co-occurring HERV-K proviruses; we provide a visualization tool to easily depict the prevalence of any combination of HERV-K among KGP populations. Further, the genome burden of polymorphic HERV-K is variable in humans, with East Asian (EAS) individuals having the fewest integration sites. Our study identifies population-specific sequence variation for several HERV-K proviruses. We expect these resources will advance research on HERV-K contributions to human diseases.
Author summary: Human Endogenous Retrovirus type K (HERV-K) is the youngest of the retrovirus families in the human genome and is the only group that is polymorphic; a HERV-K can be present in one individual but absent from others. HERV-Ks could contribute to disease risk, but establishing a link between a polymorphic HERV-K and a specific disease has been difficult. We develop an easy-to-use method that reveals the considerable variation existing among global populations in the frequency of individual and co-occurring polymorphic HERV-K, and in the total number of HERV-K that any individual has in their genome. Our study provides a global reference set of HERV-K genomic diversity and the tools needed to determine the genomic landscape of HERV-K in any patient population.
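The k-mer matching statistic described in this abstract can be sketched as follows: for one locus, compute the fraction of its reference k-mers that are observed in a sample's reads. This is only an illustration under stated assumptions, not the published pipeline. The function names and toy sequences are hypothetical, the reference k-mers are taken directly from a made-up locus sequence rather than filtered for uniqueness, and the mixture model-based clustering step is not shown.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matched_proportion(reference_kmers, reads, k):
    """Fraction of a locus's reference k-mers observed in the reads.

    Assumption: a reference k-mer counts as matched if it occurs at
    least once in any read; sequencing depth is ignored in this sketch.
    """
    if not reference_kmers:
        return 0.0
    observed = set()
    for read in reads:
        observed |= kmers(read, k)
    return len(reference_kmers & observed) / len(reference_kmers)


# Hypothetical locus sequence and sample reads, for illustration only.
locus_kmers = kmers("ACGTACGA", 4)
sample_reads = ["ACGTAC"]
proportion = matched_proportion(locus_kmers, sample_reads, 4)
```

A proportion near 1 would suggest the locus is occupied in the sample and a proportion near 0 that it is absent; intermediate values at low depth are the case the abstract addresses with clustering.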


Computational and Structural Biotechnology Journal | 2017

A random forest classifier for detecting rare variants in NGS data from viral populations

Raunaq Malhotra; Manjari Jha; Mary Poss; Raj Acharya

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying length from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes has 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.


arXiv: Genomics | 2014

Clustering pipeline for determining consensus sequences in targeted next-generation sequencing

Raunaq Malhotra; Daniel Elleder; Le Bao; David R. Hunter; Raj Acharya; Mary Poss


arXiv: Populations and Evolution | 2015

Maximum Likelihood de novo reconstruction of viral populations using paired end sequencing data

Raunaq Malhotra; Manjari Mukhopadhyay; Steven Wu; Allen G. Rodrigo; Mary Poss; Raj Acharya

Collaboration


Dive into Raunaq Malhotra's collaborations.

Top Co-Authors

Raj Acharya
Pennsylvania State University

Mary Poss
Pennsylvania State University

Shruthi Prabhakara
Pennsylvania State University

Daniel Elleder
Academy of Sciences of the Czech Republic

David R. Hunter
Pennsylvania State University

Le Bao
University of Washington

Manjari Jha
Pennsylvania State University

Manjari Mukhopadhyay
Pennsylvania State University

Colin M. Gillin
Oregon Department of Fish and Wildlife