Nam S Vo | Researchain

Publication

Featured researches published by Nam S Vo.

international conference on computational advances in bio and medical sciences | 2014

How genome complexity can explain the hardness of aligning reads to genomes

Vinhthuy Phan; Shanshan Gao; Quang Tran; Nam S Vo

Although it is known that aligning short reads to reference genomes becomes harder if such genomes are embedded with complex repeat structures, there has been little effort to quantify this intuition. We investigated several measures of complexity, employed 10 popular short-read aligners to align reads to a large number of diverse genomes, and found that, unlike existing notions of complexity, a proposed notion of length-sensitive measures correlated highly with the hardness of short-read alignment. This result enables speedy estimation of the hardness of alignment without aligning millions of reads to unknown genomes.

BMC Genomics | 2014

RandAL: a randomized approach to aligning DNA sequences to reference genomes

Nam S Vo; Quang Tran; Nobal B. Niraula; Vinhthuy Phan

Background: The alignment of short reads generated by next-generation sequencers to genomes is an important problem in many biomedical and bioinformatics applications. Although many proposed methods work very well on narrow ranges of read lengths, they tend to suffer in performance and alignment quality for reads outside of these ranges. Results: We introduce RandAL, a novel method that aligns DNA sequences to reference genomes. Our approach utilizes two FM indices to facilitate efficient bidirectional searching, a pruning heuristic to speed up the computing of edit distances, and most importantly, a randomized strategy that enables effective estimation of key parameters. Extensive comparisons showed that RandAL outperformed popular aligners in most instances and was unique in its consistent and accurate performance over a wide range of read lengths and error rates. The software package is publicly available at https://github.com/namsyvo/RandAL. Conclusions: RandAL promises to align effectively and accurately short reads that come from a variety of technologies with different read lengths and rates of sequencing error.
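The abstract mentions a pruning heuristic for edit-distance computation without describing it. As a rough illustration only, the sketch below shows one common way to prune: a threshold-bounded Levenshtein distance that gives up as soon as every cell in the current row exceeds the allowed number of edits. This is a generic technique and is not claimed to be RandAL's heuristic; the function name `bounded_edit_distance` and the parameter `max_dist` are hypothetical.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, max_dist: int) -> Optional[int]:
    """Return the Levenshtein distance between a and b if it is <= max_dist,
    otherwise None. Illustrative pruning only, not RandAL's actual heuristic."""
    # The length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        # The final distance can never be smaller than the minimum of a row,
        # so once the whole row exceeds the threshold we can stop early.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None
```

With a small `max_dist`, most poor candidate locations are rejected after a few rows instead of after filling the whole table.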

BMC Bioinformatics | 2014

Exploiting dependencies of pairwise comparison outcomes to predict patterns of gene response

Nam S Vo; Vinhthuy Phan

Background: The analysis of gene expression has played an important role in medical and bioinformatics research. Although it is known that a large number of samples is needed to determine the patterns of gene expression accurately, practical designs of gene expression studies occasionally have insufficient numbers of samples, making it difficult to ascertain true response patterns of variantly expressed genes. Results: We describe an approach to cope with the challenge of predicting true orders of gene response to treatments. We show that true patterns of gene response must be orderable sets. In experiments with few samples, we modify the conventional pairwise comparison tests and increase the significance level α intelligently to deduce orderable patterns, which are most likely true orders of gene response. Additionally, motivated by the fact that a gene can be involved in multiple biological functions, our method further resamples experimental replicates and predicts multiple response patterns for each gene. Using a gene expression data set of Sprague-Dawley rats treated with chemopreventive chemical compounds and DAVID to annotate and validate gene sets, we showed that compared to the conventional method of fixing α, this method increased enrichment significantly. A comparison with hierarchical clustering showed that gene clusters labeled by response patterns produced by our method were much more enriched. One of the clusters contained 3 transcription factors, which hierarchical clustering failed to place into one cluster, that have been found to participate in multiple biological networks. One of the transcription factors is known to play an important role in pathways affected by the studied chemical compounds. Conclusions: This method can be useful in designing cost-effective experiments with small sample sizes. Patterns of highly-variantly expressed genes can be predicted by varying α intelligently. Furthermore, clusters are labeled meaningfully with patterns that describe precisely how genes in such clusters respond to treatments.
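To make the idea of an orderable pattern concrete, the following sketch checks whether a complete set of pairwise comparison outcomes is consistent with a single ranking of treatments in which ties are allowed. It assumes, as an interpretation not spelled out in the abstract, that "orderable" means exactly this. The function name `is_orderable` and the '<', '>', '=' encoding are hypothetical.

```python
from itertools import combinations


def is_orderable(treatments, outcomes):
    """Check whether pairwise outcomes fit one ranking with ties allowed.

    outcomes maps each unordered pair, given once as (a, b), to '<', '>' or
    '=' ('=' meaning no significant difference). Every pair must be present.
    Assumed interpretation of "orderable", not the authors' definition.
    """
    flipped = {'<': '>', '>': '<', '=': '='}

    def rel(a, b):
        # Look the pair up in either orientation.
        if (a, b) in outcomes:
            return outcomes[(a, b)]
        return flipped[outcomes[(b, a)]]

    # Score each treatment by how many others it is significantly greater than.
    score = {t: sum(1 for u in treatments if u != t and rel(t, u) == '>')
             for t in treatments}

    # In a ranking with ties, comparing scores must reproduce every outcome.
    for a, b in combinations(treatments, 2):
        if score[a] < score[b]:
            expected = '<'
        elif score[a] > score[b]:
            expected = '>'
        else:
            expected = '='
        if rel(a, b) != expected:
            return False
    return True
```

For example, the outcomes A = B, B = C, A < C are not orderable under this reading, because the "no difference" relation fails to be transitive; with few samples such patterns arise easily.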

international symposium on bioinformatics research and applications | 2013

Exploiting Dependencies of Patterns in Gene Expression Analysis Using Pairwise Comparisons

Nam S Vo; Vinhthuy Phan

In using pairwise comparisons to analyze gene expression data, researchers have often treated comparison outcomes independently. We now exploit additional dependencies of comparison outcomes to show that those with a certain property cannot be true patterns of genes’ response to treatments. With this result, we leverage p-values obtained from comparison outcomes to predict true patterns of gene response to treatments. Functional validation of gene lists obtained from our method yielded more and better functional enrichment than those obtained from the conventional approach. Consequently, our method promises to be useful in designing cost-effective experiments with small sample sizes.

Bioinformatics | 2018

Leveraging known genomic variants to improve detection of variants, especially close-by Indels

Nam S Vo; Vinhthuy Phan

Motivation: The detection of genomic variants has great significance in genomics, bioinformatics, biomedical research and its applications. However, despite a lot of effort, Indels and structural variants are still under-characterized compared to SNPs. Current approaches based on next-generation sequencing data usually require large numbers of reads (high coverage) to be able to detect such types of variants accurately. However, Indels, especially those close to each other, are still hard to detect accurately. Results: We introduce a novel approach that leverages known variant information, e.g. provided by dbSNP, dbVar, ExAC or the 1000 Genomes Project, to improve sensitivity of detecting variants, especially close-by Indels. In our approach, the standard reference genome and the known variants are combined to build a meta-reference, which is expected to be probabilistically closer to the subject genomes than the standard reference. An alignment algorithm, which can take into account known variant information, is developed to accurately align reads to the meta-reference. This strategy resulted in accurate alignment and variant calling even with low coverage data. We showed that compared to popular methods such as GATK and SAMtools, our method significantly improves the sensitivity of detecting variants, especially Indels that are close to each other. In particular, our method was able to call these close-by Indels at a 15-20% higher sensitivity than other methods at low coverage, and still achieved 1-5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data with variant profiles extracted from the 1000 Genomes Project data, and real data from the Illumina Platinum Genomes Project and ExAC database. Our findings suggest that by incorporating known variant information in an appropriate manner, sensitive variant calling is possible at a low cost.
Availability and implementation: Implementation can be found in our public code repository https://github.com/namsyvo/IVC. Supplementary information: Supplementary data are available at Bioinformatics online.

BMC Bioinformatics | 2015

How genome complexity can explain the difficulty of aligning reads to genomes

Vinhthuy Phan; Shanshan Gao; Quang Tran; Nam S Vo

Background: Although it is frequently observed that aligning short reads to genomes becomes harder if they contain complex repeat patterns, there has not been much effort to quantify the relationship between complexity of genomes and difficulty of short-read alignment. Existing measures of sequence complexity seem unsuitable for the understanding and quantification of this relationship. Results: We investigated several measures of complexity and found that length-sensitive measures of complexity had the highest correlation to accuracy of alignment. In particular, the rate of distinct substrings of length k, where k is similar to the read length, correlated very highly to alignment performance in terms of precision and recall. We showed how to compute this measure efficiently in linear time, making it useful in practice to estimate quickly the difficulty of alignment for new genomes without having to align reads to them first. We showed how the length-sensitive measures could provide additional information for choosing aligners that would align consistently accurately on new genomes. Conclusions: We formally established a connection between genome complexity and the accuracy of short-read aligners. The relationship between genome complexity and alignment accuracy provides additional useful information for selecting suitable aligners for new genomes. Further, this work suggests that the complexity of genomes sometimes should be thought of in terms of specific computational problems, such as the alignment of short reads to genomes.
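The rate of distinct substrings of length k can be illustrated with a few lines of code. The sketch below assumes the rate is the number of distinct length-k substrings divided by the total number of length-k windows; the exact normalization used in the paper is not given in the abstract. It also uses a hash set for simplicity rather than the linear-time computation the authors describe. The function name `distinct_kmer_rate` is hypothetical.

```python
def distinct_kmer_rate(genome: str, k: int) -> float:
    """Fraction of length-k windows in `genome` that are distinct substrings.

    Assumed definition: (#distinct k-mers) / (len(genome) - k + 1).
    Simple set-based sketch, not the linear-time method from the paper.
    """
    n = len(genome)
    if k <= 0 or k > n:
        raise ValueError("k must be between 1 and len(genome)")
    total = n - k + 1  # number of length-k windows
    distinct = len({genome[i:i + k] for i in range(total)})
    return distinct / total
```

A highly repetitive sequence such as "AAAAAA" gives a low rate (0.25 for k = 3), while a sequence with no repeated k-mers gives 1.0. Under the abstract's finding, a lower rate would suggest a genome that is harder to align reads to.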

BMC Bioinformatics | 2015

Improving variant calling by incorporating known genetic variants into read alignment

Nam S Vo; Vinhthuy Phan

Background: The identification of genetic variants has great significance in genetic research. To call variants using next-generation sequencing data, current methods rely primarily on mapped reads produced by a separate read aligner without taking into account existing genetic variants [1]. Thus, these methods usually require a large number of reads (high coverage) to be able to detect variants accurately [2]. Moreover, the separation of read alignment and variant calling results in a workflow that is complex and involves many separate steps and different tools [3].

BMC Bioinformatics | 2015

A linear model for predicting performance of short-read aligners using genome complexity

Quang Tran; Shanshan Gao; Nam S Vo; Vinhthuy Phan

Background: The effectiveness and accuracy of aligning short reads to genomes have an important impact on many applications that rely on next-generation sequencing data. The computational requirements and material costs of aligning short reads to genomes at large scale are also high. To avoid wasting time and resources on aligning short reads, we investigated which measures of genome complexity [1] correlated best with alignment performance and proposed a linear model for each aligner [2].

BMC Bioinformatics | 2014

An integrated approach for SNP calling based on population of genomes

Nam S Vo; Quang Tran; Vinhthuy Phan

Background: The identification of genetic variants such as single nucleotide polymorphisms (SNPs) is a critical step in many applications based on NGS technologies [1]. Although many SNP calling programs have been developed, it is still challenging to accurately call SNPs, especially when coverage level is low [2]. Moreover, the determination of SNPs, which is performed through many separate steps, requires a careful selection of a diverse set of tools [3,4]. This can lead to several disadvantages; for example, one cannot incorporate information from the read alignment step into the SNP calling step or vice versa to help improve accuracy of called SNPs. Materials and methods: We propose a novel integrated approach to detect more true SNPs while calling fewer false positives. Different from current methods that perform read alignment and SNP calling steps separately, our method combines them methodologically to improve the accuracy of SNP identification. To effectively exploit information from a population of genomes, databases of confirmed SNPs, such as dbSNP, are employed both in aligning reads to references and in calling SNPs. This strategy allows us to develop a novel algorithm to align reads to references that can differentiate sequencing errors from SNPs. Results: Based on this result, the method can call SNPs accurately and effectively even with low-coverage sequencing data. Our results on simulated data show that the method is able to call SNPs with very high precision and recall rate with low-coverage datasets. Conclusions: With the existence of databases of confirmed SNPs for a large number of sequenced species, our approach provides a promising method to call accurate SNP information even with low-coverage sequencing data. This approach can also help researchers facilitate the determination of SNPs by using an integrated SNP calling tool.

BMC Bioinformatics | 2014

Exploiting the bootstrap method to analyze patterns of gene expression

Nam S Vo; Vinhthuy Phan

Background: High-throughput technologies like microarrays or the recent RNA-Seq provide large amounts of data for gene expression studies. Although there have been diverse methods to design gene-expression experiments and analyze gene-expression data, the prediction of true patterns of gene expression when few samples are available remains a challenging problem [1,2]. Materials and methods: We propose a method to predict response patterns of gene expression studies in the case of small sample size using a bootstrap method [3]. Our approach adopts partially ordered sets (posets) to represent gene patterns, which are determined based on pairwise comparisons [4]. Results: We show that patterns that are not linearly orderable cannot be true patterns of gene response to treatments. From this result, we propose a strategy using bootstrap resampling to infer true responses of non-linearly-orderable patterns. Our experiments showed that this method produced gene lists with more biological functional enrichment than those obtained without bootstrap resampling. Conclusions: Our method is useful in designing cost-effective experiments with small sample sizes. Researchers can still use a small sample size to determine true patterns for most genes. For highly-variantly expressed genes, their true patterns can be identified using the proposed method.
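As a rough illustration of bootstrap resampling on a small number of replicates, the sketch below resamples two treatment groups with replacement and reports how often one group's mean falls below the other's. It is a generic bootstrap built on a simple comparison of means, not the statistical test or inference procedure used by the authors; the function name `bootstrap_order_support` and its parameters are hypothetical.

```python
import random
from statistics import mean


def bootstrap_order_support(group_a, group_b, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples in which mean(group_a) < mean(group_b).

    Generic illustration of resampling replicates with replacement; the
    comparison of means stands in for whatever test a real analysis uses.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    hits = 0
    for _ in range(n_resamples):
        resampled_a = rng.choices(group_a, k=len(group_a))
        resampled_b = rng.choices(group_b, k=len(group_b))
        if mean(resampled_a) < mean(resampled_b):
            hits += 1
    return hits / n_resamples
```

A support value near 1 or 0 suggests a stable ordering between the two treatments, while intermediate values flag comparisons whose outcome depends on which replicates happen to be drawn.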
