Minh Duc Cao | Researchain

Publication

Featured researches published by Minh Duc Cao.

Data Compression Conference | 2007

A Simple Statistical Algorithm for Biological Sequence Compression

Minh Duc Cao; Trevor I. Dix; Lloyd Allison; Chris Mears

This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
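The idea of combining a panel of experts and reading off an information sequence can be illustrated with a minimal Python sketch. The two toy experts, the Bayesian re-weighting rule, and all function names below are assumptions made for illustration; the abstract does not specify how the paper's experts are built or blended, and no arithmetic coder is included.

```python
import math

ALPHABET = "ACGT"

def blend_experts(expert_dists, weights):
    """Weighted average of the experts' distributions over ALPHABET."""
    total = sum(weights)
    return {s: sum(w * d[s] for w, d in zip(weights, expert_dists)) / total
            for s in ALPHABET}

def information_sequence(seq, experts):
    """Per-symbol code length in bits under a blended panel of experts.

    Each expert is a function mapping the already-seen prefix to a
    probability distribution (dict) over ALPHABET.
    """
    weights = [1.0] * len(experts)
    bits = []
    for i, sym in enumerate(seq):
        dists = [expert(seq[:i]) for expert in experts]
        p = blend_experts(dists, weights)
        bits.append(-math.log2(p[sym]))
        # Assumed update rule: scale each weight by how well the expert
        # predicted the symbol that actually occurred, then renormalise.
        weights = [w * d[sym] for w, d in zip(weights, dists)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return bits

def uniform_expert(prefix):
    """Hypothetical expert: every nucleotide is equally likely."""
    return {s: 0.25 for s in ALPHABET}

def order0_expert(prefix):
    """Hypothetical expert: Laplace-smoothed symbol frequencies so far."""
    n = len(prefix)
    return {s: (prefix.count(s) + 1) / (n + 4) for s in ALPHABET}

if __name__ == "__main__":
    bits = information_sequence("AAAAAAAA", [uniform_expert, order0_expert])
    print([round(b, 2) for b in bits])
```

On a repetitive input the per-symbol cost falls below the 2 bits of a uniform code, which is the kind of signal an information sequence exposes.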

Nucleic Acids Research | 2014

Inferring short tandem repeat variation from paired-end short reads

Minh Duc Cao; Edward Tasker; Kai Willadsen; Michael Imelfort; Sailaja Vishwanathan; Sridevi Sureshkumar; Sureshkumar Balasubramanian; Mikael Bodén

The advances of high-throughput sequencing offer an unprecedented opportunity to study genetic variation. This is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths of relevance to a range of phenotypes. We demonstrate the method’s ability to detect and quantify changes in repeat lengths from short read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that our method compares favourably against existing methods. Using this method, we have identified all repeats across the genome that are likely to be polymorphic. In addition, our predicted polymorphic repeats included the only known repeat expansion in A. thaliana, suggesting an ability to discover potential unstable repeats.
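How a deviation in fragment size can be turned into a repeat-length estimate is sketched below in Python. This is not the paper's model: it assumes a Gaussian insert-size distribution with known mean and standard deviation, a flat prior over a grid of candidate length changes, and read pairs that span the repeat. All names and numbers are hypothetical.

```python
import math

def posterior_length_change(mapped_sizes, mu, sigma, candidates):
    """Posterior over the repeat-length change d (in bp), flat prior.

    If the sample carries d extra bases relative to the reference, a read
    pair spanning the repeat maps d bases closer together than its true
    fragment size, so each mapped size x is modelled as x + d ~ N(mu, sigma).
    """
    log_post = []
    for d in candidates:
        log_lik = sum(-0.5 * ((x + d - mu) / sigma) ** 2 for x in mapped_sizes)
        log_post.append(log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    z = sum(weights)
    return {d: w / z for d, w in zip(candidates, weights)}

if __name__ == "__main__":
    # Hypothetical library: mean insert size 500 bp, standard deviation 20 bp.
    observed = [470, 480, 460, 475, 465]
    post = posterior_length_change(observed, 500, 20, list(range(-60, 61, 3)))
    best = max(post, key=post.get)
    print(best, round(post[best], 3))
```

The more spanning read pairs there are, the narrower the posterior becomes, which is why fragment-size evidence can resolve repeats longer than a single read.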

Nature Communications | 2017

Scaffolding and completing genome assemblies in real-time with nanopore sequencing

Minh Duc Cao; Son Hoang Nguyen; Devika Ganesamoorthy; Alysha G. Elliott; Matthew A. Cooper; Lachlan Coin

Third generation sequencing technologies provide the opportunity to improve genome assemblies by generating long reads spanning most repeat sequences. However, current analysis methods require substantial amounts of sequence data and computational resources to overcome the high error rates. Furthermore, they can only perform analysis after sequencing has completed, resulting in either over-sequencing, or in a low quality assembly due to under-sequencing. Here we present npScarf, which can scaffold and complete short read assemblies while the long read sequencing run is in progress. It reports assembly metrics in real-time so the sequencing run can be terminated once an assembly of sufficient quality is obtained. In assembling four bacterial and one eukaryotic genomes, we show that npScarf can construct more complete and accurate assemblies while requiring less sequencing data and computational resources than existing methods. Our approach offers a time- and resource-effective strategy for completing short read assemblies.
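The stopping rule described above, ending the run once the assembly is good enough, can be sketched as follows. The abstract does not say which metrics npScarf reports; contig count and N50 are assumed here as common assembly metrics, and the function names and thresholds are hypothetical rather than part of npScarf.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def good_enough(contig_lengths, max_contigs, min_n50):
    """Assumed quality target: few contigs and a large N50."""
    return len(contig_lengths) <= max_contigs and n50(contig_lengths) >= min_n50

def first_stopping_point(snapshots, max_contigs, min_n50):
    """Index of the first snapshot that meets the target, or None."""
    for i, lengths in enumerate(snapshots):
        if good_enough(lengths, max_contigs, min_n50):
            return i
    return None

if __name__ == "__main__":
    # Hypothetical contig lengths as long reads progressively join contigs.
    snapshots = [[100, 80, 50, 30, 20], [180, 50, 30, 20], [280]]
    print(first_stopping_point(snapshots, max_contigs=1, min_n50=250))
```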

GigaScience | 2016

Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing

Minh Duc Cao; Devika Ganesamoorthy; Alysha G. Elliott; Huihui Zhang; Matthew A. Cooper; Lachlan Coin

The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This has great potential to shorten the sample-to-results time and is likely to have benefits such as rapid diagnosis of bacterial infection and identification of drug resistance. However, there are few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, strain typing and antibiotic resistance profile identification. Using four culture isolate samples, as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 min of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 h. While strain identification with multi-locus sequence typing required more than 15x coverage to generate confident assignments, our novel gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.
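A streaming estimate of species proportions that tightens as reads arrive might look like the Python sketch below. It is an illustration only: the class name, the use of a Wilson score interval, and the assumption that each read has already been assigned to one species are choices made here, not details taken from the paper.

```python
import math
from collections import Counter

class StreamingSpeciesTyper:
    """Running species proportions with Wilson score intervals."""

    def __init__(self, z=1.96):
        self.counts = Counter()
        self.total = 0
        self.z = z  # 1.96 gives an approximate 95% interval

    def add_read(self, species):
        """Record one read that was assigned to `species` upstream."""
        self.counts[species] += 1
        self.total += 1

    def interval(self, species):
        """Return (lower, upper) bounds on the proportion of `species`."""
        n = self.total
        if n == 0:
            return (0.0, 1.0)
        p = self.counts[species] / n
        z2 = self.z ** 2
        denom = 1 + z2 / n
        centre = (p + z2 / (2 * n)) / denom
        half = self.z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
        return (max(0.0, centre - half), min(1.0, centre + half))

if __name__ == "__main__":
    typer = StreamingSpeciesTyper()
    # Hypothetical stream of 500 reads, 80% assigned to one species.
    for i in range(500):
        typer.add_read("K. pneumoniae" if i % 5 else "E. coli")
    print(typer.interval("K. pneumoniae"))
```

A pipeline could report a species once the interval width drops below a chosen threshold, so the answer arrives as soon as the data support it instead of at the end of the run.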

Bioinformatics | 2016

Realtime analysis and visualization of MinION sequencing data with npReader

Minh Duc Cao; Devika Ganesamoorthy; Matthew A. Cooper; Lachlan Coin

Motivation: The recently released Oxford Nanopore MinION sequencing platform presents many innovative features, opening up potential for a range of applications not previously possible. Among these features, the ability to sequence in real-time provides a unique opportunity for many time-critical applications. While many software packages have been developed to analyze its data, there is still a lack of toolkits that support the streaming and real-time analysis of MinION sequencing data. Results: We developed npReader, an open-source software package to facilitate real-time analysis of MinION sequencing data. npReader can simultaneously extract sequence reads and stream them to downstream analysis pipelines while the samples are being sequenced on the MinION device. It provides a command line interface for easy integration into a bioinformatics workflow, as well as a graphical user interface which concurrently displays the statistics of the run. It also provides an application programming interface for development of streaming algorithms in order to fully utilize the extent of nanopore sequencing potential. Availability and implementation: npReader is written in Java and is freely available at https://github.com/mdcao/npReader.

BMC Bioinformatics | 2010

A genome alignment algorithm based on compression

Minh Duc Cao; Trevor I. Dix; Lloyd Allison

Background: Traditional genome alignment methods consider sequence alignment as a variation of the string edit distance problem, and perform alignment by matching characters of the two sequences. They are often computationally expensive and unable to deal with low information regions. Furthermore, they lack a well-principled objective function to measure the performance of sets of parameters. Since genomic sequences carry genetic information, this article proposes that the information content of each nucleotide in a position should be considered in sequence alignment. An information-theoretic approach for pairwise genome local alignment, namely XMAligner, is presented. Instead of comparing sequences at the character level, XMAligner considers a pair of nucleotides from two sequences to be related if their mutual information in context is significant. The information content of nucleotides in sequences is measured by a lossless compression technique. Results: Experiments on both simulated data and real data show that XMAligner is superior to conventional methods, especially on distantly related sequences and statistically biased data. XMAligner can align sequences of eukaryote genome size with only a modest hardware requirement. Importantly, the method has an objective function which can obviate the need to choose parameter values for high quality alignment. The alignment results from XMAligner can be integrated into a visualisation tool for viewing purposes. Conclusions: The information-theoretic approach for sequence alignment is shown to overcome the mentioned problems of conventional character matching alignment methods. The article shows that, as genomic sequences are meant to carry information, considering the information content of nucleotides is helpful for genomic sequence alignment. Availability: Downloadable binaries, documentation and data can be found at ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/XMAligner/.

Briefings in Bioinformatics | 2015

Sequencing technologies and tools for short tandem repeat variation detection

Minh Duc Cao; Sureshkumar Balasubramanian; Mikael Bodén

Short tandem repeats are highly polymorphic and associated with a wide range of phenotypic variation, some of which cause neurodegenerative disease in humans. With advances in high-throughput sequencing technologies, there are novel opportunities to study genetic variation. While available sequencing technologies and bioinformatics tools provide options for mining high-throughput sequencing data, their suitability for analysis of repeat variation is an open question, with tools for quantifying variability in repetitive sequence still in their infancy. We present here a comprehensive survey and empirical evaluation of current sequencing technologies and bioinformatics tools in all stages of an analysis pipeline. While there is not one optimal pipeline to suit all circumstances, we find that the choice of alignment and repeat genotyping tools greatly impacts the accuracy and efficiency by which short tandem repeat variation can be detected. We further note that to detect variation relevant to many repeat diseases, it is essential to choose technologies that offer either long read-lengths or paired-end sequencing, coupled with specific genotyping tools.

Knowledge Discovery and Data Mining | 2009

Computing Substitution Matrices for Genomic Comparative Analysis

Minh Duc Cao; Trevor I. Dix; Lloyd Allison

Substitution matrices describe the rates of mutating one character in a biological sequence to another character, and are important for many knowledge discovery tasks such as phylogenetic analysis and sequence alignment. Computing substitution matrices for very long genomic sequences of divergent or even unrelated species requires sensitive algorithms that can take into account differences in composition of the sequences. We present a novel algorithm that addresses this by computing a nucleotide substitution matrix specifically for the two genomes being aligned. The method is founded on information theory and in the expectation maximisation framework. The algorithm iteratively uses compression to align the sequences and estimates the matrix from the alignment, and then applies the matrix to find a better alignment until convergence. Our method reconstructs, with high accuracy, the substitution matrix for synthesised data generated from a known matrix with introduced noise. The model is then successfully applied to real data for various malaria parasite genomes, which have differing phylogenetic distances and composition that lessens the effectiveness of standard statistical analysis techniques.
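The matrix-estimation half of the iteration described above can be sketched in a few lines of Python. The sketch assumes an ungapped pairwise alignment is already available and uses a simple pseudocount. The compression-based alignment step and the convergence loop are not reproduced, and the function name is hypothetical.

```python
def estimate_substitution_matrix(seq_a, seq_b, pseudocount=1.0):
    """Row-normalised substitution probabilities P(y | x) from aligned columns.

    seq_a and seq_b are equal-length aligned sequences; columns containing
    symbols outside ACGT (gaps, N) are skipped.
    """
    alphabet = "ACGT"
    counts = {x: {y: pseudocount for y in alphabet} for x in alphabet}
    for x, y in zip(seq_a, seq_b):
        if x in counts and y in counts:
            counts[x][y] += 1
    return {x: {y: c / sum(row.values()) for y, c in row.items()}
            for x, row in counts.items()}

if __name__ == "__main__":
    matrix = estimate_substitution_matrix("AAAACCCC", "AAATCCCC")
    for x, row in matrix.items():
        print(x, {y: round(p, 3) for y, p in row.items()})
```

In a full iteration, this estimate would feed back into the alignment step, and the two steps would repeat until the matrix stops changing.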

bioRxiv | 2015

Real-time strain typing and analysis of antibiotic resistance potential using Nanopore MinION sequencing

Minh Duc Cao; Devika Ganesamoorthy; Alysha G. Elliott; Huihui Zhang; Matthew A. Cooper; Lachlan Coin

Clinical pathogen sequencing has significant potential to drive informed treatment of patients with unknown bacterial infection. However, the lack of rapid sequencing technologies with concomitant analysis has impeded clinical adoption in infection diagnosis. Here we demonstrate that commercially-available Nanopore sequencing devices can identify bacterial species and strain information with less than one hour of sequencing time, initial drug-resistance profiles within 2 hours, and a complete resistance profile within 12 hours. We anticipate these devices and associated analysis methods may become useful clinical tools to guide appropriate therapy in time-critical clinical presentations such as bacteraemia and sepsis.

The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real-time. This opens immense potential to shorten the sample-to-results time and is likely to lead to enormous benefits in rapid diagnosis of bacterial infection and identification of drug resistance. However, there are very few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, multi-locus strain typing, gene presence strain-typing and antibiotic resistance profile identification. Using three culture isolate samples as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 minutes of sequencing and using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 hours. Multi-locus strain typing required more than 15x coverage to generate confident assignments, whereas gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.

GigaScience | 2018

Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning

Haotian Teng; Minh Duc Cao; Michael B Hall; Tania Duarte; Sheng Wang; Lachlan Coin

Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology that offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling and directly translate the raw signal to DNA sequence without the error-prone segmentation step. Trained with only a small set of 4,000 reads, we show that our model provides state-of-the-art basecalling accuracy, even on previously unseen species. Chiron achieves basecalling speeds of more than 2,000 bases per second using desktop computer graphics processing units.
