Chi-Man Liu | Researchain

Publication

Featured research published by Chi-Man Liu.

Bioinformatics | 2015

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Dinghua Li; Chi-Man Liu; Ruibang Luo; Kunihiko Sadakane; Tak Wah Lam

MEGAHIT is an NGS de novo assembler for assembling large and complex metagenomics data in a time- and cost-efficient manner. It finished assembling a soil metagenomics dataset of 252 Gbp in 44.1 h and 99.6 h on a single computing node with and without a graphics processing unit, respectively. MEGAHIT assembles the data as a whole, i.e. no pre-processing such as partitioning or normalization is needed. When compared with previous methods on assembling the soil data, MEGAHIT generated a threefold larger assembly, with longer contig N50 and average contig length; furthermore, 55.8% of the reads were aligned to the assembly, giving a fourfold improvement.
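The abstract compares assemblies by contig N50 and average contig length. As a minimal sketch of how these two statistics are commonly computed (the function names and the example contig lengths are illustrative assumptions, not taken from MEGAHIT):

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def mean_length(lengths):
    """Return the average contig length, or 0.0 for an empty assembly."""
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical contig lengths, for illustration only.
contigs = [2, 3, 4, 5, 6]
print(n50(contigs), mean_length(contigs))
```

A longer N50 means that more of the assembly sits in long contigs, which is why it is reported alongside total assembly size.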

Bioinformatics | 2012

SOAP3: ultra-fast GPU-based parallel alignment tool for short reads

Chi-Man Liu; Thomas K. F. Wong; Edward Wu; Ruibang Luo; Siu-Ming Yiu; Yingrui Li; Bingqiang Wang; Chang Yu; Xiaowen Chu; Kaiyong Zhao; Ruiqiang Li; Tak Wah Lam

SOAP3 is the first short read alignment tool that leverages the multi-processors in a graphics processing unit (GPU) to achieve a drastic improvement in speed. We adapted the compressed full-text index (BWT) used by SOAP2 in view of the advantages and disadvantages of the GPU. When tested with millions of Illumina HiSeq 2000 length-100 bp reads, SOAP3 takes < 30 s to align a million read pairs onto the human reference genome and is at least 7.5 and 20 times faster than BWA and Bowtie, respectively. For aligning reads with up to four mismatches, SOAP3 aligns slightly more reads than BWA and Bowtie; this is because SOAP3, unlike BWA and Bowtie, is not heuristic-based and always reports all answers.
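The abstract mentions a BWT-based compressed full-text index. The following is a textbook sketch of exact-match counting by backward search on a BWT, written for readability on a tiny string; it is an assumption-laden illustration and not SOAP3's GPU-adapted index (the names `build_index` and `count_occurrences` are hypothetical, and the naive construction would not scale to a genome).

```python
from collections import Counter


def build_index(text):
    """Build a naive BWT index: (first-column offsets, occurrence table, length)."""
    text += "$"  # unique sentinel, smaller than A/C/G/T in ASCII
    # Sorting rotations is equivalent to sorting suffixes because "$" is unique.
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in order)

    counts = Counter(text)
    first = {}  # first[c] = number of characters in text smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact occurrences of pattern using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("ACGTACGT")
print(count_occurrences("ACG", index))
```

Each pattern character narrows the interval with two table lookups, so the search cost depends on the read length rather than the reference length.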

PLOS ONE | 2013

SOAP3-dp: Fast, Accurate and Sensitive GPU-Based Short Read Aligner

Ruibang Luo; Thomas K. F. Wong; Jianqiao Zhu; Chi-Man Liu; Xiaoqian Zhu; Edward Wu; Lap-Kei Lee; Haoxiang Lin; Wenjuan Zhu; David W. Cheung; Hing-Fung Ting; Siu-Ming Yiu; Shaoliang Peng; Chang Yu; Yingrui Li; Ruiqiang Li; Tak Wah Lam

To tackle the exponentially increasing throughput of Next-Generation Sequencing (NGS), most of the existing short-read aligners can be configured to favor speed at the expense of accuracy and sensitivity. SOAP3-dp, through leveraging the computational power of both CPU and GPU with optimized algorithms, delivers high speed and sensitivity simultaneously. Compared with widely adopted aligners including BWA, Bowtie2, SeqAlto, CUSHAW2, GEM and GPU-based aligners BarraCUDA and CUSHAW, SOAP3-dp was found to be two to tens of times faster, while maintaining the highest sensitivity and lowest false discovery rate (FDR) on Illumina reads with different lengths. Transcending its predecessor SOAP3, which does not allow gapped alignment, SOAP3-dp by default tolerates alignment similarity as low as 60%. Real data evaluation using the human genome demonstrates SOAP3-dp's power to enable more authentic variants and longer Indels to be discovered. Fosmid sequencing shows a 9.1% FDR on newly discovered deletions. SOAP3-dp natively supports the BAM file format and provides the same scoring scheme as BWA, which enables it to be integrated into existing analysis pipelines. SOAP3-dp has been deployed on Amazon-EC2, NIH-Biowulf and Tianhe-1A.

Methods | 2016

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Dinghua Li; Ruibang Luo; Chi-Man Liu; Chi-Ming Leung; Hing-Fung Ting; Kunihiko Sadakane; Hiroshi Yamashita; Tak Wah Lam

The study of metagenomics has benefited greatly from low-cost and high-throughput sequencing technologies, yet the tremendous amount of data generated makes analyses such as de novo assembly consume too many computational resources. In late 2014 we released MEGAHIT v0.1 (together with a brief note of Li et al. (2015) [1]), which is the first NGS metagenome assembler that can assemble genome sequences from metagenomic datasets of hundreds of giga base-pairs (Gbp) in a time- and memory-efficient manner on a single server. The core of MEGAHIT is an efficient parallel algorithm for constructing succinct de Bruijn graphs (SdBG), implemented on a graphics processing unit (GPU). The software has been well received by the assembly community, and there is interest in how to adapt the algorithms to integrate popular assembly practices so as to improve the assembly quality, as well as how to speed up the software using better CPU-based algorithms (instead of GPU). In this paper we first describe the details of the core algorithms in MEGAHIT v0.1, and then we show the new modules that upgrade MEGAHIT to version v1.0, which gives better assembly quality, runs faster and uses less memory. For the Iowa Prairie Soil dataset (252 Gbp after quality trimming), the assembly quality of MEGAHIT v1.0, when compared with v0.1, shows a significant improvement, namely, a 36% increase in assembly size and 23% in N50. More interestingly, MEGAHIT v1.0 is no slower than before (even running with the extra modules). This is primarily due to a new CPU-based algorithm for SdBG construction that is faster and requires less memory. Using CPU only, MEGAHIT v1.0 can assemble the Iowa Prairie Soil sample in about 43 h, reducing the running time of v0.1 by at least 25% and memory usage by up to 50%. MEGAHIT v1.0, exhibiting a smaller memory footprint, can process even larger datasets. The Kansas Prairie Soil sample (484 Gbp), the largest publicly available dataset, can now be assembled using no more than 500 GB of memory in 7.5 days. The assemblies of these datasets (and other large metagenomic datasets), as well as the software, are available at the website https://hku-bal.github.io/megabox.
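The abstract describes MEGAHIT's core as the construction of succinct de Bruijn graphs. To show only the underlying graph idea, here is a minimal sketch of a plain (non-succinct, in-memory) de Bruijn graph built from reads; the function name, the toy reads and the choice of k are assumptions for illustration, and none of this reflects MEGAHIT's actual data structure or parallel algorithm.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads, for illustration only.
graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

A succinct representation encodes the same edge set in far less memory, which is what makes datasets of hundreds of Gbp feasible on a single server.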

PeerJ | 2014

BALSA: integrated secondary analysis for whole-genome and whole-exome sequencing, accelerated by GPU

Ruibang Luo; Yiu-Lun Wong; Wai-Chun Law; Lap-Kei Lee; Jeanno Cheung; Chi-Man Liu; Tak Wah Lam

This paper reports an integrated solution, called BALSA, for the secondary analysis of next generation sequencing data; it exploits the computational power of the GPU and intricate memory management to give a fast and accurate analysis. From raw reads to variants (including SNPs and Indels), BALSA, using just a single computing node with a commodity GPU board, takes 5.5 h to process 50-fold whole genome sequencing (∼750 million 100 bp paired-end reads), or just 25 min for 210-fold whole exome sequencing. BALSA's speed is rooted in its parallel algorithms, which effectively exploit a GPU to speed up processes like alignment, realignment and statistical testing. BALSA incorporates a 16-genotype model to support the calling of SNPs and Indels and achieves competitive variant calling accuracy and sensitivity when compared to the ensemble of six popular variant callers. BALSA also supports efficient identification of somatic SNVs and CNVs; experiments showed that BALSA recovers all the previously validated somatic SNVs and CNVs, and it is more sensitive for somatic Indel detection. BALSA outputs variants in VCF format. A pileup-like SNAPSHOT format, while maintaining the same fidelity as BAM in variant calling, enables efficient storage and indexing, and facilitates the development of apps for downstream analyses. BALSA is available at: http://sourceforge.net/p/balsa.

International Colloquium on Automata, Languages and Programming | 2011

Sleep management on multiple machines for energy and flow time

Sze-Hang Chan; Tak Wah Lam; Lap-Kei Lee; Chi-Man Liu; Hing-Fung Ting

In large data centers, determining the right number of operating machines is often non-trivial, especially when the workload is unpredictable. Using too many machines would waste energy, while using too few would affect the performance. This paper extends the traditional study of online flow-time scheduling on multiple machines to take sleep management and energy into consideration. Specifically, we study online algorithms that can determine dynamically when and which subset of machines should wake up (or sleep), and how jobs are dispatched and scheduled. We consider schedules whose objective is to minimize the sum of flow time and energy, and obtain O(1)-competitive algorithms for two settings: one assumes machines running at a fixed speed, and the other allows dynamic speed scaling to further optimize energy usage. Like the previous work on the tradeoff between flow time and energy, the analysis of our algorithms is based on potential functions. What is new here is that the online and offline algorithms may use different subsets of machines at different times, and we need a more general potential analysis that can consider different match-ups of machines.

Asia-Pacific Bioinformatics Conference | 2005

A more accurate and efficient whole genome phylogeny

P. Y. Chan; Tak Wah Lam; Siu-Ming Yiu; Chi-Man Liu

To reconstruct a phylogeny for a given set of species, most of the previous approaches are based on the similarity information derived from a subset of conserved regions (or genes) in the corresponding genomes. In some cases, the regions chosen may not reflect the evolutionary history of the species and may be too restricted to differentiate the species. It is generally believed that the inference could be more accurate if whole genomes are considered. The best existing solution that makes use of complete genomes was proposed by Henz et al. [13]. They can construct a phylogeny for 91 prokaryotic genomes in 170 CPU hours with an accuracy of about 70% (based on the measurement of non-trivial splits), while other approaches that use whole genomes can only deal with no more than 20 species. Note that Henz et al. measure the distance between the species using BLASTN, which is not primarily designed for whole genome alignment. Also, their approach is not scalable; for example, it probably takes over 1000 CPU hours to construct a phylogeny for all 230 prokaryotic genomes published by NCBI. In addition, we found that the measurement of non-trivial splits is only a rough indicator of the accuracy of the phylogeny. In this paper, we propose the following. (1) To evaluate the quality of a phylogeny with respect to a model answer, we suggest using the concept of the maximum agreement subtree, as it can capture the structure of the phylogeny. (2) We propose to use whole genome alignment software (such as MUMmer) to measure the distances between the species and derive an efficient approach to generate these distances. From the experiments on real data sets, we found that our approach is more accurate and more scalable than Henz et al.'s approach. We can construct a phylogenetic tree for the same set of 91 genomes with an accuracy more than 20% higher (with respect to both evaluation measures) in 2 CPU hours (more than 80 times faster than their approach). Also, our approach is scalable and can construct a phylogeny for 230 prokaryotic genomes with accuracy as high as 85% in only 9.5 CPU hours.

Workshop on Approximation and Online Algorithms | 2010

Online tracking of the dominance relationship of distributed multi-dimensional data

Tak Wah Lam; Chi-Man Liu; Hing-Fung Ting

We consider the online problem for a root (or a coordinator) to maintain a set of filters for the purpose of keeping track of the dominance relationship of some distributed multi-dimensional data. Such data keep changing from time to time. The objective is to minimize the communication between the root and the distributed data sources. Assuming that the data are chosen from the d-dimensional grid {1, 2, ..., U}^d, we give an O(d log U)-competitive algorithm for this online problem. The competitive ratio is asymptotically tight, as it is relatively easy to show an Ω(d log U) lower bound.
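The abstract does not spell out what dominance means. Under the common definition (an assumption here), a point p dominates q if p is at least q in every dimension and strictly greater in at least one. The sketch below shows only that relation on a static set of grid points; it is not the paper's filter-based online algorithm, and the function names are hypothetical.

```python
def dominates(p, q):
    """Return True if p dominates q under the assumed definition."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    at_least = all(a >= b for a, b in zip(p, q))
    strictly = any(a > b for a, b in zip(p, q))
    return at_least and strictly


def dominance_pairs(points):
    """Return all index pairs (i, j) such that points[i] dominates points[j]."""
    return {
        (i, j)
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and dominates(p, q)
    }


# Hypothetical 2-dimensional data, for illustration only.
print(dominance_pairs([(3, 4), (2, 4), (1, 5)]))
```

In the online setting, the root would want to be notified only when a data change could alter this set of pairs, which is the role the abstract assigns to the filters.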

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine | 2012

Efficient SNP-sensitive alignment and database-assisted SNP calling for low coverage samples

Ruibang Luo; Chang Yu; Chi-Man Liu; Tak Wah Lam; Thomas K. F. Wong; Siu-Ming Yiu; Ruiqiang Li; Hing-Fung Ting

We have designed and implemented an efficient tool for short read alignment that is sensitive to a given set of SNPs. In particular, it returns alignments that permit mismatches at these SNPs. We then make use of it to develop a method for detecting SNPs, which allows users to provide annotated SNPs classified in previous studies and use them to guide the execution. By focusing on alignments covering these SNPs, our method greatly accelerates the detection of SNPs at prescribed loci. The annotated SNPs also help us distinguish sequencing errors from authentic SNP alleles easily. We have compared our method with existing methods on several applications. We found that our method has higher accuracy, especially for samples with low coverage. It is also faster, and can be about two orders of magnitude faster for some applications.

BMC Bioinformatics | 2015

MICA: a fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)

Ruibang Luo; Jeanno Cheung; Edward Wu; Heng Wang; Sze-Hang Chan; Wai-Chun Law; Guangzhu He; Chang Yu; Chi-Man Liu; Dazong Zhou; Yingrui Li; Ruiqiang Li; Jun Wang; Xiaoqian Zhu; Shaoliang Peng; Tak Wah Lam
