Bioinformatics | 2019

SolidBin: improving metagenome binning with semi-supervised normalized cut


Abstract


MOTIVATION
Metagenomic contig binning is an important computational problem in metagenomic research. It aims to cluster contigs from the same genome into the same group. Unlike classical clustering problems, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, the current state-of-the-art contig binning methods make little use of biological information beyond the coverage and sequence composition of the contigs.

RESULTS
We developed a novel contig binning method, SolidBin (Semi-supervised Spectral Normalized Cut for Binning), based on semi-supervised spectral clustering. Using sequence feature similarity and/or additional biological information, such as the reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. A must-link constraint means that a pair of contigs should be clustered into the same group, while a cannot-link constraint means that a pair of contigs should be clustered into different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut (NCut), for improved contig binning. The performance of SolidBin is compared with five state-of-the-art genome binners, CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C, on five next-generation sequencing (NGS) benchmark datasets, including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that SolidBin achieves the best performance in terms of F-score, ARI and NMI, especially on the real datasets and the single-sample dataset.

AVAILABILITY
https://github.com/sufforest/SolidBin

SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
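To make the idea of must-link and cannot-link constraints concrete, the following is a minimal sketch of constrained two-way normalized cut on toy data. It is not SolidBin's actual formulation, which the abstract does not spell out. Instead it uses a simple affinity-editing heuristic: must-link pairs get the maximum affinity and cannot-link pairs get zero affinity before the spectral step. The function names, the Gaussian kernel, the `sigma` value and the one-dimensional toy features are all assumptions made for illustration.

```python
import numpy as np


def gaussian_affinity(features, sigma=0.5):
    """Pairwise Gaussian similarity between feature rows (zero diagonal)."""
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    return w


def apply_constraints(w, must_link=(), cannot_link=()):
    """Affinity-editing heuristic (an assumption, not SolidBin's objective):
    must-link pairs get affinity 1, cannot-link pairs get affinity 0."""
    w = w.copy()
    for i, j in must_link:
        w[i, j] = w[j, i] = 1.0
    for i, j in cannot_link:
        w[i, j] = w[j, i] = 0.0
    return w


def two_way_ncut(w):
    """Relaxed two-way normalized cut: split on the sign of the second
    generalized eigenvector of (D - W) y = lambda * D y.
    Assumes every node has a positive degree."""
    degree = w.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    lap_sym = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap_sym)  # eigenvalues in ascending order
    y = inv_sqrt * vecs[:, 1]          # map back to the generalized problem
    return (y > 0).astype(int)


# Toy "contigs": indices 0-2 form one group, 3-5 another, and index 6
# lies between them, slightly closer to the second group.
features = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4], [1.3]])
w = gaussian_affinity(features)

# Unconstrained clustering versus clustering with hypothetical prior
# knowledge that contig 6 belongs with contig 0 and not with contig 3.
labels_free = two_way_ncut(w)
labels_semi = two_way_ncut(
    apply_constraints(w, must_link=[(0, 6)], cannot_link=[(3, 6)])
)
print("unconstrained:", labels_free)
print("constrained:  ", labels_semi)
```

In this toy setting the ambiguous contig follows its nearest neighbours when no constraints are given and moves to the other bin once the two constraints are applied. The cluster labels 0 and 1 are arbitrary, because the sign of an eigenvector is not fixed, so only the grouping is meaningful.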

DOI 10.1093/bioinformatics/btz253
Language English
Journal Bioinformatics

Full Text