bioRxiv | 2019

SuperCRUNCH: A toolkit for creating and manipulating supermatrices and other large phylogenetic datasets

 
 

Abstract


Phylogenies with extensive taxon sampling have become indispensable for many types of ecological and evolutionary studies. Many large-scale trees are based on a “supermatrix” approach, which involves amalgamating thousands of published sequences for a group. Constructing up-to-date supermatrices can be challenging, especially as new sequences become available almost constantly. However, few tools exist for assembling large-scale, high-quality supermatrices (and other large datasets) for phylogenetic analysis. Here we present SuperCRUNCH, a Python toolkit for assembling large phylogenetic datasets. It can be applied to GenBank sequences, unpublished sequences, or combinations of GenBank and unpublished data. SuperCRUNCH constructs local databases and uses them to conduct rapid searches for user-specified sets of taxa and loci. Sequences are parsed into putative loci and passed through rigorous filtering steps. A post-filtering step allows for selection of one sequence per taxon (i.e. species-level supermatrix) or retention of all sequences per taxon (i.e. population-level dataset). Importantly, SuperCRUNCH can generate “vouchered” population-level datasets, in which voucher information is used to generate multi-locus phylogeographic datasets. Additionally, SuperCRUNCH offers many options for taxonomy resolution, similarity filtering, sequence selection, alignment, and file manipulation. We demonstrate the range of features available in SuperCRUNCH by generating a variety of phylogenetic datasets. We provide examples using GenBank data and combinations of GenBank and unpublished data. Output datasets include traditional species-level supermatrices, large-scale phylogenomic matrices, and phylogeographic datasets. Finally, we briefly compare SuperCRUNCH with alternative approaches in its ability to construct species-level supermatrices.
SuperCRUNCH generated a large-scale supermatrix (1,400 taxa and 66 loci) from 16 GB of GenBank data in ∼1.5 hours, and generated population-level datasets (<350 samples, <10 loci) in <1 minute. It also outperformed alternative methods for supermatrix construction in terms of taxa, loci, and sequences recovered. SuperCRUNCH is a flexible bioinformatics toolkit that can be used to assemble datasets for any taxonomic group and scale (kingdoms to individuals). It allows rapid construction of supermatrices, greatly simplifying the process of updating large phylogenies with new data. It is also designed to produce population-level datasets. SuperCRUNCH streamlines the major tasks required to process phylogenetic data, including filtering, alignment, trimming, and formatting. SuperCRUNCH is open-source, documented, and freely available at https://github.com/dportik/SuperCRUNCH, with example analyses available at https://osf.io/bpt94/.
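
To illustrate the post-filtering step described in the abstract, the following minimal Python sketch contrasts species-level selection (one sequence per taxon) with population-level retention (all sequences per taxon). It is not SuperCRUNCH code: the record layout, the placeholder taxon names and accession labels, the function names `select_one_per_taxon` and `retain_all_per_taxon`, and the longest-sequence criterion are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records for a single locus: (taxon, accession, sequence).
# Placeholder values only; this is not SuperCRUNCH's internal format.
records = [
    ("Genus alpha", "ACC1", "ATGCATGCAT"),
    ("Genus alpha", "ACC2", "ATGCATGCATGCAT"),
    ("Genus beta", "ACC3", "ATGCA"),
]


def group_by_taxon(records):
    """Group (accession, sequence) pairs under their taxon label."""
    grouped = defaultdict(list)
    for taxon, accession, sequence in records:
        grouped[taxon].append((accession, sequence))
    return dict(grouped)


def select_one_per_taxon(records):
    """Species-level dataset: keep one sequence per taxon.

    Assumed criterion: the longest sequence wins (ties go to the first seen).
    """
    return {
        taxon: max(entries, key=lambda entry: len(entry[1]))
        for taxon, entries in group_by_taxon(records).items()
    }


def retain_all_per_taxon(records):
    """Population-level dataset: keep every sequence for every taxon."""
    return group_by_taxon(records)


species_level = select_one_per_taxon(records)
population_level = retain_all_per_taxon(records)
```

In this sketch, `species_level` holds a single (accession, sequence) pair per taxon, whereas `population_level` keeps the full list for each taxon, mirroring the two dataset types named in the abstract.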

DOI 10.1101/538728
Language English
Journal bioRxiv

Full Text