Bioinformatics | 2021

FastqCLS: a FASTQ Compressor for Long-read Sequencing via read reordering using a novel scoring model.

 
 

Abstract


MOTIVATION
Over the past decades, vast amounts of genome sequencing data have been produced, requiring enormous storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of raw FASTQ sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only.

RESULTS
We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools using benchmark datasets, and we also included new long-read sequencing data in this validation. FastqCLS outperformed these tools in compression ratio on long-read sequencing data.

AVAILABILITY AND IMPLEMENTATION
FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS.

SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.
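
The read-reordering idea in RESULTS can be illustrated with a minimal sketch. This is not the FastqCLS algorithm: the abstract does not specify the scoring model, so the `toy_score` function below (the lexicographically smallest k-mer of each read) is a hypothetical stand-in, and `zlib` is used only as a convenient general-purpose compressor. The sketch also assumes every record's separator line is a bare `+`. It shows the general pattern only: score each read, sort reads so similar ones sit together, compress, and keep the permutation so the original order can be restored without loss.

```python
import zlib
from typing import List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(text: str) -> List[Read]:
    """Parse 4-line FASTQ records; assumes the separator line is a bare '+'."""
    lines = text.strip().splitlines()
    return [(lines[i], lines[i + 1], lines[i + 3]) for i in range(0, len(lines), 4)]


def to_fastq(reads: List[Read]) -> str:
    """Serialize reads back to FASTQ text."""
    return "".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in reads)


def toy_score(seq: str, k: int = 4) -> str:
    """Hypothetical score (NOT the FastqCLS model): smallest k-mer in the read."""
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def reorder(reads: List[Read]) -> Tuple[List[Read], List[int]]:
    """Sort reads by score; 'order' maps each new position to its original index."""
    order = sorted(range(len(reads)), key=lambda i: toy_score(reads[i][1]))
    return [reads[i] for i in order], order


def restore(reordered: List[Read], order: List[int]) -> List[Read]:
    """Invert the permutation produced by reorder()."""
    original: List[Read] = [("", "", "")] * len(order)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = reordered[new_pos]
    return original


def compress(reads: List[Read]) -> Tuple[bytes, List[int]]:
    """Reorder, then compress; the permutation must be stored alongside the payload."""
    reordered, order = reorder(reads)
    return zlib.compress(to_fastq(reordered).encode(), 9), order


def decompress(payload: bytes, order: List[int]) -> List[Read]:
    """Decompress and put the reads back in their original order."""
    return restore(parse_fastq(zlib.decompress(payload).decode()), order)
```

In a real tool the stored permutation adds overhead, so reordering pays off only when the gain from grouping similar reads exceeds that cost; the choice of scoring function largely determines this trade-off.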

DOI 10.1093/bioinformatics/btab696
Language English
Journal Bioinformatics

Full Text