arXiv [cs.DB], Dec
SARS-CoV-2 Coronavirus Data Compression Benchmark
Innar Liiv
Tallinn University of Technology, 15A Akadeemia Rd, 12618 Tallinn, Estonia
Abstract.
This paper introduces a lossless data compression competition that benchmarks solutions (computer programs) by the compressed size of the 44,981 concatenated SARS-CoV-2 sequences, with a total uncompressed size of 1,339,868,341 bytes. The data, downloaded on 13 December 2020 from the severe acute respiratory syndrome coronavirus 2 data hub of ncbi.nlm.nih.gov, is presented in FASTA and 2Bit format. The aim of this competition is to encourage multidisciplinary research to find the shortest lossless description for the sequences and to demonstrate that data compression can serve as an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of the data is the best model; therefore, further reducing the size of this description requires a fundamental understanding of the underlying context and data. This paper presents preliminary results with multiple well-known compression algorithms for baseline measurements, and insights regarding promising research avenues. The competition's progress will be reported at https://coronavirus.innar.com, and the benchmark is open for all to participate and contribute.
Keywords: lossless data compression · benchmark · SARS-CoV-2
Marvin Minsky considered Kolmogorov, Chaitin, and Solomonoff's algorithmic information theory "the most important discovery since Gödel" and conjectured that "practical approximations to [their theory]... would make better predictions than anything we have today" [21].

This competition is intended to encourage multidisciplinary research, in the spirit of Kolmogorov, Chaitin, and Solomonoff's theory, to develop the shortest lossless description for the sequences of SARS-CoV-2. A successful result will serve as a demonstration that data compression can offer an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of a dataset is the best model. Further compression of the sequences of SARS-CoV-2 will require a fundamental understanding of the data and its context.
I. Liiv
The main theoretical underpinnings of this benchmark are the practical approximations to Kolmogorov–Chaitin–Solomonoff complexity by Li and Vitányi [16] and the minimum description length principle [24,9].

Kolmogorov complexity is the length of the shortest effective description of an object [14]. Therefore, the idealistic goal of the SARS-CoV-2 coronavirus data compression benchmark is to find the Kolmogorov complexity of SARS-CoV-2 [29] sequences. Since doing so requires an infinite amount of work, as a practical approximation, the smallest archive plus the decompressor is considered a computable proxy. Matt Mahoney has written an inspiring and excellent rationale for a large text compression benchmark [19] with an extended discussion about the connections between intelligence and compression.

Several lossless compression benchmarks have been proposed over the years [3,11], the most well-known by Marcus Hutter, who offered €

Losslessly compress the 1.25 GB file coronavirus.fasta [17] or its 2bit representation equivalent coronavirus.2bit (0.31 GB) [17] to less than 1,238,330 bytes (the current smallest compressed size of the dataset, including the decompressor).
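To illustrate why a compressed archive acts as a computable upper bound on description length, the following minimal Python sketch compresses the same bytes with several standard-library compressors and keeps the smallest result. The function name and the toy input are hypothetical and are not part of the benchmark; the sketch also ignores the size of the decompressor, which the benchmark counts.

```python
import bz2
import lzma
import zlib


def description_length_upper_bounds(data: bytes) -> dict:
    """Compressed size in bytes for several general-purpose compressors.

    Each size is an upper bound on the length of a lossless description
    of `data` (decompressor size not included in this sketch).
    """
    return {
        "zlib": len(zlib.compress(data, 9)),
        "bz2": len(bz2.compress(data, 9)),
        "lzma": len(lzma.compress(data, preset=9)),
    }


# Hypothetical toy input: an arbitrary nucleotide string repeated many times.
toy = b"ACGTTGCAAGCTTAGGCATC" * 1000

sizes = description_length_upper_bounds(toy)
best_name = min(sizes, key=sizes.get)  # compressor giving the tightest bound
best_size = sizes[best_name]
print(best_name, best_size, "bytes for", len(toy), "input bytes")
```

A better model of the data can only lower this bound, which is the sense in which a shorter archive reflects a better understanding of the sequences.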
The data is presented in FASTA and 2Bit (UCSC-twobit [12]) format, consisting of 44,981 concatenated SARS-CoV-2 sequences with a total uncompressed size of 1,339,868,341 bytes [17]. It was downloaded on 13 December 2020 from the severe acute respiratory syndrome coronavirus 2 data hub of ncbi.nlm.nih.gov [23]. Each participant can choose which file to use; that is, the compressor does not have to work on both datasets.
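The idea behind a two-bit representation can be sketched as follows: each of the four nucleotides is stored in 2 bits, so four bases fit in one byte, which is consistent with the roughly fourfold size difference between the two files. The mapping below (T=00, C=01, A=10, G=11, first base in the most significant bits) follows the UCSC twoBit convention as an assumption of this sketch; the function names are hypothetical, and the sketch ignores file headers and any handling of non-ACGT symbols.

```python
# Assumed mapping (UCSC twoBit convention): T=00, C=01, A=10, G=11.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"  # index in this string equals the 2-bit code


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)


def unpack_bases(packed: bytes, n_bases: int) -> str:
    """Inverse of pack_bases; n_bases removes the padding of the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


seq = "ACGTA"
packed = pack_bases(seq)
restored = unpack_bases(packed, len(seq))
```

Because the sequence length must be stored separately to undo the padding, a real container format needs some header information in addition to the packed bases.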
The challenge is to compress 44,981 concatenated SARS-CoV-2 sequences. To provide a slightly simpler example, more amenable to manual inspection, the compression results for one sequence (reference sequence NC_045512 [29]) are presented in Table 1.
Table 1. Compression results for the reference sequence NC_045512 (fewer bytes is better)

Bytes  File       Format  Compressor    Parameters
7233   NC_045512  FASTA   cmix [13]
7277   NC_045512  FASTA   paq8l [18]    -8
7277   NC_045512  FASTA   GeCo3 [28]    -l 1 -lr 0.06 -hs 8
7308   NC_045512  2Bit    cmix [13]
7337   NC_045512  2Bit    brotli [1]    -q 10
7346   NC_045512  2Bit    paq8l [18]    -8
7355   NC_045512  2Bit    zstd [4]      -19
7369   NC_045512  2Bit    bcm [22]      -9
7376   NC_045512  2Bit    gzip [26]     -9
7508   NC_045512  2Bit    xz [5]        -9
7517   NC_045512  2Bit    zip [26]      -9
7524   NC_045512  2Bit    Uncompressed
                          Uncompressed
Table 2 presents the current compression results for the 44,981 SARS-CoV-2 sequences (with a total uncompressed size of 1,339,868,341 bytes), sorted by the number of bytes (fewer bytes means better compression). These results act as the baseline measurement for the challenge. The Bytes column in Table 2 does not include the size of the decompressor, which will be considered in the final benchmark. When the total size of the compressed archive and the decompressor is considered (instead of the compressed archive alone), the PAQ8L compressor [18] by Matt Mahoney performs best, with the best results achieved using the 2Bit format of the dataset. The resulting PAQ8L archive, including the compressed decompression executable, has a total size of 1,238,330 (1,207,839 + 30,491) bytes. The CMIX compressor [13] by Byron Knoll produced a smaller compressed archive (988,958 bytes), but its total size, including the compressed decompressor, is 1,282,852 (988,958 + 293,894) bytes.
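The ranking rule described above (archive size plus compressed decompressor size) can be checked with a short Python sketch. The byte counts are the ones reported in the text for the 2Bit file; the variable names are hypothetical.

```python
# (archive bytes, compressed decompressor bytes), as reported in the text.
entries = {
    "paq8l": (1_207_839, 30_491),
    "cmix": (988_958, 293_894),
}

# Benchmark score: archive plus decompressor, smaller is better.
totals = {
    name: archive + decompressor
    for name, (archive, decompressor) in entries.items()
}

winner = min(totals, key=totals.get)
for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name}: {total:,} bytes")
```

The sketch makes the trade-off explicit: cmix yields the smaller archive, but its larger decompressor outweighs the saving, so paq8l has the smaller total.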
Table 2. Compression results for 44,981 SARS-CoV-2 sequences (fewer bytes is better)

Bytes          File         Format  Compressor    Parameters
988,958        Coronavirus  2Bit    cmix [13]
1,207,839      Coronavirus  2Bit    paq8l [18]    -8
1,425,590      Coronavirus  FASTA   paq8l [18]    -8
1,985,384      Coronavirus  2Bit    xz [5]        -9
2,022,796      Coronavirus  FASTA   xz [5]        -9
2,043,140      Coronavirus  2Bit    bcm [22]      -9
2,044,664      Coronavirus  2Bit    rar [25]      m5
2,050,285      Coronavirus  FASTA   GeCo3 [28]    -l 1 -lr 0.06 -hs 8
2,367,487      Coronavirus  2Bit    zstd [4]      -19
2,728,490      Coronavirus  FASTA   bcm [22]      -9
2,871,864      Coronavirus  2Bit    brotli [1]    -q 10
2,871,864      Coronavirus  FASTA   brotli [1]    -q 10
4,217,341      Coronavirus  FASTA   zstd [4]      -19
5,924,805      Coronavirus  FASTA   rar [25]      m5
67,575,178     Coronavirus  2Bit    gzip [26]     -9
67,575,325     Coronavirus  2Bit    zip [26]      -9
75,530,790     Coronavirus  2Bit    bzip2 [27]    -9
75,530,790     Coronavirus  FASTA   bzip2 [27]    -9
77,356,405     Coronavirus  FASTA   gzip [26]     -9
77,356,550     Coronavirus  FASTA   zip [26]      -9
332,133,731    Coronavirus  2Bit    Uncompressed
1,339,868,341  Coronavirus  FASTA   Uncompressed
The sequences of the SARS-CoV-2 coronavirus are compressible. Further compression will require a mix of novel and creative approaches: moving beyond the state of the art of data compression or understanding the patterns and relationships within parts of sequences and between sequences.

The SARS-CoV-2 coronavirus data compression benchmark has a vital multidisciplinary aspect: the objective and repeatable measure in this challenge can help to align scientific breakthroughs across disciplines. Ultimately, different theories and models to understand the coronavirus are measurable through the shortest description of the dataset.

In addition, the scientific momentum around and attention paid to SARS-CoV-2 can be applied to support breakthroughs by the data compression community and advance the state of the art of compression. The techniques used for improving the compression of SARS-CoV-2 datasets can feed back to better understanding the underlying mechanisms of the coronavirus.
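The role of relationships between sequences can be illustrated with a small Python sketch on synthetic data. It assumes a random base sequence and a handful of point substitutions per variant, which is a deliberately crude stand-in for real genomes; all names and parameters are hypothetical, and zlib is used only because it is in the standard library.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def mutate(seq: str, n_changes: int) -> str:
    """Return a copy of seq with n_changes point substitutions."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n_changes):
        chars[pos] = rng.choice([b for b in "ACGT" if b != chars[pos]])
    return "".join(chars)


# Synthetic data: one random base sequence and 20 near-identical variants.
base = "".join(rng.choice("ACGT") for _ in range(2000))
variants = [mutate(base, 5) for _ in range(20)]

# Compress each variant on its own versus all variants concatenated.
separate = sum(len(zlib.compress(v.encode(), 9)) for v in variants)
together = len(zlib.compress("".join(variants).encode(), 9))

print("separate:", separate, "bytes; together:", together, "bytes")
```

Compressing the concatenation is cheaper because each variant can be described largely by reference to an earlier one, which is the kind of between-sequence structure a stronger model would exploit more fully.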
References
1. Alakuijala, J., Farruggia, A., Ferragina, P., Kliuchnikov, E., Obryk, R., Szabadka, Z., Vandevenne, L.: Brotli: A general-purpose data compressor. ACM Transactions on Information Systems (TOIS) (1), 1–30 (2018)
2. Bonfield, J.K., Mahoney, M.V.: Compression of FASTQ and SAM format sequencing data. PloS one (3), e59190 (2013)
3. Broukhis, L.: The human knowledge compression prize. URL http://mailcom.com/challenge/ (1996)
4. Collet, Y., Kucherawy, M.: Zstandard compression and the application/zstd media type. RFC 8478 (2018)
5. Collin, L.: Xz utils. URL http://tukaani.org/xz/ (2020)
6. De Maio, N., Walker, C., Borges, R., Weilguny, L., Slodkowicz, G., Goldman, N.: Issues with SARS-CoV-2 sequencing data. virological (2020)
7. Grumbach, S., Tahi, F.: Compression of DNA sequences. In: [Proceedings] DCC93: Data Compression Conference. pp. 340–350. IEEE (1993)
8. Grumbach, S., Tahi, F.: A new challenge for compression algorithms: genetic sequences. Information Processing & Management (6), 875–886 (1994)
9. Grünwald, P.D.: The minimum description length principle. MIT press (2007)
10. Hosseini, M., Pratas, D., Pinho, A.J.: A survey on data compression methods for biological sequences. Information (4), 56 (2016)
11. Hutter, M.: The human knowledge compression prize. URL http://prize.hutter1.net (2006)
12. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al.: The UCSC genome browser database. Nucleic acids research (1), 1–7 (1965)
15. Kryukov, K., Ueda, M.T., Nakagawa, S., Imanishi, T.: Sequence compression benchmark (SCB) database - a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. GigaScience (7), giaa072 (2020)
16. Li, M., Vitányi, P.: An introduction to Kolmogorov complexity and its applications. Springer (1997)
17. Liiv, I.: SARS-CoV-2 coronavirus data compression benchmark. URL https://coronavirus.innar.com (2020)
18. Mahoney, M.: Paq8 data compression program. URL http://mattmahoney.net/dc/ (2007)
19. Mahoney, M.: Rationale for a large text compression benchmark. URL http://mattmahoney.net/dc/rationale.html (2009)
20. Meacham, F., Boffelli, D., Dhahbi, J., Martin, D.I., Singer, M., Pachter, L.: Identification and correction of systematic error in high-throughput sequence data. BMC bioinformatics (11), giaa119 (2020)
29. Wu, F., Zhao, S., Yu, B., Chen, Y.M., Wang, W., Song, Z.G., Hu, Y., Tao, Z.W., Tian, J.H., Pei, Y.Y., et al.: A new coronavirus associated with human respiratory disease in China. Nature 579