[PDF] Estimation of genome size using k-mer frequencies from corrected long reads

Abstract

Full PDF

EEstimation of genome size using k -mer frequencies from corrected long reads Hengchao Wang, Bo Liu, Yan Zhang, Fan Jiang, Yuwei Ren, Lijuan Yin, Hangwei Liu, Sen Wang and Wei Fan* Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, 518120, China. *e-mail: [email protected]

Abstract:

The third-generation long reads sequencing technologies, such as PacBio and Nanopore, have great advantages over second-generation Illumina sequencing in de novo assembly studies. However, due to the inherent low base accuracy, third-generation sequencing data cannot be used for k -mer counting and estimating genomic profile based on k -mer frequencies. Thus, in current genome projects, second-generation data is also necessary for accurately determining genome size and other genomic characteristics. We show that corrected third-generation data can be used to count k -mer frequencies and estimate genome size reliably, in replacement of using second-generation data. Therefore, future genome projects can depend on only one sequencing technology to finish both assembly and k -mer analysis, which will largely decrease sequencing cost in both time and money. Moreover, we present a fast light-weight tool kmerfreq and use it to perform all the k -mer counting tasks in this work. We have demonstrated that corrected third-generation sequencing data can be used to estimate genome size and developed a new open-source C/C++ k -mer counting tool, kmerfreq, which is freely available at https://github.com/fanagislab/kmerfreq. n recent years, third-generation sequencing (TGS) technologies, such as Pacific Biosciences (PacBio), have become dominant in de novo assembly of large genomes. TGS is typically known as real-time single-molecule sequencing, which uses native DNA fragments to sequence instead of template amplification, avoiding copying errors, sequence-dependent biases and information losses(1). The development of TGS assembly algorithms is also blooming, and several TGS assembly tools such as HGAP(2), Canu(3), Falcon(4), Miniasm(5), MECAT(6), wtdbg2(7) and Flye(8), have become mature and been widely adopted. Utilizing advantages of even genomic coverage and ultra-long read length, continuity of TGS assembly has surpassed any sequencing technology developed before. Although assembly quality has been largely improved with TGS, until now, there is no assembler that can generate complete genome for large plant or animal genomes with complex structure, and most genome projects still need to perform an assembly-independent estimation of genomic characteristics based on k -mer frequencies. With k -mer frequencies counted from raw reads, probability models can be built to estimate genome size, repeat content, and heterozygous rate(9, 10). In addition, it has been reported that accuracy of this method is higher than that of traditional golden standard with DNA flow cytometry(11), making it a necessary and valuable analysis in de novo genome studies. However, due to inherent low base accuracy(2, 12, 13), TGS data cannot be used for k -mer counting and estimating genomic profile based on k -mer frequencies. In fact, k -mer related analysis has always been dominated by second-generation sequencing (SGS) data, especially Illumina. Thus, in current genome projects, SGS data is also necessary for accurately estimating genome size and other genomic characteristics, increasing cost of both time and money. Even though raw TGS reads have much lower base accuracy than Illumina raw reads, the accuracy of TGS reads could be significantly improved by error correction, utilizing the multiplicity and consensus information in TGS data(14). Here, we propose a k -mer frequency based genome size estimation method with corrected TGS reads, in replacement of using SGS reads traditionally. To confirm the feasibility, we tested this method on both simulated and real data. Moreover, we present a light-weight k -mer counting program, kmerfreq, to facilitate k -mer counting in this work. ethods and Results Overview and working principle of kmerfreq k -mer counting that aims to determine the frequency of k -length substrings ( k -mers) in the sequencing data, is a basic tool for k -mer frequency based estimation and also a frequent job in many bioinformatics applications. There are several available tools, which can be classified into 4 major types by underlying algorithms, including hash table, sorting, burst tries and enhanced suffix array(15). Although the efficiency of recently published tools such as KMC3 have been largely improved, counting k -mers from large amount of reads data is still not a trivial task. For genomic characteristics estimation, de novo genome studies often use smaller k -mer size, mostly 17. One reason is that the total k -mer space (4 =16 G) is enough larger than the genome size of most common genomes and thus has the ability to store all the k -mers derived from the genomes; Another reason is that using a larger k -mer size will result in more erroneous k -mers caused by sequencing errors and then decrease efficiency of this method. In other words, the higher error rate in the sequencing data, the smaller k -mer size should be used. For this purpose, we developed a fast light-weight tool, kmerfreq, to perform k -mer counting specialized for smaller k -mer sizes (< 19), and use it to count all the k -mer frequencies shown in this work. kmerfreq operates with a fixed memory size in parallel computational mode. It adopts a dynamic array to store the frequency value of all potential k -mers with size k, using two-bytes to store each frequency value, and taking the k -mer bit-value converted from the k -mer sequence as index of the frequency array, so the total memory usage is 2 ´ k bytes. It uses the main thread to load data from disk into memory, and multiple children threads to count k -mers frequency with lock-free CAS (compare and swap) operations simultaneously. Moreover, the k -mer chopping and bit-value converting method are also optimized by exploiting the property that two successive k -mers share a (k -1) bases to enhance speed efficiency. We compared the performance of kmerfreq with other published tools at k -mer size 17, using 30 X (90 G) human Illumina data simulated by pIRS(16). kmerfreq uses moderate 32 G memory, and its speed is much higher than most published tools and commensurate with that of KMC3(17), which has been the fastest k -mer counting tool available (Table 1 and Supplementary Table 1). It is worth to note hat kmerfreq does not utilize disk to process temporary results and do not output any middle result into disk. As a result, it has higher disk efficiency than KMC3 and other tools, and is more convenient for users. Genome size estimation for synthetic datasets

The distribution of k -mer frequencies is highly influenced by the level of sequencing errors. As we known, there is a huge gap between accuracy level of TGS and SGS data, and it is not sure what level of base accuracy could be suitable for k -mer frequency based estimation. Theoretically, if signal-to-noise ratio is high enough, i.e. if peaks that formed by random sampling of genomic k -mers could be clearly observed from k -mer frequency curve, then k -mer frequency based estimations would work. To evaluate, we simulated a set of human PacBio data with gradient accuracy by PBSIM(18) and counted k -mer frequencies (k = 17) independently. From the distribution curves (Fig. 1a), we can infer, when base accuracy level is higher than 96%, sequencing data could be potentially used for k -mer frequency based estimations. Further analysis confirmed this hypothesis and showed that the higher base accuracy, the more accurate genome size estimation (Supplementary Table 2). Furthermore, we also confirmed that PacBio circular consensus sequencing (CCS) reads with over 99% average base accuracy can be used to estimate genome size with high accuracy (Supplementary Fig. 1, Supplementary Table 2 and 6). On the contrary, current TGS data only has 85% base accuracy, and error correction that aims to increase accuracy level must be performed, in order to use TGS data for k -mer frequency based estimations. Luckily, there are already several available tools for TGS error correction, such as the built-in tool in Canu package(3). Firstly, we applied this method to simulated data of 9 model species, including 1 bacterium ( Escherichia coli ), 1 fungus (

Saccharomyces cerevisiae ), 5 animals (

Caenorhabditis elegans , Drosophila melanogaster , Danio rerio , Gallus Gallus , Homo sapiens ), and 2 plants (

Arabidopsis thaliana , Oryza sativa ) (Supplementary Table 3). Considering no heterozygosity, Illumina data were simulated by pIRS, PacBio data were simulated by PBSIM (Supplementary Table 4), and corrected PacBio data were generated by a built-in tool in Canu. On distribution curves of k -mer frequency, there s no obvious peak for raw PacBio data, in contrast, clear peaks could be observed for corrected PacBio data, similar to that of Illumina data (Fig. 1b and Supplementary Fig. 2). Genome size estimations with GCE method(9) and a naive method that directly uses total k -mer individuals divided by the major peak value, both produced highly accurate genome sizes, almost all with error rate less than 3%, and the difference between using corrected PacBio and Illumina data is also less than 3% (Supplementary Table 5), suggesting that this method is highly accurate for simulated data. Genome size estimation for genuine datasets

Next, we extended this method to real sequencing data of 11 species, including 6 animals (

Caenorhabditis nigoni , Taenia multiceps , Pyrocoelia pectoralis , Sillago sinica , Chaenocephalus aceratus , Aedes aegypti ) and 5 plants (

Durio zibethinus , Oropetium thomaeum , Mikania micrantha , Panicum miliaceum , Cinnamomum kanehirae ) (Supplementary Table 6 and 7). Besides sequencing errors, real sequencing data also has heterozygosity and coverage bias problems, which complicated k -mer frequency curves. As expected, all curves of PacBio data do not show any peak, while corrected PacBio curves show similar peaks to that of Illumina data (Fig. 1c and Supplementary Fig. 3). However, there is a bigger difference on the curve shapes for real data than simulated data. Genome size estimations with naive and GCE method also showed larger difference between using corrected PacBio and Illumina data, but for most genomes corrected PacBio data can produce reliable genome size estimations (Supplementary Table 8), except for some genomes with extremely complex structure, such as Durio zibethinus and

Mikania micrantha , due to high heterozygous rate, suggesting that this method is also feasible for real sequencing data of most species.

Discussion

The cost and throughput of TGS technologies have been dramatically improved(19), approaching to that of SGS technologies. Thus, large-scale replacement of SGS with TGS occurs in most bioinformatics applications, especially in de novo genome studies. Although assembly continuity has been greatly enhanced with TGS data, k -mer frequency based estimations can still provide a beneficial upplementary information for genome profile. Here, we only tested with PacBio technology, but there should be no problem to extend the method to Oxford Nanopore Technologies (ONT) or other TGS technologies. The key and speed-limited step is error correction of TGS data, however, this situation may be changed by the endeavor to increase base accuracy of TGS, such as CCS achieved by Sequel System 6.0, and ONT R10 with a longer barrel and dual reader head. With increased base accuracy of raw data, it will become easier for error correction of TGS data, making k -mer frequency based estimation more practicable and accurate in the near future. Conclusion

We proposed and demonstrated that corrected TGS data can be applied for k -mer counting and genomic profile estimation, in replacement of using SGS data traditionally. Therefore, future genome projects can depend on only one TGS technology to finish both major assembly and supplementary k -mer frequency analysis, which will largely decrease sequencing cost in both time and labour. Moreover, we present a fast light-weight k -mer counting tool, kmerfreq, which is comparable in speed to the fastest k -mer counting tool available and also have several additional advantages to facilitate genome estimation. Acknowledgements

The work was supported by Shenzhen science and technology program (JCYJ20170303154245825); The Agricultural Science and Technology Innovation Program Cooperation and Innovation Mission (CAAS-XTCX2016).

Author contributions

The study was designed by W. F. H. W. performed benchmark of k -mer counting tools and wrote draft manuscript. B. L., Y. Z. and F. J. downloaded real sequencing data in published papers. Y. R., L. Y., H. L. and S. W. downloaded model genomes for simulation. All co-authors substantively revised the manuscript and approved the submitted version. Competing interests

The authors declare no competing interests.

References

1. Shendure J, Balasubramanian S, Church GM, Gilbert W, Rogers J, Schloss JA, et al:

DNA sequencing at 40: past, present and future . Nature . 2017;550:345. 2. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al:

Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data . Nature Methods . 2013;10(6):563-9. 3. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM:

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Genome research . 2017;27(5):722-36. 4. Chin CS, Peluso P, Sedlazeck FJ, Nattestad M, Concepcion GT, Clum A, et al:

Phased diploid genome assembly with single-molecule real-time sequencing.

Nature Methods . 2016;13(12):1050-4. 5. Li H.

Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.

Bioinformatics (Oxford, England) . 2016;32(14):2103-10. 6. Xiao C-L, Chen Y, Xie S-Q, Chen K-N, Wang Y, Han Y, et al:

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads.

Nature Methods . 2017;14:1072. 7. Ruan J, Li H:

Fast and accurate long-read assembly with wtdbg2.

Nature Methods . 2019. 8. Kolmogorov M, Yuan J, Lin Y, Pevzner PA:

Assembly of long, error-prone reads using repeat graphs.

Nature Biotechnology . 2019. 9. Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, et al:

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv e-prints [Internet]. 2013 August 01, 2013. Available from: https://ui.adsabs.harvard.edu/\

GenomeScope: fast reference-free genome profiling from short reads.

Bioinformatics (Oxford, England) . 2017;33(14):2202-4. 11. Sun H, Ding J, Piednoel M, Schneeberger K: findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies.

Bioinformatics (Oxford, England) . 2018;34(4):550-7. 12. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al:

Real-time DNA sequencing from single polymerase molecules.

Science . 2009;323(5910):133-8. 13. Schneider GF, Dekker C:

DNA sequencing with nanopores.

Nature Biotechnology . 2012;30(4):326-8. 14. Fu S, Wang A, Au KF:

A comparative evaluation of hybrid error correction methods for error-prone long reads.

Genome Biology . 2019;20(1):26. 15. Manekar SC, Sathe SR:

A benchmark study of k-mer counting methods for high-throughput equencing.

Gigascience . 2018;7(12). 16. Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, et al: pIRS: Profile-based Illumina pair-end reads simulator.

Bioinformatics (Oxford, England) . 2012;28(11):1533-5. 17. Kokot M, Długosz M, Deorowicz S:

KMC 3: counting and manipulating k-mer statistics.

Bioinformatics (Oxford, England) . 2017;33(17):2759-61. 18. Ono Y, Asai K, Hamada M:

PBSIM: PacBio reads simulator--toward accurate genome assembly.

Bioinformatics (Oxford, England) . 2013;29(1):119-21. 19. De Coster W, De Rijk P, De Roeck A, De Pooter T, D'Hert S, Strazisar M, et al:

Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome.

Genome research . 2019;29(7):1178-87.

Table 1. k -mer counting performance of various tools. Softwares Wall time (s) Max memory (G) Disk space (G) kmerfreq 1,887 32 0.0032 KMC3 1,948 30 12 Jellyfish 7,024 30 24 DSK 3,283 25 43 KCMBT 3,016 101 22 BFCounter 65,061 22 51 GenomeTester4 4,540 192 32 tallymer 116,361 199 842 KAnalyze 348,912 10 55 All tools are run in triplicate and the average of each feature are shown, except KAnalyze, which run one time, because it takes too much time to run and we suppose its statistics variation should be small. All the experiments, if not specifically mentioned, are conducted by 4 threads on a server with 2.00GHz Intel Xeon CPU E7-4830 v4 on CentOS release 6.6 system. We compared the performance of kmerfreq with other published tools at k -mer size 17, using 30 X (90 G) human Illumina data simulated by pIRS. Figure 1. The curves of k -mer distribution. (a) The distribution of k -mer frequencies from 50 X simulated PacBio datasets of Homo Sapiens with different accuracy. (b) The distribution of k -mer frequencies from 45 X simulated Illumina, corrected PacBio and 100 X raw PacBio datasets of Homo Sapiens . (c) The distribution of k -mer frequencies from 40 XIllumina, corrected PacBio and 125 X raw PacBio datasets of Taenia multiceps . upplementary materials

1. Supplementary methods k -mer counting method kmerfreq counts k -mer (with size k ) frequency from the input sequence data, typically sequencing reads data, and reference genome data is also applicable. The forward and reverse strand of a k -mer are taken as the same k -mer, and only the k -mer strand with smaller bit-value is used to represent the k -mer. It adopts a 16-bit integer with max value 65535 to store the frequency value of a unique k -mer, and any k -mer with frequency larger than 65535 will be recorded as 65535. The program stores all k -mer frequency values in a 4^ k size array of 16-bit integer (2 bytes), using the k -mer bit-value as index, so the total memory usage is 2*4^ k bytes. For k -mer size 15, 16, 17, 18, 19, it will consume constant 2G, 8G, 32G, 128G, 512G memory, respectively. kmerfreq works in a highly simple and parallel style, to achieve as fast speed as possible. One of noteworthy method is k -mer bit calculation, which exploits the property that two successive k -mers share a ( k -1) bases, to transfer read sequence into k -mer bit value. For each read, the first k -mer sequence is chopped and transferred into bit value. For next k -mer in this read, we do not chop sequence and only need to transfer the newly added base into bit value and combine with the last bit value with bit operations. In summary, kmerfreq has some important features: (1) Fixed memory: stores all k -mer frequency values in a 4^ k size array of 16-bit integer (2 bytes), using the k -mer bit-value as index. adopts a 16-bit integer with max value 65535 to store the frequency value of a unique k -mer. (2) Highly parallel: main thread load data, while the children threads simultaneously count k -mers without waiting, using lock free technology as Jellyfish. 3) k -mer operation: advanced method of chopping reads into k -mers, and converting k -mer sequence to k -mer bit values. (4) Diverse output: Distribution of k -mer frequency (default), k -mer sequence and frequency values with a cutoff (optional), compressed data structure [one-bit for a unique k -mer, 0 for low-frequency, 1 for high-frequency] (optional) k -mer counting tools 1.2.1 Download addresses: kmerfreq 4.0 ceffd13 https://github.com/fanagislab/kmerfreq KMC3 3.1.0 4287983 https://github.com/refresh-bio/KMC Jellyfish 2.2.7 2cff4e4 https://github.com/gmarcais/Jellyfish DSK 2.1.0 https://github.com/GATB/dsk KAnalyze 2.0.0 https://sourceforge.net/projects/kanalyze/files/v2.0.0/ KCMBT 1.0.0 b4d684a https://github.com/abdullah009/kcmbt_mt GenomeTester4 4.0 a9c14a6 https://github.com/bioinfo-ut/GenomeTester4 BFCounter 1.0 f7f96a1 https://github.com/pmelsted/BFCounter genometools-1.5.10 (tallymer) 1.5.10 http://genometools.org/pub/ runit.pl 5e577d7 https://github.com/ruanjue/wtdbg2/blob/master/scripts/runit.pl Canu 1.8 22a2453 https://github.com/marbl/canu Ø kmerfreq ./kmerfreq -k 17 -t 4 -p human_k17_t4 reads.lib Note that reads.lib includes pair end simulated fastq files. Ø kmc3(1) ./kmc -v -k17 -m32 -fq -ci1 -t4 -cs65535 @reads.lib human_k17_t4 ./bins /kmc_tools transform human_k17_t4 histogram human_k17_t4_histo.txt Ø jellyfish(2) ./jellyfish count -m 17 -s 10G -t 4 -o human_k17_t4.jf -C ./*fq ./jellyfish histo -t 4 human_k17_t4.jf > human_k17_t4.jf.histo Ø dsk(3) ./dsk -file reads.lib -kmer-size 17 -verbose 2 -nb-cores 4 -out human_k17_t4 -abundance-min-threshold 1 -max-memory 32000 -max-disk 200000 -abundance-min 1 Ø kcmbt(4) ./kcmbt -k 17 -t 4 -i @reads.lib ./kcmbt_dump 4 Note that reads.lib includes pair end simulated fastq files. Ø BFCounter(5) ./BFCounter count -k 17 -t 4 -o human_k17_t4 ./Homo_simulated_illumina_30X_150_350_1.fq ./Homo_simulated_illumina_30X_150_350_2.fq -n 4000000000 ./BFCounter dump -k 17 -i human_k17_t4 -o human_k17_t4.dump Ø GenomeTester4(6) ./glistmaker ./Homo_simulated_illumina_30X_150_350_1.fq ./Homo_simulated_illumina_30X_150_350_2.fq -w 17 --num_threads 4 -o human_k17_t4 Ø tallymer(7) /gt suffixerator -dna -pl -tis -suf -lcp -v -parts 4 -indexname human_30x -db ./Homo_simulated_illumina_30X_150_350_1.fq ./Homo_simulated_illumina_30X_150_350_2.fq ./gt tallymer mkindex -scan -mersize 17 -esa human_30x Ø Kanalysis(8) java -Xmx32G -jar kanalysis-2.0.0/kanalyze.jar count -f fastq -k 17 -o human_k17_t4 -rcanonical -t 4 ./Homo*.fq . Supplementary tables Table 1. Benchmark of k-mer counting tools

Software Wall time (s) User time (s) System time (s) CPU time (s) Maxrss (kb) Maxvsz (kb) Disk kmerfreq 1,887 7,025 101 7,077 33,565,732 33,883,036 3.2M KMC3 1,948 6,588 305 6,893 31,389,429 32,040,304 12G Jellyfish 7,024 27,109 157 26,999 31,961,517 32,277,624 24G DSK 3,283 12,118 441 12,559 25,711,087 39,644,856 43G KCMBT 3,016 5,832 737 6,569 105,861,517 179,434,071 22G BFCounter 65,061 123,303 207 123,510 23,185,739 23,412,661 51G GenomeTester4 4,540 13,306 92 13,398 200,881,667 247,748,893 32G tallymer 116,361 113,817 2,307 116,124 208,334,960 208,402,364 842G KAnalyze 348,912 751,653 224,463 976,116 10,580,400 49,545,668 55G All tools are run in triplicate and the average of each feature are shown, except KAnalyze, which run one time, because it takes too much time to run and we suppose its statistics variation should be small. All the experiments, if not specifically mentioned, are conducted by 4 threads on a server with 2.00GHz Intel Xeon CPU E7-4830 v4 on CentOS release 6.6 system. Maxrss (non-swapped physical memory used). Maxvsz (max virtual size). We compared the performance of kmerfreq with other published tools at k -mer size 17, using 30 X (90 G) human Illumina data simulated by pIRS(9). Table 2. Genome size estimated by naive method and GCE method genome size CCS dataset 96% accuracy 97% accuracy 98% accuracy 99% accuracy Estimated genome size by naive method 3,095,950,843 3,111,052,276 3,925,043,827 3,584,550,659 3,415,046,927 3,269,115,270 Estimated genome size by GCE 2,975,070,000 4,061,810,000 3,814,300,000 2,985,250,000 3,172,110,000

Naive method directly uses total number of k -mers divided by the major peak value in the k -mer frequency curve. able 3. Model genomes used in this paper Sample Species Reference genome

H. sap Homo sapiens

GRCh38.p12

A. tha Arabidopsis thaliana

TAIR10.1

C. ele

Caenorhabditis elegans WBcel235

D. rer Danio rerio

GRCz11

D. mel Drosophila melanogaster

Release 6 plus ISO1 MT

E. col Escherichia coli

ASM584v2

G. gal Gallus Gallus

GRCg6a

O. sat Oryza sativa

Build 4.0

S. cer Saccharomyces cerevisiae

R64 The genomic sequences are used to simulate sequencing datasets.

Table 4. Simulated datasets of model genomes

Sample No. of PacBio reads No. of PacBio bases PacBio mean read length No. of Illumina reads No. of Illumina bases Illumina read length

H. sap

A. tha

C. ele

D. rer

D. mel

E. col

G. Gal

O. sat

S. cer able 5. Genome estimation from simulated datasets of model genomes

Sample genome size naive method GCE method Corrected PacBio (% of difference) Illumina (% of difference) Corrected PacBio (% of difference) Illumina (% of difference)

H. sap

A. tha

C. ele

D. rer

D. mel

E. col

G. gal

O. sat

S. cer k -mers divided by the major peak value in the k -mer frequency curve. % of difference = |" . able 6. Genuine datasets of recently published papers used in this paper Sample Species Illumina accession PacBio accession Citation Reference Genome Contig N50 Size

C. nig Caenorhabditis nigoni

SRX3302872 SRX3302873 (10) nigoni.pc_2016.07.14 3.3 Mb

D. zib Durio zibethinus (fruit durian) SRX3204603 SRX3185921 (11) Duzib1.0 549,783

T. mul Taenia multiceps (tapeworm) SRX1531874, SRX1626286, SRX1626287, SRX1626288 SRX1531851, SRX3851532, SRX3851533, SRX3856587 (12) Gns01 756,412

O. tho Oropetium thomaeum

SRX1078264~ SRX1078287 SRX1055085 (13) GCA_001182835.1 2,386,382

M. mic Mikania micrantha

SRR8835135~ SRR8835137 SRR8834228~ SRR8834566 (14) ASM936387v1 1.35 Mb

P. mil Panicum miliaceum (broomcorn millet) SRX3628188 SRX3628187 (15) Pm_0390_v2 368,640

C. kan Cinnamomum kanehirae (Stout camphor tree) SRX4287864 SRX4287873 (16) ASBRC_Ckan_1.0 498,920

P. pec Pyrocoelia pectoralis (firefly) SRX3048410 SRX3048411 (17) NA 3.04 Mb

S. sin Sillago sinica (Chinese sillago) SRX3907118, SRX3907120~ SRX3907123 SRX3907119, SRX3907114~ SRX3907117 (18) NA 2.6 Mb

C. ace Chaenocephalus aceratus (Antarctic blackfin icefish) ERR1473912~ ERR1473914 SRR6942631, SRR6942632 (19) NA 1.5 Mb

A. aeg Aedes aegypti (yellow fever mosquito) SRX3231347 SRX2998003~ SRX2998178 (20) AaegL5.0 11,758,062

H. sap Homo sapiens (human) CCS dataset https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/consensusreads/ NA: not available. able 7. Genuine genome datasets in recently published papers

Species No. of PacBio reads No. of PacBio bases PacBio mean read length No. of Illumina reads No. of Illumina bases Illumina mean read length

C. nig

D. zib

T. mul

O. tho

M. mic

P. mil

C. kan

P. pec

S. sin

C. ace

A. aeg able 8. Genome size estimation for genuine datasets

Sample Reported genome size Naive method GCE method Corrected PacBio (% of difference) Illumina (% of difference) Corrected PacBio (% of difference) Illumina (% of difference)

C. nig

D. zib

T. mul

O. tho

M. mic

P. mil

C. kan

P. pec

S. sin

C. ace

A. aeg

A. aeg and

P. pec have very high heterozygous rate, thus we run GCE with “-H 1 -c peak”. Here, “peak” means homogenous peak of k -mer frequency curve. Naive method directly uses total number of k -mers divided by the major peak value in the k -mer frequency curve. For real datasets, accurate genome size could not be known, and we just took the estimation in published papers as a comparison. % of difference = |" . . Supplementary figures Figure 1. The distribution of k -mer frequency for CCS dataset of human genome. The peak frequency is 14 and the estimated genome size is 3,111,052,276. P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

CCS dataset from Homo sapiens

Figure 2. The curves of k -mer frequency in model genomes from simulated illumina, raw pacbio and corrected pacbio data. (a) The distributions of k -mer frequency of Arabidopsis thaliana from illumina, raw pacbio and corrected pacbio data respectively. (b) The distributions of k -mer frequency of Caenorhabditis elegans from illumina, raw pacbio and corrected pacbio data respectively. (c) The distributions of k -mer frequency of Danio rerio from illumina, raw pacbio and corrected pacbio data respectively. (d) The distributions of k -mer frequency of Drosophila melanogaster from illumina, raw pacbio and corrected pacbio data respectively. (e) The distributions of k -mer frequency of Escherichia coli from illumina, raw pacbio and corrected pacbio data respectively. (f) The distributions of k -mer frequency of Gallus Gallus from illumina, raw pacbio and corrected pacbio data respectively. (g) The distributions of k -mer frequency of Oryza sativa from illumina, raw pacbio and corrected pacbio data respectively. (h) The distributions of k -mer frequency of Saccharomyces cerevisiae from illumina, raw pacbio and corrected pacbio data respectively. P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Arabidopsis thaliana illuminacorrected pacbioraw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Caenorhabditis elegans illuminacorrected pacbioraw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Danio rerio illuminacorrected pacbioraw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Drosophila melanogaster illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Escherichia coli illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Gallus gallus illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Oryza sativa illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Simulated data from Saccharomyces cerevisiae illumina corrected pacbio raw pacbio (a) (b) (c)(d) (e) (f)(g) (h)

Figure 3.

The curves of k -mer frequency in genuine genomes from illumina, raw pacbio and corrected pacbio data. (a) The distributions of k -mer frequency of Caenorhabditis nigoni from illumina, raw pacbio and corrected pacbio data respectively. (b) The distributions of k -mer frequency of Durio zibethinus from illumina, raw pacbio and corrected pacbio data respectively. (c) The distributions of k -mer frequency of Oropetium thomaeum from illumina, raw pacbio and corrected pacbio data respectively. (d) The distributions of k -mer frequency of Mikania micrantha from illumina, raw pacbio and corrected pacbio data respectively. (e) The distributions of k -mer frequency of Panicum miliaceum from illumina, raw pacbio and corrected pacbio data respectively. (f) The distributions of k -mer frequency of Cinnamomum kanehirae from illumina, raw pacbio and corrected pacbio data respectively. (g) The distributions of k -mer frequency of Pyrocoelia pectoralis from illumina, raw pacbio and corrected pacbio data respectively. (h) The distributions of k -mer frequency of Sillago sinica from illumina, raw pacbio and corrected pacbio data respectively. (i) The distributions of k -mer frequency of P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Panicum miliaceum illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Caenorhabditis nigoni illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Durio zibethinus illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Oropetium thomaeum illumina corrected pacbio raw pacbio (a) (b)(e) (c) P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Mikania micrantha illumina corrected pacbio raw pacbio (d) P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Cinnamomum kanehirae illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Pyrocoelia pectoralis illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Sillago sinica illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Chaenocephalus aceratus illumina corrected pacbio raw pacbio P e r ce n t i n t o t a l ( % ) Kmer frequency (K = 17)

Real data from Aedes aegypti illumina corrected pacbio raw pacbio (f)(g) (h)(j) (i) haenocephalus aceratus from illumina, raw pacbio and corrected pacbio data respectively. (j) The distributions of k -mer frequency of Aedes aegypti from illumina, raw pacbio and corrected pacbio data respectively.

Supplementary References

1. Kokot M, Długosz M, Deorowicz S:

KMC 3: counting and manipulating k-mer statistics.

Bioinformatics (Oxford, England) . 2017;33(17):2759-61. 2. Marcais G, Kingsford C:

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Bioinformatics (Oxford, England) . 2011;27(6):764-70. 3. Rizk G, Lavenier D, Chikhi R:

DSK: k-mer counting with very low memory usage.

Bioinformatics (Oxford, England) . 2013;29(5):652-3. 4. Mamun AA, Pal S, Rajasekaran S:

KCMBT: a k-mer Counter based on Multiple Burst Trees.

Bioinformatics (Oxford, England) . 2016;32(18):2783-90. 5. Melsted P, Pritchard JK:

Efficient counting of k-mers in DNA sequences using a bloom filter.

BMC bioinformatics . 2011;12:333. 6. Kaplinski L, Lepamets M, Remm M:

GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists.

Gigascience . 2015;4:58. 7. Kurtz S, Narechania A, Stein JC, Ware D:

A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes.

BMC genomics . 2008;9:517. 8. Audano P, Vannberg F:

KAnalyze: a fast versatile pipelined k-mer toolkit.

Bioinformatics (Oxford, England) . 2014;30(14):2070-2. 9. Hu X, Yuan J, Shi Y, Lu J, Liu B, Li Z, et al: pIRS: Profile-based Illumina pair-end reads simulator.

Bioinformatics (Oxford, England) . 2012;28(11):1533-5. 10. Yin D, Schwarz EM, Thomas CG, Felde RL, Korf IF, Cutter AD, et al:

Rapid genome shrinkage in a self-fertile nematode reveals sperm competition proteins.

Science . 2018;359(6371):55-61. 11. Teh BT, Lim K, Yong CH, Ng CCY, Rao SR, Rajasegaran V, et al:

The draft genome of tropical fruit durian (Durio zibethinus) . Nature Genetics . 2017;49(11):1633-41. 12. Li W, Liu B, Yang Y, Ren Y, Wang S, Liu C, et al:

The genome of tapeworm Taenia multiceps sheds light on understanding parasitic mechanism and control of coenurosis disease.

DNA research . 2018;25(5):499-510. 13. VanBuren R, Bryant D, Edger PP, Tang H, Burgess D, Challabathula D, et al:

Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum.

Nature . 2015;527(7579):508-11. 14. Liu B, Yan J, Li W, Yin L, Li P, Yu H, et al:

Mikania micrantha genome provides insights into the molecular mechanism of rapid growth.

Nature Communications . 2020;11(1):340. 5. Zou C, Li L, Miki D, Li D, Tang Q, Xiao L, et al:

The genome of broomcorn millet.

Nature Communications . 2019;10(1):436. 16. Chaw SM, Liu YC, Wu YW, Wang HY, Lin CI, Wu CS, et al:

Stout camphor tree genome fills gaps in understanding of flowering plant genome evolution.

Nature Plants . 2019;5(1):63-73. 17. Fu X, Li J, Tian Y, Quan W, Zhang S, Liu Q, et al:

Long-read sequence assembly of the firefly Pyrocoelia pectoralis genome.

Gigascience . 2017;6(12):1-7. 18. Xu S, Xiao S, Zhu S, Zeng X, Luo J, Liu J, et al:

A draft genome assembly of the Chinese sillago (Sillago sinica), the first reference genome for Sillaginidae fishes.

Gigascience . 2018;7(9). 19. Kim BM, Amores A, Kang S, Ahn DH, Kim JH, Kim IC, et al:

Antarctic blackfin icefish genome reveals adaptations to extreme environments.

Nature Ecology & Evolution . 2019;3(3):469-78. 20. Matthews BJ, Dudchenko O, Kingan SB, Koren S, Antoshechkin I, Crawford JE, et al:

Improved reference genome of Aedes aegypti informs arbovirus vector control.

Nature . 2018;563(7732):501-7.. 2018;563(7732):501-7.