[PDF] An Improved Filtering Algorithm for Big Read Datasets

Abstract

For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Titus Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k -mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new feature is the use of phred quality scores together with a detailed analysis of the k -mer counts to decide which reads to keep. With recommended parameters, in terms of median we remove 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SDAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. We conclude that read filtering is a practical method for reducing read data and for speeding up the assembly process. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at this https URL

Full PDF

AAn Improved Filtering Algorithmfor Big Read Datasets

Axel Wedemeyer ˚ , Lasse Kliemann : , Anand Srivastav , Christian Schielke ,Thorsten B. Reusch , and Philip Rosenstiel Department of Computer Science, Kiel University, Christian-Albrechts-Platz 4,24118 Kiel, Germany Marine Ecology, GEOMAR Helmholtz Centre for Ocean Research Kiel,Düsternbrooker Weg 20, 24105 Kiel, Germany Institute of Clinical Molecular Biology, Kiel University, Schittenhelmstr. 12, 24105 Kiel,GermanyNovember 8, 2018

Abstract

For single-cell or metagenomic sequencing projects, it is necessary to sequence witha very high mean coverage in order to make sure that all parts of the sample DNA getcovered by the reads produced. This leads to huge datasets with lots of redundant data. Aﬁltering of this data prior to assembly is advisable. Titus Brown et al. (2012) presented thealgorithm Diginorm for this purpose, which ﬁlters reads based on the abundance of their k -mers. We present Bignorm, a faster and quality-conscious read ﬁltering algorithm. Animportant new feature is the use of phred quality scores together with a detailed analysis ofthe k -mer counts to decide which reads to keep. With recommended parameters, in termsof median we remove 97.15% of the reads while keeping the mean phred score of the ﬁltereddataset high. Using the SDAdes assembler, we produce assemblies of high quality from theseﬁltered datasets in a fraction of the time needed for an assembly from the datasets ﬁlteredwith Diginorm. We conclude that read ﬁltering is a practical method for reducing readdata and for speeding up the assembly process. Our Bignorm algorithm allows assembliesof competitive quality in comparison to Diginorm, while being much faster. Bignorm isavailable for download at https://git.informatik.uni-kiel.de/axw/Bignorm.git ˚ [email protected] : [email protected] a r X i v : . [ q - b i o . GN ] O c t Background

Next generation sequencing systems (such as the Illumina platform) tend to produce anenormous amount of data — especially when used for single-cell or metagenomic protocols —of which only a small fraction is essential for the assembly of the genome. It is thus advisableto ﬁlter that data prior to assembly.

In order to describe our algorithm and its comparison, we need some formal deﬁnitions andconcepts. Denote N : “ { , , , . . . } the set of non-negative integers, and for each n P N denote r n s : “ { , . . . , n } the integers from 1 to n (including 1 and n ). Denote Σ : “ { A , C , G , T , N } thealphabet of nucleotides plus the symbol N used to indicate an undetermined base. By Σ ˚ wedenote all the ﬁnite strings over Σ, and for a k P N by Σ k all the strings over Σ of exactlylength k . For v P Σ ˚ , denote | v | P N its length and ¯ v P Σ ˚ its reverse complement. For v, w P Σ ˚ ,we write v – w if | v | “ | w | and the two strings are equal up to places where either of them hasthe N symbol.The input to the ﬁlter algorithm is a dataset D “ p n, m, R, Q q where for each i P r n s wehave: • m p i q P N : a ﬂag for an unpaired ( m p i q “

1) or paired ( m p i q “

2) dataset; • R p i, s q P Σ ˚ for each s P r m p i qs : the set of reads in the dataset; • Q p i, s q P Z | R p i,s q | for each s P r m p i qs : the set of corresponding phred scores .Each read i P r n s consists of m p i q read strings R p i, q , . . . , R p i, m p i qq . For t P r | R p i, s q | s “ { , . . . , | R p i, s q |} we denote the nucleotide at position t in read string R p i, s q by R t p i, s q and itsphred score by Q t p i, s q . Note that in terms of read strings, D may contain the “same” readmultiple times (perhaps with diﬀerent quality values), that is, there can be i ‰ j such that R p i q “ R p j q . Hence it is beneﬁcial that we refer to reads by their indices 1 , . . . , n .Denote x P Σ ˚ the genome from which the reads were obtained and g : “ | x | its length. (For thepurpose of this exposition, we simplify by assuming the genome is a single string.). For eachlocus ‘ P r g s , the coverage c ‘ p D q of ‘ with respect to D is informally described as the numberof read strings that were or could have been produced by the sequencing machine while readinga part of x that contains locus ‘ . More precisely, for each v P Σ ˚ deﬁne Ź c ‘ p v q : “ w of x which contains locus ‘ and satisﬁes v – w or v – ¯ w ; Ź c ‘ p v q : “ c ‘ p D q : “ n X i “ m p i q X s “ c ‘ p R p i, s qq A coverage of c ‘ p D q «

20 for each ‘ P r g s has been empirically determined as optimal for asuccessful assembly of x from D [37]. On the other hand, in many setups, the coverage for alarge number of loci is much higher than 20, often rising up to tens or hundreds of thousands,especially for single-cell or metagenomic protocols (see Table 3, “max” column for the maximalcoverage of the datasets that we use in our experiments). In order to speed up the assemblyprocess — or in extreme cases to make it possible in the ﬁrst place, given certain restrictions onavailable RAM and/or time — a sub-dataset D “ p n , m , R , Q q of D should be determinedsuch that n is much smaller than n while not losing essential information. The goal is thatusing D , an assembly of similar quality than using D is possible. We only consider the naturalapproach to create D by making a choice for each i P r n s whether to include read i in D or not, so in particular p R p q , . . . , R p n qq will be a sub-vector of p R p q , . . . , R p n qq . When weinclude a read in D , we also say that it is accepted , whereas when we exclude it, we say it is rejected . On an abstract level, a ﬁltered dataset based on D can be speciﬁed by giving a set ofindices A Ď r n s that consists of exactly the accepted reads.Many popular assemblers, such as SPAdes [16], Platanus [27], or Allpaths-LG [24], work withthe de Bruijn graph , that is based on k -mers. Fix a parameter k P N ; typically 21 ď k . The setof k -mers of a string v P Σ ˚ , denoted M p v, k q Ď Σ k , is the set of all strings of length k that aresubstrings of v . Sometimes we need to consider a k -mer multiple times if it occurs in multipleplaces in the string, and the corresponding set is denoted: M ˚ p v, k q : “ n p µ, p q P Σ k ˆ N ; µ is a substring of v starting at position p o For a read i P r n s and read string s P r m p i qs deﬁne M p i, s, k q : “ M p R p i, s q , k q and M p i, k q : “ ⋃ m p i q s “ M p i, s, k q , so M p i, k q are all the k -mers that occur in any of the m p i q read strings of R p i q .Denote also M ˚ p i, s, k q : “ M ˚ p R p i, s q , k q . We brieﬂy survey two prior approaches for read pre-processing, namely trimming and errorcorrection . Read trimming programms (see [21] for a recent review) try to cut away the lowquality parts of a read (or drop reads whose overall quality is low). These algorithms can beclassiﬁed in two groups: running sum (Cutadapt, ERNE, SolexaQA with -bwa option) [19, 31, 32]and window based (ConDeTri, FASTX, PRINSEQ, Sickle, SolexaQA, and Trimmomatic) [12,17, 19, 26, 35, 36]. The running sum algorithms take a quality threshold Q as input, which issubtracted from the phred score of each base of the read. The algorithms vary in the functionsapplied to the diﬀerences to determine the quality of a read, the direction in which the read isprocessed, the function’s quality threshold upon which the cutoﬀ point is determined, and theminimum length of a read after the cutoﬀ to be accepted.3he window based algorithms on the other hand ﬁrst cut away the reads’s 3’ or 5’ ends(depending on the algorithm) whose quality is below a speciﬁed minimum quality parameterand then determine a contiguous sequence of high quality using techniques similar to thoseused in the running sum algorithms.All of these trimming algorithms generally work on a per-read basis, reading the input onceand processing only a single read at a time. The drawback of this approach is that low qualitysequences within a read are being dropped even when these sequences are not covered by anyother reads whose quality is high. Also the phred score of a base is not independent betweenreads, i. e., a base whose phred score is low in one read is likely to have a low phred score inother reads as well and thus this low quality segment might get dropped altogether, creatinguncovered regions. On the other hand sequences whose quality and abundance are high areadded over and over although their coverage is already high enough, which yields higher memoryusage than necessary.Most of the error correction programs (see [15] for a recent review) read the input twice: a ﬁrstpass gathers statistics about the data (often k -mer counts) which in a second pass are usedto identify and correct errors. Some programs trim reads which connot be corrected. Again,coverage is not a concern: reads which seem to be correct or which can be corrected are alwaysaccepted. According to [15], the probably best known and most used error correction programis Quake [29]. Its algorithm is based on two assumptions: • “For suﬃciently large k , almost all single-base errors alter k -mers overlapping the error toversions that do not exist in the genome. Therefore, k -mers with low coverage, particularlythose occurring just once or twice, usually represent sequencing errors.” • Errors follow a Gamma distribution, whereas true k -mers are distributed as per a combinationof the Normal and the Zeta distribution.In the ﬁrst pass of the program, a score (based on the phred quality scores of the individualnucleotides) is computed for each k -mer. After this, Quake computes a coverage cutoﬀ value,that is, the local minimum of the k -mer spectrum between the Gamma and the Normal maxima.All k -mers having a score higher than the coverage cutoﬀ are considered to be correct ( trusted or solid in error correction terminology), the others are assumed to be erroneous. In a secondpass, Quake reads the input again and tries to replace erroneous k -mers by trusted ones usinga maximum likelihood approach. Reads which cannot be corrected are optionally trimmed ordumped.But the main goal of error correctors is not the reduction of the data volume (in particular,they do not pay attention to excessive coverage), hence they cannot replace the followingapproaches.Titus Brown et al. invented an algorithm named Diginorm [37, 39] for read ﬁltering that rejectsor accepts reads based on the abundance of their k -mers. The name Diginorm is a short formfor digital normalization : the goal is to normalize the coverage over all loci, using a computer4lgorithm after sequencing. The idea is to reject those reads which mainly bring k -mers thathave been seen many times in other reads already. Diginorm processes reads one by one. Letthe read currently processed be i P r n s . For each k -mer µ P Σ k , deﬁne c p µ, i q : “ (cid:12)(cid:12)(cid:12)n j P N ; p j ă i q and p µ P M p j, k qq and (read j was previously accepted) o(cid:12)(cid:12)(cid:12) ,which says in how many accepted reads we have seen the k -mer µ so far. In order to save RAM,Diginorm does not keep track of those numbers exactly, but instead keeps appropriate estimates b c p µ, i q using the count-min sketch (CMS) [18]. For each i P r n s and s P r m p i qs denote the vector C p i, s q : “ p b c p µ, i qq µ P M p i,s,k q . The read i is accepted if the median of the numbers in C p i, s q isbelow a ﬁxed threshold, usually 20, for each s P r m p i qs . It was demonstrated that successfulassemblies are still possible after Diginorm removed the majority of the data. Diginorm is a pioneering work. However, the following points, which are important from thebiological or computational quality point of view, are not covered in Diginorm. We presentthem as an enhancement in our work:(i) We incorporate the important phred quality score into the decision whether to accept or toreject a read, using a quality threshold. This allows a tuning of the ﬁltering process towardshigh-quality assemblies, by using diﬀerent thresholds.(ii) When deciding whether to accept or to reject read i , we do a detailed analysis of the numbersin the vectors C p i, s q . Diginorm merely considers their medians.(iii) We oﬀer a better handling of the N case, that is, when the sequencing machine could notdecide for a particular nucleotide. Diginorm simply converts all N to A , which can lead tofalse k -mer counts. (iv) We provide a substantially faster implementation. For example, we include fast hashingfunctions (see [22, 38]) for counting k -mers through the count-min sketch data structure(CMS), and we use the C programming language and OpenMP.A detailed description of our algorithm, called Bignorm , is given in the next section. Its namewas chosen to emphasize the goal of drastically reducing massive datasets.Bignorm, like Diginorm, is based on the count-min sketch (CMS) for counting k -mers. CMS isa probabilistic data structure for counting objects from a large universe. We give a brief andabstract description. Let a “ p a , . . . , a N q P N N be a vector, given implicitly as a sequence ofupdates of the form p p, ∆ q with p P r N s and ∆ P N . Each update p p, ∆ q modiﬁes a in the way a p : “ a p ` ∆; where initially a “ p , . . . , q . If ∆ “ We have observed some evidence that this may lead to a spuriously higher GC content. This will beinvestigated in future work.

5f the vector a is that we count how many times we observe each of the objects identiﬁedby the numbers in r N s . If N is large, e. g., if N is the number 4 k of all possible k -mers (wedo not count k -mers with N symbols), then we may not be able to store a in RAM. (Forexample, the typical choice of k “

21 brings a into terabyte range; in our experiments we use k “ width m P N and the depth t P N and store amatrix of m ¨ t CMS counters c p,q with p P r m s and q P r t s . Moreover, we randomly draw t hash functions h , . . . , h t from a universal family. Each h q maps from r N s to r m s . Initially, allcounters in the matrix are zero. Upon arrival of an update p p, ∆ q , for each row q P r t s we update c h q p p q ,q : “ c h q p p q ,q ` ∆. That is, for each row q we use the hash function h q to map from thelarger space r N s (from which the index p comes) to the smaller space r m s of possible positionsin the row. Denote b a p : “ min { c h p p q , , . . . , c h t p p q ,t } . (1)Then it can be proved [18] that b a p is an estimate of a p in the following sense: clearly a p ď b a p ,and with probability at least 1 ´ e ´ t we have b a p ď em ´ P Nj “ a j . The probability is over thechoice of hash functions. For example, choosing t : “

10 is enough to push the error probability,upper-bounded by e ´ t , below 0 . N “ k is the number of possible k -mers (without N symbols) and weimplement a bijection β : Σ k ÝÑ r N s , so we can identify each k -mer µ by a number β p µ q P r N s .Upon accepting some read i , we update the CMS counters using all the updates of the form p β p µ q , q with µ P M p i, k q not containing the N symbol, that is, for each such µ we increase thecount β p µ q by ∆ “

1. Then when all the reads 1 , . . . , i ´ c p µ, i q corresponds to the entry a β p µ q in the vector a used in the description of CMS, andfor the estimate b c p µ, i q we can use the estimate b a β p µ q as given in (1). We give a detailed description of our enhancements (i) to (iv) that were brieﬂy lined out on thepreceding page. Although most of the settings are generic, in some places we assume that datacomes from the Illumina.We start with (i), (ii), and (iii). Fix a read i P r n s and a read string s P r m p i qs . Recall that foreach t P r | R p i, s q | s the nucleotide R t p i, s q at position t in the read string R p i, s q is associatedwith a quality value Q t p i, s q known as phred score . We want to assign a single value Q p i, s, µ, p q to each p µ, p q P M ˚ p i, s, k q . We do so by taking the minimum phred score over the nucleotidesin µ when aligned at position p , that is: Q p i, s, µ, p q : “ p ` k ´ min t “ p Q t p i, s q ( µ occurs on the right-hand side only implicitely through its length k .)6ix the following parameters: • N -count threshold N P N , which is 10 by default; • quality threshold Q P Z , which is 20 by default; • rarity threshold c P N , which is 3 by default; • abundance threshold c P N , which is 20 by default; • contribution threshold B P N , which is 3 by default.When our algorithm has to decide whether to accept or to reject a read i P r n s , it performs thefollowing steps. If the number of N symbols counted over all m p i q read strings in i is larger than N , the read is rejected right away. Otherwise, for each s P r m p i qs deﬁne the set of high-quality k -mers : H p s q : “ n p µ, p q P M ˚ p i, s, k q ; p Q ď Q p i, s, µ, p qq and ( µ does not contain N ) o We determine the contribution of R p i, s q to k -mers of diﬀerent frequencies: b p s q : “ |{ p µ, p q P H p s q ; b c p µ, i q ă c }| b p s q : “ |{ p µ, p q P H p s q ; c ď b c p µ, i q ă c }| Note that the frequencies are determined via CMS counters and do not consider the position p at which the k -mer is found in the read string. The read i is accepted if and only if at least oneof the following conditions is met: b p s q ą k for at least one read string s (2) m p i q X s “ b p s q ě B (3)If the read is accepted, then for each µ P M p i, k q the corresponding CMS counter is incre-mented, provided that µ does not contain the N symbol. Then processing of the next readstarts.The rationale for condition (2) is as follows. If a k -mer is seen less than c times, we suspect itto be the result of a read error. However, if more than k k -mers in a read string contain anerror, this read string must have more than one erroneous nucleotide. This is not likely for theIllumina platform, since there, most errors are single substitutions [29]. So if b p s q ą k for some s , then the read string R p i, s q should be assumed to correctly contain a rare k -mer, so it mustnot be ﬁltered out.Condition (3) says that in the read i , there are enough (namely at least B ) k -mers where eachof them is too frequent to be a read error (CMS counters at least c ) but not so abundant thatit should be considered redundant (CMS counters less than c ).7his concludes the description of (i), (ii), and (iii), namely how we analyze the counts in C p i, s q “p b c p µ, i qq µ P M p i,s,k q for each read i and s P r m p i qs , how we incorporate quality information, andhow we handle the N symbol.Finally, to accomplish (iv), we wrote a multi-threaded implementation completely in the Cprogramming language. The parallel code uses OpenMP. For comparison, the implementation ofthe original Diginorm algorithm (included in the khmer-package [20]) features a single-threadeddesign and is written in Python and C ++ ; strings have to be converted between Python and C ++ at least twice. For the experimental evaluation, we collected the following datasets. We use two single celldatasets of the UC San Diego, one of the group of Ute Hentschel (now GEOMAR Kiel) and 10datasets from the JGI Genome Portal. The datasets from JGI were selected as follows. On theJGI Genome Portal [13], we used “single cell” as search term. We narrowed the results down todatasets which had all of the following properties: • status “complete”; • containing read data and an assembly in the download section; • aligning the reads to the assembly using bowtie2 [30] yields an “overall alignment rate” ofmore than 70%.From those datasets, we arbitrarily selected one per species, until we had a collection of 10datasets. We refer to each combination of species and selected dataset as a case in the following.In total, we have 13 cases; the details are given in Table 1. Short Name Species/Description Source URLASZN2 Candidatus Poribacteria sp. WGA-4E_FD Hentschel Group [28] [7]Aceto Acetothermia bacterium JGI MDM2 LHC4sed-1-H19 JGI Genome Portal [1]Alphaproteo Alphaproteobacteria bacterium SCGC AC-312_D23v2 JGI Genome Portal [2]Arco Arcobacter sp. SCGC AAA036-D18 JGI Genome Portal [3]Arma Armatimonadetes bacterium JGI 0000077-K19 JGI Genome Portal [4]Bacteroides Bacteroidetes bacVI JGI MCM14ME016 JGI Genome Portal [5]Caldi Calescamantes bacterium JGI MDM2 SSWTFF-3-M19 JGI Genome Portal [6]Caulo Caulobacter bacterium JGI SC39-H11 JGI Genome Portal [8]Chloroﬂexi Chloroﬂexi bacterium SCGC AAA257-O03 JGI Genome Portal [9]Crenarch Crenarchaeota archaeon SCGC AAA261-F05 JGI Genome Portal [10]Cyanobact Cyanobacteria bacterium SCGC JGI 014-E08 JGI Genome Portal [11]E.coli E.coli K-12, strain MG1655, single cell MDA, Cell one UC San Diego [14]SAR324 SAR324 (Deltaproteobacteria) UC San Diego [14]

Table 1.

Selected Species and Datasets (Cases) Q P { , , , , , , , . . . , } . Analysis is done on the one hand in terms ofdata reduction, quality, and coverage. On the other hand, we study actual assemblies that arecomputed with SPAdes [16] based on the raw and ﬁltered datasets. All the details are given inthe next section.The dimensions of the count-min sketch are ﬁxed to m “ t “

10, thus 10 GB of RAMwhere used.

We do analysis in large parts by looking at percentiles and quartiles. The i th quartile is denotedQ i , where we use Q0 for the minimum, Q2 for the median, and Q4 for the maximum. The i thpercentile is denoted P i ; we often use the 10th percentile P10. Statistics for the number of accepted reads are given as a boxplot in Figure 1a on the nextpage. This plot is constructed as follows. Each of the blue boxes corresponds to Bignorm with aparticular Q , while Diginorm is represented as the wide orange box in the background (recallthat Diginorm does not consider quality values). Note that the “whiskers” of Diginorm’s boxare shown as light-orange areas. For each box, for each case the raw dataset is ﬁltered usingthe algorithm and algorithmic parameters corresponding to that box, and the percentage of theaccepted reads is taken into consideration. So for example, if the top of a box (which correspontsto the 3rd quartile, also denoted Q3) gives the value x %, then we know that for 75% of thecases, x % or less of the reads were accepted using the algorithm and algorithmic parameterscorresponding to the box.There are two prominent outliers: one for Diginorm with value «

29% (shown as the red line atthe top) and one for Bignorm for Q “ « ď Q , even Bignorm’soutliers fall below Diginorm’s median, and for 18 ď Q Bignorm keeps less than 5% of thereads for at least 75% of the datasets. In the range 20 ď Q ď

25, Bignorm delivers similarresults for the diﬀerent Q , and the gain in reduction for larger Q is small up to Q “ Q , there is another jump in reduction, but we will see that coverage andthe quality of the assembly suﬀer too much in that range. We conjecture that in the range18 ď Q ď

32, we remove most of the actual errors, whereas for larger Q we also remove usefulinformation. 9 l l l l l ll l l Bignorm using Q set to P e r c en t age o f r ead s k ep t diginorm: Min / Maxdiginorm: 1st / 3rd Quartilediginorm: Mediandiginorm: OutlierBignorm (a) Percentage of accepted reads (i. e., reads kept) overall datasets. ll lll

Bignorm using Q set to D i s t r i bu t i on o f m ean P h r ed S c o r e a ft e r no r m a li z a t i on diginorm: Min / Maxdiginorm: 1st / 3rd Quartilediginorm: MedianBignorm (b) Mean quality values of the accepted reads over alldatasets.

Figure 1.

Boxplots showing reduction and quality statistics.

Statistics for phred quality scores in the ﬁltered datasets are given in Figure 1b on this page.The data was obtained using fastx_quality_stats from the FASTX Toolkit [12] on the ﬁlteredfastq ﬁles and calculating the mean phred quality scores over all read positons for each dataset.Looking at the statistics for these overall means, for 15 ď Q , Bignorm’s median is betterthan Diginorm’s maximum. For 20 ď Q , this eﬀect becomes even stronger. For all Q values,Bignorm’s minimum is clearly above Diginorm’s median. Note that 10 units more meansreducing error probability by factor 10.In Table 2, we give quartiles of mean quality values for the raw datasets and Bignorm’sdatasets produced with Q “

20. Bignorm improves slightly on the raw dataset in all ﬁvequartiles.Of course, all this could be explained by Bignorm simply cutting away any low-quality reads.However, the data in the next section suggests that Bignorm may in fact be more careful thanthis. 10uartile Bignorm rawQ4 (max) 37.82 37.37Q3 37.33 36.52Q2 (median) 33.77 32.52Q1 31.91 30.50Q0 (min) 26.14 24.34

Table 2.

Comparing quality values for the raw dataset and Bignorm with Q “ In Figure 2 on page 13 we see statistics for the coverage. The data was obtained by remappingthe ﬁltered reads onto the assembly from the JGI using bowtie2 and then using coverageBed from the bedtools [33] and R [34] for the statistics. In Figure 2a, the mean is considered. For15 ď Q , Bignorm reduces the coverage heavily. For 20 ď Q , Bignorm’s Q3 is below Diginorm’sQ1. This may raise the concern that Bignorm could create areas with insuﬃcient coverage.However, in Figure 2b, we look at the 10th percentile (P10) of the coverage instead of themean. We consider this statistics as an indicator for the impact of the ﬁltering on areas withlow coverage. For Q ď

25, Bignorm’s Q3 is on or above Diginorm’s maximum, and Bignorm’sminimum coincides with Diginorm’s (except for Q “

10, where we are slightly below). Interms of median, both algorithms are very similar for Q ď

25. We consider all this as a strongindication that we cut away in the right places.For 28 ď Q , there is a clear drop in coverage, so we do not recommend such Q val-ues.In Table 3, we give coverage statistics for each dataset. The reduction compared to theraw dataset in terms of mean, P90, and maximum is substantial. But also the improve-ment of Bignorm over Diginorm in mean, P90, and maximum is considerable for mostdatasets. The quality and signiﬁcance of read ﬁltering is subject to complete assemblies, which is the ﬁnal“road test” of algorithms. For each case, we do an assembly with SPAdes using the raw datasetand those ﬁltered with Diginorm and Bignorm for a selection of Q values. The assemblies arethen analyzed using quast [25] and the assembly from the JGI as reference. Statistics for fourcases are shown in Figure 3. We give the quality measures N50, genomic fraction, and largestcontig, and in addition the overall running time (pre-processing plus assembly). Each measureis given in percent relative to the raw dataset.11 ataset Algorithm P10 mean P90 maxAceto Bignorm 6 132 216 6801Diginorm 7 171 295 12020raw 15 9562 17227 551000Alphaproteo Bignorm 10 43 92 884Diginorm 7 173 481 6681raw 25 5302 14070 303200Arco Bignorm 1 98 54 2103Diginorm 1 362 200 6114raw 3 10850 4091 220600Arma Bignorm 8 23 32 358Diginorm 8 79 141 5000raw 17 629 1118 31260ASZN2 Bignorm 40 70 83 2012Diginorm 23 143 354 3437raw 50 1738 4784 43840Bacteroides Bignorm 3 74 90 6768Diginorm 3 123 205 7933raw 7 6051 8127 570900Caldi Bignorm 25 63 110 786Diginorm 15 67 135 3584raw 27 1556 3643 33530Caulo Bignorm 7 228 216 10400Diginorm 8 362 491 35520raw 8 10220 9737 464300Chloroﬂexi Bignorm 8 72 101 2822Diginorm 9 412 878 20850raw 9 5612 7741 316900Crenarch Bignorm 8 104 159 3770Diginorm 10 560 1285 29720raw 10 8086 14987 316700Cyanobact Bignorm 9 144 153 5234Diginorm 10 756 1450 26980raw 10 9478 11076 356600E.coli Bignorm 37 45 56 234Diginorm 50 382 922 7864raw 112 2522 6378 56520SAR324 Bignorm 24 49 71 1410Diginorm 18 53 107 2473raw 26 1086 2761 106000 Table 3.

Coverage statistics for Bignorm with Q “

20, Diginorm, and the raw datasets. l l l l l ll ll ll l Bignorm using Q set to D i s t r i bu t i on o f m ean c o v e r age a ft e r no r m a li z a t i on diginorm: Min / Maxdiginorm: 1st / 3rd Quartilediginorm: MedianBignorm (a) Mean coverage over all datasets. llll lll lll ll ll

Bignorm using Q set to D i s t r i bu t i on o f % P e r c en t il e c o v e r age a ft e r no r m a li z a t i on diginorm: Min / Maxdiginorm: 1st / 3rd Quartilediginorm: MedianOutlierBignorm (b) Figure 2.

Boxplots showing coverage statistics.

Generally, our biggest improvements are for N50 and running time. For 15 ď Q , Bignormis always faster than Diginorm, for three of the four cases by a large margin. In terms ofN50, for 15 ď Q we observe improvements for three cases. For E.coli, Diginorm’s N50 is100%, that we also attain for Q “

20. In terms of genomic fraction and largest contig, wecannot always attain the same quality as Diginorm; the biggest deviation at Q “

20 is 10percentage points for the ASZN2 case. The N50 is generally accepted as one of the mostimportant measures, as long as the assembly respresents the genome well (as mesured here bythe genomic fraction) [23].In Table 4, we give statistics for Q “

20 and each case. In terms of genomic fraction, Bignormis generally not as good as Diginorm. However, excluding the Aceto and Arco cases, Bignorm’sgenomic fraction is still always at least 95%. For Aceto and Arco, Bignorm misses 3 .

21% and3 . ataset Algorithm reads kept mean phred contigs ﬁlter time SPAdes timein % score ě

10 000 in sec in secAceto Bignorm 3.16 37.33 1 906 1708Diginorm 3.95 27.28 1 3290 4363raw 36.52 3 47813Alphaproteo Bignorm 3.13 34.65 18 623 420Diginorm 7.81 28.73 17 1629 11844raw 33.64 17 29057Arco Bignorm 2.20 33.77 4 429 207Diginorm 8.76 21.39 6 1410 1385raw 32.27 6 15776Arma Bignorm 7.90 28.21 44 240 135Diginorm 29.30 21.19 50 588 1743raw 26.96 44 5371ASZN2 Bignorm 5.66 37.66 118 1224 1537Diginorm 12.62 32.73 130 5125 21626raw 36.85 112 47859Bacteroides Bignorm 2.85 37.47 6 653 3217Diginorm 4.94 27.64 5 2124 3668raw 37.25 9 32409Caldi Bignorm 3.97 37.82 41 842 455Diginorm 5.61 30.67 36 1838 793raw 37.37 38 7563Caulo Bignorm 2.40 36.95 10 679 712Diginorm 4.70 25.16 9 2584 765raw 36.01 13 18497Chloroﬂexi Bignorm 1.40 31.91 32 694 134Diginorm 9.70 18.91 33 2304 1852raw 30.50 34 15108Crenarch Bignorm 1.46 33.18 19 1107 790Diginorm 9.72 19.80 18 2931 3754raw 31.49 26 20590Cyanobact Bignorm 1.65 30.45 12 679 450Diginorm 11.30 17.58 13 1487 1343raw 28.49 13 9417E. coli Bignorm 1.91 26.14 67 2279 598Diginorm 17.03 19.34 63 9105 3995raw 24.34 64 16706SAR324 Bignorm 4.34 33.05 55 1222 708Diginorm 4.69 23.58 52 3706 3085raw 32.52 51 26237

Table 4.

Filter and assembly statistics for Bignorm with Q “

20, Diginorm and the raw datasets (I) ataset Algorithm N50 Longest Contig Length Genomic Fraction Misassembled Contig Lengthabs % ofraw % ofDiginorm abs % ofraw % ofDiginorm abs % ofraw % ofDiginorm abs % ofraw % ofDiginormBignorm 2324 79 105 11525 98 100 91 97 97 52487 148 178Aceto Diginorm 2216 76 11525 98 94 100 29539 84raw 2935 11772 94 35351Bignorm 11750 94 115 43977 91 95 98 101 105 52001 120 89Alphaproteo Diginorm 10213 82 46295 95 93 95 58184 134raw 12446 48586 98 43388Bignorm 3320 81 97 12808 57 57 85 100 97 76797 99 91Arco Diginorm 3434 84 22463 100 88 103 84613 109raw 4092 22439 85 77888Bignorm 18432 102 107 108140 100 100 98 100 100 774291 91 103Arma Diginorm 17288 96 108498 100 98 100 748560 88raw 18039 108498 98 849085Bignorm 19788 91 88 72685 71 88 97 99 99 2753167 94 105ASZN2 Diginorm 16591 76 82687 81 97 100 2617095 89raw 21784 102287 97 2941524Bignorm 3356 68 100 25300 100 100 95 98 99 70206 105 112Bacteroides Diginorm 3356 68 25300 100 96 99 62882 94raw 4930 25299 98 66626Bignorm 50973 82 83 143346 89 91 100 100 100 573836 94 68Caldi Diginorm 61108 98 157479 98 100 100 839126 138raw 62429 160851 100 609604Bignorm 4515 69 95 20255 100 107 96 98 98 60362 86 113Caulo Diginorm 4729 72 18907 93 98 101 53456 76raw 6562 20255 97 70161Bignorm 13418 102 109 79605 102 102 99 100 100 666519 95 93Chloroﬂexi Diginorm 12305 93 78276 100 100 100 716473 102raw 13218 78276 99 703171Bignorm 6538 77 91 31401 81 66 97 99 99 484354 89 95Crenarch Diginorm 7148 84 47803 124 98 100 510256 94raw 8501 38582 98 544763Bignorm 5833 95 99 33462 98 100 99 101 100 236391 113 110Cyanobact Diginorm 5907 96 33516 98 99 101 214574 103raw 6130 34300 98 209269Bignorm 112393 100 100 268306 94 94 96 100 100 28966 65 65E. coli Diginorm 112393 100 285311 100 96 100 44465 100raw 112393 285528 96 44366Bignorm 135669 100 114 302443 100 100 99 100 100 4259479 98 100SAR324 Diginorm 119529 88 302443 100 99 100 4264234 98raw 136176 302442 99 4342602 Table 5.

Filter and assembly statistics for Bignorm with Q “

20, Diginorm and the raw datasets (II)

Table 6.

Quartiles for comparison of mean phred score, ﬁlter and assembly time in %.

The quality parameter Q that Bignorm introduces over Diginorm has shown to have a strongimpact on the number of reads kept, coverage, and quality of the assembly. An upper bound of Q ď

25 for a reasonable Q was obtained by considering the 10th percentile of the coverage(Figure 2b). With this constraint in mind, in order to have a small number of reads kept,Figure 1a suggests 18 ď Q ď

25. Given that N50 for E.coli starts to decline at Q “ Q “

20 as the recommended value. As seen in detail in Table 4, Q “

20 gives good assemblies for all 13 cases. The gain in speed is considerable: in terms ofmedian we only require 31% and 18% of Diginorm’s time for ﬁltering and assembly, respectively.This speedup generally comes at the price of a smaller genomic fraction and smaller largestcontig, although those diﬀerences are relatively small.

For 13 bacteria single cell datasets, we have shown that good and fast assemblies are possible,based on only 5% of the reads in most of the cases (and on less than 10% of the reads in allof the cases). The ﬁltering process, using our new algorithm Bignorm, also works fast andmuch faster than Diginorm. Like Diginorm, we use a count-min sketch for counting k -mers, soour the memory requirements are relatively small and known in advance. We provide tuningfor the quality parameter Q and recommend to use Q “

20 in practice. We refrained fromtuning the other parameters c , c that are used to deﬁne the contributions b p s q and b p s q ,as well as the N -count threshold N and contribution threshold B . We expect that tuning ofthose parameters will help to obtain assemblies of higher quality and intend to do so in futurework. 16 ceto Alphaproteo ASZN2 E. coli102030405060708090100110 10 20 30 40 10 20 30 40 10 20 30 40 10 20 30 40 Bignorm using Q set to P e r c en t age Measure

Largest contigN50Genomic fractionOverall run time id Bignormdiginorm

Figure 3.

Statistics for the assemblies of four selected datasets.

References [1] Acetothermia bacterium JGI MDM2 LHC4sed-1-H19. URL: http://genome.jgi.doe.gov/AcebacLHC4se1H19_FD/AcebacLHC4se1H19_FD.info.html .[2] Alphaproteobacteria bacterium SCGC AC-312_D23v2. URL: http://genome.jgi.doe.gov/AlpbacA312_D23v2_FD/AlpbacA312_D23v2_FD.info.html .173] Arcobacter sp. SCGC AAA036-D18. URL: http://genome.jgi.doe.gov/ArcspSAAA036D18_FD/ArcspSAAA036D18_FD.info.html .[4] Armatimonadetes bacterium JGI 0000077-K19. URL: http://genome.jgi.doe.gov/Armbac0000077K19_FD .[5] Bacteroidetes bacVI JGI MCM14ME016. URL: http://genome.jgi.doe.gov/BacbacMCM14ME016_FD .[6] Calescamantes bacterium JGI MDM2 SSWTFF-3-M19. URL: http://genome.jgi.doe.gov/CalbacSSWTFF3M19_FD .[7] Candidatus Poribacteria sp. WGA-4E. URL: http://genome.jgi.doe.gov/CanPorspWGA4E_FD .[8] Caulobacter bacterium JGI SC39-H11. URL: http://genome.jgi.doe.gov/CaubacJGISC39H11_FD .[9] Chloroﬂexi bacterium SCGC AAA257-O03. URL: http://genome.jgi.doe.gov/ChlbacSAAA257O03_FD .[10] Crenarchaeota archaeon SCGC AAA261-F05. URL: http://genome.jgi.doe.gov/CrearcSAAA261F05_FD .[11] Cyanobacteria bacterium SCGC JGI 014-E08. URL: http://genome.jgi.doe.gov/CyabacSJGI014E08_FD .[12] FASTX-Toolkit. URL: http://hannonlab.cshl.edu/fastx_toolkit/ .[13] JGI Genome Portal - Home. URL: http://genome.jgi.doe.gov .[14] Single cell data sets. URL: http://bix.ucsd.edu/projects/singlecell/nbt_data.html .[15] Andy S. Alic, David Ruzafa, Joaquin Dopazo, and Ignacio Blanquer. Objective reviewof de novo stand-alone error correction methods for NGS data.

Wiley InterdisciplinaryReviews: Computational Molecular Science , 6(2):111–146, 2016. doi:10.1002/wcms.1239 .[16] Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin,Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski,Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev,and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applicationsto Single-Cell Sequencing.

Journal of Computational Biology , 19(5):455–477, Apr 2012. doi:10.1089/cmb.2012.0021 .[17] Anthony M. Bolger, Marc Lohse, and Bjoern Usadel. Trimmomatic: A ﬂexible trimmerfor Illumina Sequence Data.

Bioinformatics , 30(15):2114–2120, 2014. doi:10.1093/bioinformatics/btu170 .[18] Graham Cormode and S. Muthukrishnan. An improved data stream summary: thecount-min sketch and its applications.

Journal of Algorithms , 55(1):58–75, 2005. doi:10.1016/j.jalgor.2003.12.001 . 1819] Murray P. Cox, Daniel A. Peterson, and Patrick J. Biggs. SolexaQA: At-a-glance qualityassessment of Illumina second-generation sequencing data.

BMC Bioinformatics , 11(1):1–6,2010. doi:10.1186/1471-2105-11-485 .[20] Michael Crusoe, Greg Edvenson, Jordan Fish, Adina Howe, Eric McDonald, JoshuaNahum, Kaben Nanlohy, Humberto Ortiz-Zuazaga, Jason Pell, Jared Simpson, CamilleScott, Ramakrishnan Rajaram Srinivasan, Qingpeng Zhang, and C. Titus Brown. Thekhmer software package: enabling eﬃcient sequence analysis. pages 1–3, 04 2014. doi:10.6084/m9.figshare.979190 .[21] Cristian Del Fabbro, Simone Scalabrin, Michele Morgante, and Federico M. Giorgi. AnExtensive Evaluation of Read Trimming Eﬀects on Illumina NGS Data Analysis.

PLoSONE , 8(12):1–13, 12 2013. doi:10.1371/journal.pone.0085024 .[22] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. AReliable Randomized Algorithm for the Closest-Pair Problem.

Journal of Algorithms ,25(1):19–51, 1997. doi:10.1006/jagm.1997.0873 .[23] Dent Earl, Keith Bradnam, John St John, Aaron Darling, Dawei Lin, Joseph Fass, HungOn Ken Yu, Vince Buﬀalo, Daniel R Zerbino, Mark Diekhans, et al. Assemblathon 1:A competitive assessment of de novo short read assembly methods.

Genome research ,21(12):2224–2241, 2011. doi:10.1101/gr.126599.111 .[24] Sante Gnerre, Iain Maccallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton,Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, Aaron M Berlin,Daniel Aird, Maura Costello, Riza Daza, Louise Williams, Robert Nicol, Andreas Gnirke,Chad Nusbaum, Eric S Lander, and David B Jaﬀe. High-quality draft assemblies ofmammalian genomes from massively parallel sequence data.

Proc Natl Acad Sci U S A ,108(4):1513–1518, 1 2011. doi:10.1073/pnas.1017351108 .[25] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: qualityassessment tool for genome assemblies.

Bioinformatics , 29(8):1072–1075, 2013. doi:10.1093/bioinformatics/btt086 .[26] NA Joshi and JN Fass. Sickle: A sliding-window, adaptive, quality-based trimming toolfor FastQ ﬁles (Version 1.33). Available at https://github.com/najoshi/sickle , 2011.[27] Rei Kajitani, Kouta Toshimoto, Hideki Noguchi, Atsushi Toyoda, Yoshitoshi Ogura, MikiOkuno, Mitsuru Yabana, Masahira Harada, Eiji Nagayasu, Haruhiko Maruyama, YujiKohara, Asao Fujiyama, Tetsuya Hayashi, and Takehiko Itoh. Eﬃcient de novo assembly ofhighly heterozygous genomes from whole-genome shotgun short reads.

Genome Research ,pages 1384–1395, 2014. doi:10.1101/gr.170720.113 .[28] Janine Kamke, Alexander Sczyrba, Natalia Ivanova, Patrick Schwientek, Christian Rinke,Kostas Mavromatis, Tanja Woyke, and Ute Hentschel. Single-cell genomics reveals complex19arbohydrate degradation patterns in poribacterial symbionts of marine sponges.

TheISME journal , 7(12):2287–2300, 2013. doi:10.1038/ismej.2013.111 .[29] David R Kelley, Michael C Schatz, and Steven L Salzberg. Quake: quality-aware de-tection and correction of sequencing errors.

Genome Biol , 11(11):1–13, 2010. doi:10.1186/gb-2010-11-11-r116 .[30] Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with Bowtie 2.

NatMeth , 9(4):357–359, 4 2012. Brief Communication. doi:10.1038/nmeth.1923 .[31] Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencingreads.

EMBnet. journal , 17(1):10–12, 2011. doi:10.14806/ej.17.1.200 .[32] Nicola Prezza, Cristian Del Fabbro, Francesco Vezzi, Emanuale De Paoli, and AlbertoPolicriti. ERNE-BS5: Aligning BS-treated Sequences by Multiple Hits on a 5-lettersAlphabet. In

Proceedings of the ACM Conference on Bioinformatics, ComputationalBiology and Biomedicine , BCB ’12, pages 12–19, New York, NY, USA, 2012. ACM. doi:10.1145/2382936.2382938 .[33] Aaron R. Quinlan and Ira M. Hall. BEDTools: a ﬂexible suite of utilities for comparinggenomic features.

Bioinformatics , 26(6):841–842, 2010. doi:10.1093/bioinformatics/btq033 .[34] R Core Team.

R: A Language and Environment for Statistical Computing . R Foundationfor Statistical Computing, Vienna, Austria, 2016. Available at ,Version 3.3.0.[35] Robert Schmieder and Robert Edwards. Quality control and preprocessing of metagenomicdatasets.

Bioinformatics , 27(6):863–864, 2011. doi:10.1093/bioinformatics/btr026 .[36] Linnéa Smeds and Axel Künstner. ConDeTri - A Content Dependent Read Trimmer forIllumina Data.

PLoS ONE , 6(10):1–6, 10 2011. doi:10.1371/journal.pone.0026314 .[37] C. Titus Brown, A. Howe, Q. Zhang, A. B. Pyrkosz, and T. H. Brom. A Reference-FreeAlgorithm for Computational Normalization of Shotgun Sequencing Data.

ArXiv e-prints ,pages 1–18, March 2012. arXiv:1203.4802 .[38] Philipp Wölfel.

Über die Komplexität der Multiplikation in eingeschränkten Branchingpro-grammmodellen . PhD thesis, Universität Dortmund, Fachbereich Informatik, 2003.[39] Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, and C. TitusBrown. These Are Not the K-mers You Are Looking For: Eﬃcient Online K-mer CountingUsing a Probabilistic Data Structure.

PLoS ONE , 9(7):1–13, 07 2014. doi:10.1371/journal.pone.0101271doi:10.1371/journal.pone.0101271