Accurate selfcorrection of errors in long reads using de Bruijn graphs
arXiv preprint [q-bio.GN]
Leena Salmela, Riku Walve, Eric Rivals, Esko Ukkonen

Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki: {lmsalmel,riqwalve,ukkonen}@cs.helsinki.fi
LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, Montpellier: [email protected]
Abstract
New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50,000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilisation of the reads in, e.g., de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads. We present an error correction method that uses long reads only. The method consists of two phases: first we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second we further polish the reads using multiple alignments between reads selected with the de Bruijn graph. Our experiments show that the resulting method is currently the most accurate error correction method relying on long reads only.

Introduction

With the diminishing costs, high throughput DNA sequencing has become a commonplace technology in biological research. Whereas the second generation sequencers produced short but quite accurate reads, new technologies such as Pacific Biosciences and Oxford NanoPore are producing reads up to 50,000 bp long but with an error rate of at least 15%. Although the long reads have proven to be very helpful in applications like genome assembly [11, 15], the error rate poses a challenge for the utilisation of this data.

Many methods have been developed for correcting short reads [23, 12] but these methods are not directly applicable to long reads because of their much higher error rate. Moreover, most research on short read error correction has concentrated on mismatches, the dominant error type in Illumina data, whereas in long reads indels are more common. Recently several methods for the error correction of long reads have also been developed. These methods fall into two categories: either the highly erroneous long reads are selfcorrected by aligning them against each other, or a hybrid strategy is adopted in which the long reads are corrected using accurate short reads that are assumed to be available.
Most error correction tools for long reads, like proovread [9], LoRDEC [20], LSC [1], and Jabba [16], are hybrid methods. PBcR [10, 3] is a tool that can employ either the hybrid or the selfcorrection strategy.

Most hybrid methods, like PBcR, LSC, and proovread, are based on the mapping approach. They first map the short reads on the long reads and then correct the long reads according to a consensus built on the mapped short reads. PBcR extends this strategy to selfcorrection of PacBio reads by computing overlaps between the long reads using probabilistic locality-sensitive hashing and then correcting the reads according to a consensus built on the overlapping reads. As the mapping of short reads is time and memory consuming, LoRDEC avoids the mapping phase by building a de Bruijn graph (DBG) of the short reads and then threading the long reads through this graph to correct them. Jabba is a recent tool that is also based on building a DBG of short reads. While LoRDEC finds matches of complete k-mers in the long reads, Jabba searches for maximal exact matches between the k-mers and the long reads, allowing it to use a larger k in the DBG.

In this paper, we present a selfcorrection method for long reads that is based on de Bruijn graphs and multiple alignments. First our method performs an initial correction that is similar to LoRDEC, but uses only long reads and performs iterative correction rounds with longer and longer k-mers. This phase considers only the local context of errors and hence it misses the long-distance dependency information available in the long reads. To capture such dependencies, the second phase of our method uses multiple alignments between carefully selected reads to further improve the error correction.

Our experiments show that our method is currently the most accurate one relying on long reads only. The error rate of the reads after our error correction is less than half of the error rate of reads corrected by PBcR using long reads only.
Furthermore, when the coverage of the read set is at least 75x, the size of the corrected read set of our method is at least 20% higher than for PBcR.

LoRDEC

LoRDEC [20] is a hybrid method for the error correction of long reads. It presents the short reads in a de Bruijn graph (DBG) and then maps the long reads to the graph. The DBG of a read set is a graph whose nodes are all k-mers occurring in the reads, and there is an edge between two nodes if the corresponding k-mers overlap by k − 1 characters. LoRDEC classifies the k-mers of the long reads as solid if they are in the DBG and weak otherwise. The correction then proceeds by replacing the weak areas of the long reads by solid ones. This is done by searching paths in the DBG between solid k-mers to bridge the weak areas between them. If several paths are found, the path with the shortest edit distance as compared to the weak region is chosen to be the correct sequence, which replaces the weak region of the long read. The weak heads and tails of the long reads are the extreme regions of the reads that are bordered by just one solid k-mer in the beginning (resp. end) of the read. LoRDEC attempts to correct these regions by starting a path search from the solid k-mer and choosing a sequence that is as close as possible to the weak head or tail.

Repetitive regions of the genome can make the DBG tangled. The path search in these areas of the DBG can then become intractable. Therefore LoRDEC employs a limit on the number of branches it explores during the search. If this limit is exceeded, LoRDEC checks if at least one path within the maximum allowed error rate has been found and then uses the best path found for correction. If no such path has been found, LoRDEC starts a path search similar to the correction of the head and tail of the read, to attempt a partial correction of the weak region. Some segments of the long reads remain erroneous after the correction.
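The bridging step described above can be illustrated with a small sketch (not LoRDEC's actual implementation: the DBG is a plain set of solid k-mers, the branch limit is reduced to a simple cap on the path length, and the helper names are ours):

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def bridge(solid_kmers, src, dst, weak, max_gap):
    """Enumerate sequences spelled by DBG paths from solid k-mer src to solid
    k-mer dst (depth-first, interior length capped by max_gap) and return the
    path whose spelling is closest in edit distance to the weak read region."""
    k = len(src)
    paths, stack = [], [src]
    while stack:
        seq = stack.pop()
        if len(seq) > k and seq.endswith(dst):
            paths.append(seq)        # a candidate bridging sequence
            continue
        if len(seq) - k >= max_gap:
            continue                 # give up on overly long detours
        for c in "ACGT":
            if seq[-(k - 1):] + c in solid_kmers:
                stack.append(seq + c)
    target = src + weak + dst        # the read's version of the bridged region
    return min(paths, key=lambda p: edit_distance(p, target), default=None)
```

For example, with the solid 3-mers of the true sequence ACGTTGCA plus a spurious branch (CGG, GGC), bridging from ACG to GCA across the erroneous read interior "TA" recovers ACGTTGCA, the candidate closest to the read.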
LoRDEC outputs bases in upper case if at least one of the k-mers containing that base is solid, i.e., it occurs in the DBG of the short reads, and in lower case otherwise. For most applications it is preferable to extract only the upper case regions of the sequences, as the lower case bases are likely to contain errors.

Figure 1: Workflow of error correction. LoRDEC∗ is first applied iteratively to the read set, with an increasing k. The corrected reads are further corrected by LoRMA, which uses multiple alignments to find long-distance dependencies in the reads.

Methods

In this section we will show how an error correction procedure similar to LoRDEC can be used to iteratively correct long reads without short read data. We will use LoRDEC∗ to refer to LoRDEC in this long reads only mode. Then we further describe a polishing method to improve the accuracy of correction. Figure 1 shows the workflow of our approach.

To describe how LoRDEC can be adapted for selfcorrection of read sets, let Q be a set of long reads to be corrected, and let the integer h be the abundancy threshold that is used in choosing the k-mers of the DBG. The correction procedure repeats the following steps 1–3 for an increasing sequence k = k_1, ..., k_t:

1. Construct the DBG of set Q using as the nodes the k-mers that occur in Q at least h times;
2. Correct Q using the LoRDEC algorithm with this DBG;
3. Replace Q with the corrected Q.

After the final round, the regions of the reads identified as correct in the last iteration are extracted for further correction with the multiple alignment technique by LoRMA.

As the initial error level is assumed high, the above iterations have to start with a relatively small k = k_1. With a suitable abundancy threshold h, the DBG should then contain most of the correct k-mers (i.e., the k-mers of the target genome) and a few erroneous ones.
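Steps 1–3 can be sketched as follows (a simplified illustration: the LoRDEC correction of step 2 is abstracted into a caller-supplied function, and only the solid k-mer selection of step 1 is spelled out; the default k progression matches the one used in our experiments):

```python
from collections import Counter

def solid_kmers(reads, k, h):
    """Step 1: k-mers occurring at least h times in the read set form the DBG nodes."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= h}

def iterative_correction(reads, correct, ks=(19, 40, 61), h=4):
    """Steps 1-3 repeated for an increasing sequence of k-mer sizes.
    `correct` stands in for the LoRDEC correction of one read against the DBG."""
    for k in ks:
        dbg = solid_kmers(reads, k, h)                # step 1: build DBG nodes
        reads = [correct(r, dbg, k) for r in reads]   # step 2: correct the reads
        # step 3: the corrected reads replace the input for the next round
    return reads
```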
Although path search over long weak regions may not be feasible because of strong branching of the DBG, shorter paths are likely to be found and hence short weak regions can be corrected. After the first round the correct regions in the reads have become longer, because close-by correct regions have been merged whenever a path between them has been found, and thus we can increase k. Then, with increasing k's, the DBG gets less tangled and the path search over the longer weak regions becomes feasible, allowing for the correction of the complete reads. A similar iterative approach has previously been proposed for short read assembly [19, 2].

When the path search is abandoned because of excessive branching, the original LoRDEC algorithm still uses the best path found so far to correct the region. Such a greedy strategy improves correction accuracy in a single run, but in the present iterative approach false corrections start to accumulate. Therefore, we make a correction only if it is guaranteed that the correction is the best one available in the DBG, i.e., all branches have been explored.

The abundancy threshold h controls the quality of the k-mers that are used for correction. In our experiments we used a fixed threshold of h = 4 in all iterations, meaning that the k-mers with less than 4 occurrences in the read set were considered erroneous.

To justify the value of h, we need to analyse how many times a fixed k-mer of the genome is expected to occur without any error in the reads. Then an h that is about one or two standard deviations below the expected value should give a DBG that contains the majority of the correct k-mers and not too many erroneous ones. We will use an analysis similar to Miclotte et al. [16]. Let C_{ℓ≥k} denote the coverage of a genomic k-mer by exact regions of length at least k. Here an exact region refers to a continuous maximal error-free segment of some read in our read set. Figure 2 gives an example of exact regions.
Let us add a $ character to the end of each read, and then consider the concatenation of all these reads. In this sequence an exact region (of length 0 or more) ends either at an error or when encountering the $ character. Let n denote the number of reads, N the length of the concatenation of all reads, and p the error rate. Then the probability for an exact region to end at a given position of the concatenated sequence is q = (pN + n)/(N + n). As the reads are long and the error rate is high, we have q ≈ p. The length of the exact regions is distributed according to the geometric distribution Geom(q) and therefore the probability of an exact region to have length i is P(i) = (1 − q)^i q. The expected number of exact regions is Nq. An exact region is maximal if it cannot be extended to the left or right. Let R_i be the random variable denoting the number of maximal exact regions of length i. Then E(R_i) = Nq · P(i) = Nq^2 (1 − q)^i.

Let C_{ℓ=i} denote the coverage of a k-mer in the genome by maximal exact regions of length i, and let r_i denote the number of maximal exact regions of length i. An exact region of length i, i ≥ k, covers a fixed genomic k-mer (i.e., the read with that exact region is read from the genomic segment containing that k-mer) if the region starts in the genome from the starting location of the k-mer or from one of the i − k locations before it. Assuming that the reads are randomly sampled from the genome, this happens with probability (i − k + 1)/G, where G is the length of the genome. Therefore, C_{ℓ=i} is distributed according to the binomial distribution Bin(r_i, (i − k + 1)/G) (independence of the locations of exact regions is assumed), and the expected coverage of a genomic k-mer by maximal exact regions of length i is

    E(C_{ℓ=i}) = Σ_{r_i=0}^{∞} P(R_i = r_i) · r_i · (i − k + 1)/G
               = ((i − k + 1)/G) · E(R_i)
               = (N/G) q^2 (1 − q)^i (i − k + 1).
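The expectation E(C_{ℓ=i}) just derived can be checked numerically by summing it over i ≥ k, with q approximated by p and N/G being the original coverage. A small sketch (function names are ours):

```python
def expected_exact_coverage(coverage, p, k, terms=2000):
    """E(C_{l>=k}) = (N/G) * sum_{i>=k} q^2 (1-q)^i (i-k+1), summed over the
    first `terms` lengths i >= k, with q approximated by the error rate p and
    N/G being the original coverage of the genome by the read set."""
    q = p
    return coverage * sum(q * q * (1 - q) ** i * (i - k + 1)
                          for i in range(k, k + terms))

# The geometric series collapses to the closed form (N/G) * (1-q)^k.
def expected_exact_coverage_closed(coverage, p, k):
    return coverage * (1 - p) ** k
```

For example, with p = 0.15, k = 19, and 200x coverage this gives 200 · 0.85^19 ≈ 9.12, matching the value quoted below for our first data set.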
By the linearity of expectation, the expected coverage of a genomic k-mer by exact regions of length at least k is

    E(C_{ℓ≥k}) = Σ_{i=k}^{∞} E(C_{ℓ=i}) = (N/G) Σ_{i=k}^{∞} q^2 (1 − q)^i (i − k + 1).

Because (i − k + 1)/G is small, we can approximate the binomial distribution of C_{ℓ=i} with the Poisson distribution. Therefore σ^2(C_{ℓ=i}) = E(C_{ℓ=i}). Assuming that the coverages of a genomic k-mer by maximal exact regions of different lengths are independent, the variance of the coverage by exact regions of length at least k is σ^2(C_{ℓ≥k}) = Σ_{i≥k} σ^2(C_{ℓ=i}) = E(C_{ℓ≥k}).

Figure 3 illustrates E(C_{ℓ≥k}) for various k and q ≈ p, with 100x original coverage of the target. Note that the original coverage of the target genome by the read set is N/G.
Figure 2: Division of a read into maximal exact regions, shown as green areas. The dark green areas givethe regions that could cover a 4-mer.
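To make exact regions concrete, here is a small sketch (an illustration under simplifying assumptions: only substitution errors, so the read and reference align position by position; real reads also contain indels, and the helper names are ours):

```python
import re

def maximal_exact_regions(read, ref):
    """Maximal runs of positions where the (gap-free aligned) read matches the reference."""
    flags = "".join("M" if a == b else "X" for a, b in zip(read, ref))
    return [(m.start(), m.end()) for m in re.finditer("M+", flags)]

def regions_covering_kmer(regions, pos, k):
    """Exact regions that fully cover the k-mer starting at position pos."""
    return [(s, e) for s, e in regions if s <= pos and pos + k <= e]
```

For instance, a read with a single substitution at position 3 splits into two maximal exact regions, and only the second one is long enough to cover a 4-mer.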
Figure 3: Expected coverage of a genomic k-mer by exact regions of length at least k for a read set with coverage 100x for different error rates p.

For the three datasets in our experiments (see Table 1), with coverages 200x, 208x, and 129x, the expected coverage E(C_{ℓ≥k}) has values 9.12, 9.48, and 5.89, respectively, for our initial k = 19 and for our assumed error rate p = 0.15. Hence our adopted threshold h = 4 is from 0.8 to 1.8 standard deviations below the expected coverage, meaning that most of the correct k-mers should be distinguishable from the erroneous ones.

Correction using multiple alignments

The error correction performed by LoRDEC∗ does not make use of long range information contained in the reads. In particular, approximate repeats of the target are collapsed in the DBG into a path with alternative branches. In practice such repeat regions are corrected towards a copy of the repeat, but not necessarily towards the correct copy. However, the correct copy is more likely uncovered because we choose the path that minimises the edit distance between the weak region to be corrected and the sequence spelled out by the path. Therefore, if we have several reads from the same location, the majority of them are likely corrected towards the correct copy.

Our multiple alignment error correction exploits the long range similarity of reads by identifying the reads that are likely to originate from the same genomic location. If the reads contain a repeat area, the most abundant copy of the repeat present in the reads is likely the correct one. Then by aligning the reads with each other we can correct them towards this most abundant copy. The approach we use here bears some similarity to the method used in Coral [21].

As a preprocessing phase for the method, we build a DBG of all the reads using abundancy threshold h = 1 to ensure that all k-mers present in the reads are indexed. Then we enumerate the simple paths of the DBG and find for each read the unique path that spells it out. Each such path is composed of non-overlapping unitig segments that have no branches. We call such segments the parts of a path. We associate to each path segment (i.e., a unitig path of the DBG) a set of triples describing the reads traversing that segment. Each triple consists of read id, part id, and the direction of the read on this path. Hence the path for a read i consists of segments that have a triple with i as the read id and with part id values 1, 2, ..., the path being composed of these segments in the order of the part id value (see Fig. 4). Using this information it is now possible to reconstruct each read from the DBG, except that the reads will be prefixed (suffixed) by the complete simple path that starts (ends) the read.

Figure 4: An example DBG of a set of reads with its paths labelled by triples of the form read id : read part id. For example, the path for read 2 consists of segments with labels 2:1, 2:2, 2:3, 2:4, and 2:5.

In the second phase of our method we take the reads one by one and use the DBG to select reads that are similar to the current read. We follow the path for the current read and gather the set of reads sharing k-mers with it, which can be done using the triples of the augmented DBG. Out of these reads we then first select each read R such that the shared k-mers span at least 80% of the shorter one of the read R and the current read. Furthermore, out of these reads we select those that share the most k-mers with the current read. We call this read set the friends of the current read. The number of selected reads is a parameter of our method (by default 7).

We then proceed to compute a multiple alignment of the current read and its friends. To keep the running time feasible, we use the same simple method as in Coral [21]. First the current read is set to be the initial consensus.
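The friend selection can be sketched as follows (a simplified stand-in: candidate reads are compared directly by k-mer content, whereas LoRMA gathers candidates through the read-path triples of the augmented DBG; the names and the brute-force scan are ours):

```python
def kmers(read, k):
    """The set of k-mers occurring in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def select_friends(current, candidates, k, span_frac=0.8, n_friends=7):
    """Keep candidates whose k-mers shared with `current` span at least
    span_frac of the shorter read, then take the n_friends sharing the most."""
    cur = kmers(current, k)
    scored = []
    for cand in candidates:
        shared = cur & kmers(cand, k)
        pos = [i for i in range(len(cand) - k + 1) if cand[i:i + k] in shared]
        if not pos:
            continue
        span = pos[-1] + k - pos[0]           # extent covered by shared k-mers
        if span >= span_frac * min(len(current), len(cand)):
            scored.append((len(shared), cand))
    scored.sort(key=lambda t: -t[0])          # most shared k-mers first
    return [cand for _, cand in scored[:n_friends]]
```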
Then we take each friend of the current read one by one, align it against the current consensus using banded alignment, and update the consensus according to the alignment. Finally we inspect every column of the multiple alignment and correct the current read towards the consensus if the consensus is supported by at least two reads.

We implemented the above procedure in a tool called LoRMA (Long Read Multiple Alignments), using the GATB library [8] for the implementation of the DBG.

Table 1: Data sets used in the experiments.
                      E. coli (simulated)    E. coli                Yeast
Reference organism
  Name                Escherichia coli       Escherichia coli       Saccharomyces cerevisiae
  Strain              K-12 substr. MG1655    K-12 substr. MG1655    W303
  Reference sequence  NC 000913              NC 000913              CM001806-CM001823
  Genome size         4.6 Mbp                4.6 Mbp                12 Mbp
PacBio data
  Number of reads     92818                  89481                  261964
  Avg. read length    9997                   10779                  5891
  Coverage            200x                   208x                   129x
Illumina data
  Accession number    -                      ERR022075              SRR567755
  Number of reads     -                      2316613                4503422
  Read length         -                      100                    100
  Coverage            -                      50x                    38x
Results

We ran experiments on three data sets that are detailed in Table 1. The simulated E. coli data set was generated with PBSIM [18] using the following parameters: mean accuracy 85%, average read length 10,000, and minimum read length 1,000. The other two data sets are real data. Although our method works solely on the PacBio reads, the table also includes statistics of the complementary Illumina reads that were used to compare our method against hybrid methods that also need short reads. All experiments were run on 32 GB RAM machines equipped with 8 cores.
In the simulated data set the genomic position from which each read derives is known. Therefore the quality of error correction on the simulated data set is evaluated by aligning the corrected read against the corresponding correct genomic sequence. We allow free deletions in the flanks of the corrected read because the tools trim regions they are not able to correct. To check if the corrected reads align to the correct genomic position, we aligned the corrected reads on the reference genome with BLASR [6], keeping only a single best alignment for each read. The following statistics were computed:

• Size: The relative size of the corrected read set as compared to the original one.
• Error rate: The number of substitutions, insertions, and deletions divided by the length of the correct genomic sequence.
• Correctly aligned: The relative number of reads that align to the same genomic position where the read derives from.

To evaluate the quality of error correction on the real data sets, we used BLASR [6] to align the original and corrected reads on the reference genome. For each read we used only a single best alignment because a correct read should have only one continuous alignment against the reference. Thus chimeric reads will be only partially aligned. We computed the following statistics:

• Size: The relative size of the corrected read set as compared to the original one.
• Aligned: The relative size of the aligned regions as compared to the complete read set.
• Error rate: The number of substitutions, insertions, and deletions in the aligned regions divided by the length of the aligned regions in the reference sequence.
• Genome coverage: The proportion of the genome covered by the aligned regions of the reads.

Together these statistics measure three aspects of the quality of error correction. Size measures the throughput of the method. Aligned and error rate together measure the accuracy of correction. Finally, genome coverage estimates if reads deriving from all regions of the genome are corrected.

Table 2: The progression of k for the iterations of LoRDEC∗. All results are shown for the whole correction process LoRDEC∗+LoRMA.

k progression                                   Size (%)  Aligned (%)  Error rate (%)  Elapsed time (h)
19                                              64.901    99.499       0.294           4.08
19,22,25,28,31                                  66.702    99.302       0.276           12.97
19,22,25,28,31,34,37,40,43,46                   66.630    99.311       0.274           20.65
19,22,25,28,31,34,37,40,43,46,49,52,55,58,61    66.546    99.296       0.271           27.53
19,26,33                                        66.401    99.329       0.274           9.58
19,26,33,40,47                                  66.230    99.298       0.271           13.07
19,26,33,40,47,54,61                            66.144    99.283       0.266           16.08
19,33                                           66.705    99.358       0.277           7.68
19,33,47                                        66.178    99.352       0.268           10.58
19,33,47,61                                     65.991    99.301       0.261           11.92
19,40                                           66.619    99.360       0.272           8.32
19,40,61                                        66.223    99.317       0.257           10.30
We ran experiments on the real E. coli data set to test the effect of parameters on the performance of our method. First we tried several progressions of k in the first phase where LoRDEC∗ is run iteratively. We started all iterations with k = 19 because, given the high error rate of the data, k must be small for correct k-mers to occur in the read data. The results of these experiments are presented in Table 2. With more iterations the size of the corrected read set and the aligned proportion of reads decrease, but the aligned regions are more accurate. The decrease in the size of the corrected read set may be a result of better correction, because PacBio reads have more insertions than deletions. However, the decrease in the aligned proportion of the reads may indicate some accumulation of false corrections. The runtime of the method increases with the number of iterations, but later iterations take less time as the reads have already been partially corrected during the previous rounds. To balance out these effects, we chose to use a moderate number of iterations, i.e. k = 19, 40, 61, by default, which also optimises the error rate of the aligned regions.

LoRMA also builds a DBG of the reads and thus we need to specify k. We investigated the effect of the value of k on the E. coli data set. Table 3 shows the effect of k on the performance of LoRMA. Because the DBG is only used to detect similar reads in LoRMA, the performance is not greatly affected by the choice of k. There is a slight decrease in the throughput of the method as k increases, as well as a slight increase in runtime, but these effects are very modest. For the rest of the experiments we set k = 19.

Another parameter of the method is the size of the set of friends of the current read (-friends parameter). We also tested the effect of this parameter on the E. coli data set. As the optimal value of this parameter might depend on the coverage of the data set, we created several subsets of this data set with different coverage to investigate this. Table 4 shows the results of these experiments.

Table 3: The effect of the k-mer size in LoRMA. All results are shown for the whole correction process LoRDEC∗+LoRMA.

k    Size (%)  Aligned (%)  Error rate (%)  Elapsed time (h)  Memory peak (GB)
19   66.238    99.306       0.256           10.38             17.197
40   66.170    99.309       0.258           10.53             16.958
61   65.941    99.313       0.261           13.87             16.908
We can see that the accuracy of the correction increases as the size of the friends set increases. However, for the data set with the lowest coverage, 75x, the coverage of the genome by the corrected reads decreases when the size of the friends set is increased, indicating that lower coverage areas are not well corrected. We can also see that increasing the size of the friends set increases the running time of the method. In the interest of keeping the running time reasonable, we decided to set the default value of the parameter at a fairly low value, 7.
We compared our new method against PBcR [10, 3], which is to the best of our knowledge the only previous selfcorrection method for long reads, and against LoRDEC [20], proovread [9], and Jabba [16], which also use short complementary reads. Table 5 shows the results on the simulated data set comparing our new method to PBcR using long reads only. Table 6 shows the results of the comparison of our new method against previous methods on the real data sets. In the following we will use LoRDEC to refer to the hybrid correction method using also short reads, and LoRDEC∗+LoRMA for our new method in which LoRDEC∗ is run in long reads selfcorrection mode followed by LoRMA.

The PBcR pipeline from Celera Assembler version 8.3rc2 was run without the assembly phase and with memory limited to 16 GB. PBcR was run both using only PacBio reads and utilising also the short read data. For PBcR utilising also short read data, the PacBio reads were divided into three subsets, each of which was corrected in its own run. Proovread v2.12 was run with the sequence/fastq files chunked to 20M as per the usage manual and used 16 mapping threads. LoRDEC used an abundancy threshold of 3 and the k-mer size was set to 19, similar to the experiments by Salmela and Rivals [20]. Jabba 1.1.0 used k-mer size 31 and short output mode. LoRMA was run with 6 threads. The k-mer sizes for the LoRDEC∗+LoRMA iteration steps were chosen as 19, 40 and 61. For proovread and LoRDEC we present results for trimmed and split reads.

Table 5 shows that on the simulated data both PBcR and LoRDEC∗+LoRMA are able to correct most of the data. Our new method achieves a lower error rate and higher throughput. We see that the fraction of corrected reads aligning to the correct genomic position is lower for LoRDEC∗+LoRMA than for PBcR when all reads are considered, which suggests that LoRDEC∗+LoRMA tends to overcorrect some reads.
However, for corrected reads longer than 2000 bp this difference disappears, and thus we can conclude that the overcorrected reads are short. When compared to the other selfcorrection method, PBcR, our new tool has a higher throughput and produces more accurate results on both real data sets, as shown in Table 6. Out of the hybrid methods, Jabba has a lower error rate than LoRDEC∗+LoRMA but its throughput is lower. When compared to the other hybrid methods, LoRDEC∗+LoRMA has comparable accuracy and throughput. All hybrid methods produce corrected reads that do not cover the whole E. coli reference, which could be a result of coverage bias in the Illumina data. On the yeast data proovread produced few corrected reads and thus the coverage of the corrected reads is very low.

Table 6 also shows that our method is slower and uses more memory than PBcR in selfcorrection mode, but its disk usage is lower. On the E. coli data set our new method is faster than proovread and PBcR utilising short read data, but slower than LoRDEC, Jabba, or PBcR using only PacBio data. On the yeast data set we are faster than PBcR in hybrid mode but slower than the others. On the E. coli and yeast data sets, LoRDEC∗+LoRMA uses 45% and 37%, respectively, of its running time on the LoRDEC∗ iterations. On both data sets the error rate of the reads after the LoRDEC∗ iterations and trimming was 0.5%.

Table 4: The effect of the size of the friends set on the quality of the correction. All results are shown for the whole correction process LoRDEC∗+LoRMA. For each coverage (75x, 100x, and 175x) the table reports Size (%), Aligned (%), Error rate (%), Gen. cov. (%), Elapsed time (h), Memory (GB), and Disk (GB) for friends set sizes 5, 7, 10, 15, and 20.

Table 5: Comparison of LoRDEC∗+LoRMA against PBcR (PacBio only) on the simulated E. coli data set.

Tool                Size (%)  Error rate (%)  Correctly    Correctly aligned  Elapsed   Memory     Disk
                                              aligned (%)  ≥ 2000 bp (%)      time (h)  peak (GB)  peak (GB)
Original            100.000   13.015          99.997       99.997             -         -          -
PBcR (PacBio only)  92.457    0.604           99.953       99.984             2.63      9.066      17.823
LoRDEC∗+LoRMA       94.372    0.109           96.866       99.987             14.30     17.338     3.192

Table 6: Comparison of both hybrid and selfcorrection tools on PacBio data. LoRDEC, proovread, PBcR (with Illumina), and Jabba utilise also Illumina data.

E. coli
Tool                  Size (%)  Aligned (%)  Error rate (%)  Genome cov. (%)  Elapsed time (h)  Memory peak (GB)  Disk peak (GB)
Original              100.000   71.108       16.9126         100.000          -                 -                 -
LoRDEC                65.672    98.944       0.1143          99.820           0.96              0.368             1.570
proovread             61.590    98.603       0.2789          99.728           28.65             9.522             7.174
PBcR (with Illumina)  52.103    98.507       0.0682          98.769           15.13             17.429            160.154
Jabba                 2.873     99.945       0.0003          99.745           0.02              0.168             0.606
PBcR (only PacBio)    51.068    86.023       0.6905          100.000          1.68              22.00             16.070
LoRDEC∗+LoRMA         66.223    99.318       0.2572          100.000          10.40             16.984            2.824

Yeast
Tool                  Size (%)  Aligned (%)  Error rate (%)  Genome cov. (%)  Elapsed time (h)  Memory peak (GB)  Disk peak (GB)
Original              100.000   89.929       16.8442         99.974           -                 -                 -
LoRDEC                75.522    97.337       0.9987          99.833           3.17              0.451             2.776
proovread             0.306     97.156       0.8004          20.346           11.18             4.764             7.162
PBcR (with Illumina)  57.337    98.100       0.3342          99.652           22.05             20.085            157.726
Jabba                 24.979    99.484       0.1279          99.900           0.17              1.031             0.993
PBcR (only PacBio)    60.065    95.822       2.1018          99.907           4.42              9.571             24.610
LoRDEC∗+LoRMA         71.987    98.088       0.3644          99.375           21.08             17.968            4.852

Especially for larger genomes it is of interest to know how much coverage is needed for the error correction to succeed. We investigated this by creating random subsets of the E. coli data set with coverages 25x, 50x, 100x, and 150x. We then ran our method and PBcR [10, 3] on these subsets to investigate the effect of coverage on the error correction performance. Table 7 shows the results of these experiments. The other tools, LoRDEC, Jabba and proovread, also use the complementary Illumina reads, so the coverage of the PacBio reads does not affect their performance.

Table 7: The effect of coverage of the PacBio read set on the quality of the correction. For both LoRDEC∗+LoRMA and PBcR at coverages 25x, 50x, 100x, 150x, and 208x, the table reports Size (%), Aligned (%), Error rate (%), Gen. cov. (%), Time (h), Memory (GB), and Disk (GB).

When the coverage is high, the new method retains a larger proportion of the reads than PBcR and is more accurate, whereas when the coverage is low, PBcR retains more of the data and a larger proportion of it can be aligned. However, the error rate remains much lower for our new tool. The reads corrected by PBcR also cover a larger part of the reference when the coverage is low.
Discussion

We have presented a new method for correcting long and highly erroneous sequencing reads. Our method shows that efficient alignment-free methods can be applied to highly erroneous long read data. The current approach needs alignments to take into account the global context of errors. Reads corrected by the new method have an error rate less than half of the error rate of reads corrected by previous selfcorrection methods. Furthermore, the throughput of the new method is 20% higher than previous selfcorrection methods with read sets having coverage at least 75x.

Recently several algorithms have been proposed for updating the DBG when k changes, instead of constructing it from scratch [4, 5]. However, these methods are not directly applicable to our method because also the read set changes when we run LoRDEC∗ iteratively on the long reads. Our method works solely on the long reads, whereas many previous methods also require short accurate reads produced by e.g. Illumina sequencing, which can incorporate sequencing biases into the PacBio reads. This could have a very negative effect on sequence quality, especially since Illumina suffers from GC content bias and some context dependent errors [22, 17].

As further work we plan to improve the method to scale up to mammalian size genomes. We will investigate a more compact representation of the path labels in the augmented DBG to replace the simple hash tables currently used. The construction of the multiple alignment could also be improved by exploiting partial order alignments [14], which have been shown to work well with PacBio reads [7]. Another direction of further work is to investigate the applicability of the new method to long reads produced by the Oxford NanoPore MinION platform. Laver et al. [13] have reported an error rate of 38.2% for this platform and they also observed some GC content bias.
Both of these factors make the error correction problem more challenging, and therefore it will be interesting to see a comparison of the methods on this data.

Funding
This work was supported by the Academy of Finland (grant 267591 to L.S.), ANR Colib'read (grant ANR-12-BS02-0008), IBC (ANR-11-BINF-0002), and Défi MASTODONS to E.R., and EU FP7 project SYSCOL (grant UE7-SYSCOL-258236 to E.U.).
References

[1] K. F. Au et al. Improving PacBio long read accuracy by short read alignment. PLoS ONE, 7(10):e46679, 2012.
[2] A. Bankevich et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455–477, 2012.
[3] K. Berlin et al. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat Biotechnol, 33:623–630, 2015.
[4] C. Boucher et al. Variable-order de Bruijn graphs. In Proc. DCC 2015, pages 383–392, 2015.
[5] B. Cazaux, T. Lecroq, and E. Rivals. From indexing data structures to de Bruijn graphs. In Proc. CPM 2014, pages 89–99, 2014.
[6] M. J. Chaisson and G. Tesler. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, 13:238, 2012.
[7] C.-S. Chin et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods, 10:563–569, 2013.
[8] E. Drezen et al. GATB: genome assembly & analysis tool box. Bioinformatics, 30(20):2959–2961, 2014.
[9] T. Hackl et al. proovread: large-scale high accuracy PacBio correction through iterative short read consensus. Bioinformatics, 30(21):3004–3011, 2014.
[10] S. Koren et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol, 30(7):693–700, 2012.
[11] S. Koren and A. M. Phillippy. One chromosome, one contig: complete microbial genomes from long-read sequencing and assembly. Current Opinion in Microbiology, 23:110–120, 2015.
[12] D. Laehnemann, A. Borkhardt, and A. C. McHardy. Denoising DNA deep sequencing data – high-throughput sequencing errors and their correction. Briefings in Bioinformatics, 17(1):154–179, 2016.
[13] T. Laver et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomolecular Detection and Quantification, 3:1–8, 2015.
[14] C. Lee, C. Grasso, and M. F. Sharlow. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452–464, 2002.
[15] M.-A. Madoui et al. Genome assembly using Nanopore-guided long and error-free DNA reads. BMC Genomics, 16:327, 2015.
[16] G. Miclotte et al. Jabba: hybrid error correction for long sequencing reads using maximal exact matches. In Proc. WABI 2015, volume 9289 of LNBI, pages 175–188. Springer, 2015.
[17] K. Nakamura et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research, 39(13):e90, 2011.
[18] Y. Ono, K. Asai, and M. Hamada. PBSIM: PacBio reads simulator – toward accurate genome assembly. Bioinformatics, 29(1):119–121, 2013.
[19] Y. Peng et al. IDBA – a practical iterative de Bruijn graph de novo assembler. In Proc. RECOMB 2010, volume 6044 of LNBI, pages 426–440. Springer, 2010.
[20] L. Salmela and E. Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30(24):3506–3514, 2014.
[21] L. Salmela and J. Schröder. Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.
[22] M. Schirmer et al. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43(6):e37, 2015.
[23] X. Yang et al. A survey of error-correction methods for next-generation sequencing.