[PDF] Comprehensive assessment of error correction methods for high-throughput sequencing data

Abstract

The advent of DNA and RNA sequencing has revolutionized the study of genomics and molecular biology. Next generation sequencing (NGS) technologies like Illumina, Ion Torrent, SOLiD sequencing etc. have brought about a quick and cheap way to sequence genomes. Recently, third generation sequencing (TGS) technologies like PacBio and Oxford Nanopore Technology (ONT) have also been developed. Different technologies use different underlying methods for sequencing and are prone to different error rates. Though many tools exist for error correction of sequencing data from NGS and TGS methods, no standard method is available yet to evaluate the accuracy and effectiveness of these error-correction tools. In this study, we present a Software Package for Error Correction Tool Assessment on nuCLEic acid sequences (SPECTACLE) providing comprehensive algorithms to evaluate error-correction methods for DNA and RNA sequencing, for NGS and TGS platforms. We also present a compilation of sequencing datasets for Illumina, PacBio and ONT platforms that present challenging scenarios for error-correction tools. Using these datasets and SPECTACLE, we evaluate the performance of 23 different error-correction tools and present unique and helpful insights into their strengths and weaknesses. We hope that our methodology will standardize the evaluation of DNA and RNA error-correction tools in the future.

Full PDF

1 Comprehensive assessment of error correction methods for high-throughput sequencing data

Yun Heo, Gowthami Manikandan, Anand Ramachandran, Deming Chen ([email protected]) University of Illinois at Urbana-Champaign ABSTRACT

Keywords: Next Generation Sequencing; Third Generation Sequencing; error correction; error correction evaluation; error analysis

INTRODUCTION

Rapid improvements in next-generation sequencing (NGS) technologies have allowed us to generate a huge amount of sequencing data at a low cost. However, the quality of the data has ONT’s

MinION reads have an error rate of 38.2% [27]. Also, the dominant error types of these technologies are insertions and deletions that are rare in Illumina reads. Due to these characteristics, dedicated error-correction methods for PacBio reads [28-31] and Oxford Nanopore reads [32-35] have been developed. Despite such a large number of error-correction methods, only a few studies exist that are dedicated to the evaluation of the accuracy of these methods. Such scarcity is mainly due to the difficulty involved in discerning how many errors were corrected and how many were newly generated in the error-correction process. While checking if substitution errors have been corrected is easily done by measuring the Hamming distance between a reference sequence and a corrected read, it is not so simple to evaluate how accurately errors are corrected when insertions and deletions also exist as errors. The evaluation becomes more complex when reads are trimmed during error correction. Aligning a read to the source genome does not always provide the right solution since multiple best alignments can exist [5]. Heterozygosity also makes the evaluation hard. In a diploid genome, the same locus in a pair of chromosomes could have different alleles. Therefore, one of them can potentially be recognized as a sequencing error if reads from heterozygous regions are compared with one reference sequence. To the best of our knowledge, only a handful of research works have been carried out to quantitatively evaluate how exactly errors in NGS reads have been corrected. Error Correction Evaluation Toolkit (ECET) [36] is an error-correction evaluation platform that consists of two software packages, one of which evaluates Illumina reads and the other, 454 or Ion Torrent reads. The reason for having two separate algorithms for dealing with different technologies is that the dominant error models of 454 and Ion Torrent reads are insertions or deletions in homopolymers while most errors in Illumina reads are substitutions [37, 38]. Another evaluation work by Molnar et al. [39] determines the correctness of reads or k -mers in the outputs from Illumina error-correction tools instead of directly checking the correctness of bases. Their method calculates (1) how many error-free reads or k -mers cover each base in a genome and (2) how many bases in a reference sequence are covered by error-free reads or k -mers, then checks how the two numbers are changed by error correction. Another evaluation methodology, compute_gain, is a program that is a part of an error-correction tool package Fiona [5]. It aligns both a read and its corrected version to a reference sequence, and calculates the difference in edit distance between the two alignments. Ambiguities in alignments are resolved by placing gaps at the leftmost or rightmost possible position. Even though the three methods opened up ways of evaluating the outputs from error-correction methods, all of them have limitations. The software package for Illumina reads in ECET can only work with the tools that explicitly specify the number of bases trimmed from both ends of reads. Even when this information is available, separate programs for each error-correction tool are needed to extract the number of trimmed bases, because the tools output the number in different ways. The software package for 454 or Ion Torrent reads in ECET can evaluate reads with insertions and deletions but the evaluation results could be inaccurate for trimmed reads. Figure S2 in the supplementary document shows some examples in which ECET evaluates reads incorrectly. Even though the software reported in [39] can be applied to the outputs from any Illumina error-correction method, it may not be applicable to other sequencing technologies. Since PacBio reads, for example, have a high error rate and the errors are evenly distributed in the reads, it is hard to get error-free k -mers of sufficient length. If short k -mers are used by this tool for the evaluation of PacBio reads, the specificity of the evaluation would be low because it is likely that the same or similar k -mers exist in other parts of the genome sequence as well. The evaluation results of compute_gain, like that of ECET, could be inaccurate in some cases. Since the alignment scores used in compute_gain were designed to evaluate edit distance, a read could be aligned to a reference sequence in totally different ways before and after error correction, which makes it possible for the evaluation result to be inaccurate (see Figure S3 in the supplementary document). In addition to these methods, evaluations are presented when individual error-correction tools are introduced. For Illumina tools, quality of error correction is evaluated using counts of uncorrected errors and corrected errors, and expressed in the form of metrics such as sensitivity, precision and gain [6-16]. The error counts used are obtained by mapping the reads to the reference, as reported in [36]. We outlined in paragraphs above how current mapping-based methods can give inaccurate evaluation results. Literature reporting new TGS error-correction tools on the other hand, do not report error count metrics such as sensitivity or gain. Instead, performance is measured based on improvements in downstream assembly and alignment results [28-25]. While such results give a good picture of how much improvement is made by error correction, there can be variations in such measurements based on the specific assembly or alignment tools used to obtain these metrics. In this work, we try to address limitations of existing error-correction evaluation methods. We develop a new algorithm for evaluation of error-correction accuracy in terms of error count statistics that avoids the pitfalls of current alignment-based evaluation methods. This algorithm can be applied across different sequencing platforms from the NGS and TGS paradigms, allowing a uniform and standard error-correction evaluation method across technologies. In addition, we introduce new metrics for error-correction tool study that provide additional insights regarding design limitations of a given tool, giving pointers on how the tool may be improved. Using these new methods, we perform a comprehensive analysis of many state-of-the-art error correction tools and report error count statistics, alignment and assembly statistics, as well as additional metrics for understanding tool behavior. More specifically, the key contributions of this work can be summarized as follows: 1) We have developed a new error-correction tool evaluation algorithm that is generic across different sequencing error models and error rates. It can quantitatively evaluate any error-correction tool for NGS and TGS reads. It works for both DNA and RNA sequencing data, and can differentiate heterozygous alleles from sequencing errors. 2)

We have designed input read sets that highlight the challenges in error correction such as heterozygosity, coverage variation, and repeats (the input datasets are available for download). These reads can be used as standard inputs for the evaluation of error-correction tools. 3)

We have evaluated and compared 23 state-of-the-art error correction tools for NGS and TGS reads using SPECTACLE with these data sets as inputs. From SPECTACLE, we report many statistics pertaining to error correction like sensitivity, gain, precision, F1-Score, percentage similarity of reads, NG50 length, supporting read coverage, alignment quality of corrected reads, performance of the tool with respect to read position etc. for both NGS and TGS reads. This will give users systematic evaluations of strengths and weaknesses of the tools and indicate potential ways for their further improvement. In the sections that follow, we will explain how we prepared the inputs for our evaluation and how the evaluation algorithm works. We then present and discuss the evaluation results and what should be done in the future. EVALUATION METHOD

Figure 1.

Flowchart of evaluating corrected DNA sequencing reads using SPECTACLE. (A) Evaluation flow for simulated reads. (B) Evaluation flow for real reads.

Figure 1 shows the SPECTACLE flows for evaluating error-correction tools with DNA simulated reads and DNA real reads. Each flow consists of two steps. In the first step, the locations of errors in input reads are determined, and in the next step this information is used to evaluate the output of an error-correction tool. The two steps will be explained in detail in the following subsections. The basic flow for evaluating RNA error-correction tools is similar and is explained in the supplementary document. Preparing Input Data

SPECTACLE supports using both simulated reads and real reads to utilize their unique strengths. With simulated reads, we can determine the exact locations of errors in the reads. Moreover, reads can be generated from multiple reference sequences with some differences in order to check whether an error-correction tool is able to differentiate heterozygosity from sequencing errors. The biggest advantage of using real reads is that no assumptions or modeling artifacts exist behind the sequencing data. Therefore, real reads can have some interesting properties that may not be accurately modeled in simulated reads. On the other hand, there can be ambiguities in finding error locations in real reads. In order to find the error locations in real reads, the reads need to be aligned to a reference sequence, and this can cause some problems. First, it is possible that a read can be aligned to multiple similar locations in a reference sequence (or to the same location in different ways), and determining the correct alignment is sometimes impossible. In the case of highly repetitive genomes, ambiguous alignments occur frequently, raising the chances of inaccurate evaluation results. Second, reads and a reference sequence might come from different samples, and the differences between them (i.e., variants) may be recognized as errors in this step. Third, the evaluation results will depend on the accuracy of the alignment tool. SPECTACLE can work with the output reads from any read simulator that gives error location information in a Sequence Alignment/Map (SAM) format. However, in our study we used pIRS [40] exclusively for generating simulated Illumina reads. Error correction becomes challenging when there are heterozygosity and read coverage variations [3, 41], and pIRS can produce reads with these characteristics. pIRS can generate reads using a diploid genome, and the reads have both sequencing errors and heterozygosity. Second, pIRS can change read coverage depth of a specific genomic region according to the GC-content of the region. Figure 1A depicts the evaluation flow for simulated reads. First, two reference sequences Ref

Ref

0. Once the two sequences are created, reads are generated from

Ref

2. The maximum ploidy level that SPECTACLE supports is two. After the reads are generated, the locations of errors in the reads should be written in an error location file F L . F L contains 1) the positions where reads originate in the genome, 2) the locations of substitutions, insertions, and deletions in each read, and 3) reference sequence from which each read was sampled (i.e. Ref

Ref F L . In order to simulate PacBio reads, we used PBSIM [42]. PBSIM generates a Mutation Annotation Format (MAF) file for indicating error locations, and the file is converted to F L . Because these TGS technologies do not use amplification techniques that cause higher error rates in regions in the genome that have higher G and C base content, the coverage variation due to different GC-content values was not considered in generating the simulated reads for PacBio. These TGS reads are generated from a single reference sequence because their error rate is much higher than the frequency of heterozygous sites and we do not expect the evaluation results to be altered appreciably by adding heterozygous points. However, a real dataset is used for evaluating PacBio reads that is heterozygous. Figure 1B shows the evaluation flow for real reads. If input reads and a reference sequence

Ref

1, is generated by calling the variants and applying them to

Ref

0. In our evaluation, BWA [43] and SAMtools [44] were used for variant calling. The variants are added to

Ref

1, and the alignment results in the SAM file are converted to F L . Among the substitution errors in F L , the errors generated by heterozygous alleles are removed by comparing F L with the variant calling result. Evaluating the Accuracy of Corrected Reads

Let R C be the corrected version of a read R . In order to evaluate the accuracy of R C , we should find corrected errors and newly added errors in R C . SPECTACLE first takes the segment G R from a reference sequence where read R was sampled. Then, R C is aligned to G R to find the errors in R C (errors missed, or introduced by a tool). For Illumina reads, we implemented a modified version of the Gotoh algorithm [46] for handling trimmed bases and for extracting all the alignments with the best alignment score. This algorithm is explained in detail in the supplementary document. There can be a set of alignments, ALN

BEST , having the highest alignment score for a read R C , but each alignment could imply different numbers of corrected and newly introduced errors. Instead of arbitrarily picking one of the alignments from ALN

BEST to count the number of errors in R C , SPECTACLE introduces additional criteria that rank each of the alignments in ALN

BEST based on the error-correction accuracy in each case. Specifically, SPECTACLE calculates a penalty score based on newly introduced errors for each alignment in

ALN

BEST , utilizing the scores used in the alignment step. Then, the alignment, aln BEST , from

ALN

BEST that has the least penalty is chosen. SPECTACLE makes the choice using the following equation, where

ERR ( aln ) and ERR ( R ) are the sets of errors in an alignment aln and R and ERR ( aln ) \ERR ( R ) is the set of errors in aln but not in R. 𝑎𝑙𝑛 𝐵𝐸𝑆𝑇 = argmax 𝑎𝑙𝑛∈𝐴𝐿𝑁

𝐵𝐸𝑆𝑇 ∑ 𝑝𝑒𝑛𝑎𝑙𝑡𝑦(𝑒𝑟𝑟) 𝑒𝑟𝑟∈(𝐸𝑅𝑅(𝑎𝑙𝑛)\𝐸𝑅𝑅(𝑅))

After aln

BEST is chosen, we can compute from it, how many errors in

ERR ( R ) are corrected and how many errors are newly added during correction. In the case of Illumina reads, the evaluation of aln BEST from the set of best alignments,

ALN

BEST , follows a brute-force enumeration technique. However, for TGS reads that have long read lengths with indel errors and high error rates, there may be a very large number of candidates to be evaluated, making this enumeration method infeasible. To solve this problem for TGS reads, we used a different implementation that combines a dynamic programming-based enumeration of aln

BEST with the alignment algorithm. However, to simplify the implementation it currently uses a fixed and simpler scoring scheme, namely, a score of -1 is assigned to gaps and mismatches and a score of +1 is assigned to matches. Using this implementation, we can evaluate the same error count statistics for TGS reads as for Illumina reads, albeit with a limited scoring option. We plan to improve this to provide scoring flexibility in the future. The dynamic programming recursions are explained in detail in the supplementary document. In order to classify the bases in input reads, we introduce a notation consisting of a triplet, each character of which should be either Y or N. The first character indicates whether the base in the original read is correct (Y) or not (N), the second character indicates whether the base has been modified by an error correction tool (Y) or not (N), and the third one indicates whether the base in the corrected read at that position is correct (Y) or not (N). For example, NYY describes a base that is erroneous in R , modified by an error correction tool, and error-free in R C . All the bases should fall into one of the five categories: NNN, NYN, NYY, YNY, and YYN because YYY, YNN, and NNY are logically meaningless. Using these triplets, the accuracy metrics that are summarized in Table 1 are calculated. Table 1.

Accuracy metrics.

Metrics

Equations

Sensitivity sum(NYY) / (sum(NYY) + sum(NYN) + sum(NNN))

Gain (sum(NYY) - sum(YYN) - sum(NYN)) / (sum(NYY) + sum(NYN) + sum(NNN))

Specificity sum(YNY) / (sum(YYN) + sum(YNY))

Precision sum(NYY) / (sum(NYY) + sum(YYN) + sum(NYN))

F-score

SPECTACLE also can calculate and report the percentage similarity of reads for error correction evaluation. This feature is mainly intended for long reads. Percentage similarity of a read set S R is defined using the following equation, where N RM , N RMM , N RI , and N RD are the number of matched bases, the number of mismatched bases, the number of inserted bases, and the number of deleted bases in the alignment result of R , respectively: Percentage Similarity = ∑ 𝑁 𝑅𝑀 𝑁 𝑅𝑀 +𝑁 𝑅𝑀𝑀 +𝑁 𝑅𝐼 + 𝑁 𝑅𝐷𝑅∈𝑆 𝑅 SPECTACLE calculates percentage similarity both for input reads and for their error correction results, and shows how this number is improved after error correction. Most TGS error correction methods trim uncorrected regions in reads. After this process, R C could be split into multiple pieces and become much shorter than R . Therefore, SPECTACLE also reports read coverage that indicates how long total read length (after trimming) is and NG50 [1] that shows how long the average read length is. In addition to these metrics, SPECTACLE can report other detailed analyses such as those related to supporting read coverage which help users understand the characteristics of an error-correction tool in depth. Figure 2 explains a supporting read, supporting read coverage, and differential supporting read coverage. An error in a read becomes difficult to correct if the corresponding correct base has low supporting read coverage. This is because most error-correction tools recognize bases with low supporting read coverage as errors. Low differential supporting read coverage also makes error correction harder, because then both a correct base and an erroneous base have a similar number of supporting reads. SPECTACLE gives the percentage of corrected bases against supporting read coverage for correct bases, and the Figure 2.

Supporting reads and supporting read coverage. Supporting reads are the reads that include a specific position of a reference sequence with a specific base at the position. In the left side, there is a read CGTTAA with an erroneous base T, and three more correct reads are also sampled there. In this example, the number of supporting reads (i.e. supporting read coverage) for T at that position of the reference sequence is 1, while supporting read coverage for C is 3. However, there is another similar sequence in the reference sequence (i.e. repeats) and the reads sampled at the right region could be supporting read for T at the left side, which makes it hard to correct the error. Differential supporting read coverage of an erroneous base can be defined as (supporting read coverage of correct base) - (supporting read coverage for the erroneous base). percentage of corrected bases against differential supporting read coverage. This metric helps in evaluating how sensitive an error correction tool is to variations in read coverage. SPECTACLE collects the percentage of corrected bases in each position of reads (i.e., point sensitivity). Based on this, users can judge whether an error-correction tool can correct errors in a specific region of reads or not. This report can allow SPECTACLE users to discern how the output of an error-correction tool can be polished further, how multiple error-correction algorithms can be combined, and how an error-correction algorithm can be improved. SPECTACLE also reports measurements that provide an idea about how good the corrected reads are in the context of downstream analyses. One of the most intuitive ways to evaluate these is to count the number of corrected reads that can be aligned to a reference sequence without mismatches or indels. However, this result can be misleading when reads are aligned to wrong positions in a reference sequence. In order to avoid this, SPECTACLE has the capability to compare the aligned locations of reads in a SAM format with F L . If insertions or deletions in a read are corrected, the aligned position of the read can be shifted. SPECTACLE determines the largest possible amount of shift in the aligned positions for each read using the number of insertions and deletions, and then reports the number of reads aligned correctly within this predicted range. The average number of times each base in the reference sequence is covered by error-free reads (i.e. error-free read coverage) and the fraction of a reference sequence that is covered by error-free reads (i.e. chromosome coverage) are important metrics that indicate the quality of a read set [39]. SPECTACLE collects the two numbers using the exact alignment result described above. RESULTS

We evaluated 17 Illumina read error-correction tools, four PacBio and two ONT read error-correction methods using SPECTACLE. All the experiments were done on a cluster, each computing node of which had two six-core Intel Xeon X5650 processors and 24 GB of memory. In the following sections, we have included only selected results that highlight the strengths and weaknesses of the tools. The remaining results, software versions, and software command line options are available in the supplementary document.

DATA PREPARATION

Preparing Illumina Read Sets

As discussed above, coverage variation, heterozygosity, and repeats complicate error correction, and all the three factors were considered when we prepared input reads for our evaluation. The Illumina read sets we prepared are described in Table 2. Five different genomes I1-I5 were used to generate simulated read sets. Even though high coverage read sets are popular, correcting errors in low-coverage reads is still important. For example, cancer genome samples could be the mixture of cancer genomes and normal genomes, and the portion of one of the genomes could be very low [47]. Error-correction tools for such genomes should have the capability to correct errors in low-coverage reads. Therefore, read sets having both high and low coverage values are considered, and the coverage value is indicated using the postfixes -10X, -20X, -30X, and -40X. Coverage ranges from 10x to 40x have been picked to be consistent with base datasets used in other works reporting and validating error-correction methods. Table 2.

Details of Illumina read sets. ID Reference

Read

Species

Accession

Number G L (Mbp) GC (%)

Length

Cov. (X)

Error Rate (%)

Avg.

Std.

I1-10X

R. sphaeroids

NC_007488.1

NC_007489.1

NC_007490.1

NC_007493.1 I1-20X I1-30X I1-40X I2-10X

B. cereus

ATCC 10987

NC_003909.8

NC_005707.1 I2-20X I2-30X I2-40X I3-10X

O. sativa Chr. 5

NC_008398.2 I3-20X I3-30X I3-40X I4-10X

Mouse Chr. Y

NC_000087.7 I4-20X I4-30X I4-40X I5-10X

Human Chr. 1

NC_00001.11 I5-20X I5-30X I5-40X I6 B. cereus

ATCC 10987

NC_003909.8

NC_005707.1 [ G L ] genome length without Ns; [GC Avg.] average GC contents; [GC Std.] GC contents standard deviation; [Cov.] read coverage; [Error Rate] ( + + ) / . I1, I2, and I3 are

E. Coli Because the coverage of the reads is over 2,500 X, we down-sampled the reads to 40 X. Details regarding the down-sampling can be found in the supplementary document.

Preparing PacBio Read Sets

The read sets used for evaluating PacBio error-correction tools are shown in Table 3. The PacBio error-correction tools evaluated in this study require, in addition to PacBio reads, Illumina reads that are much more accurate than the PacBio reads as most PacBio error-correction tools use high quality Illumina short reads to detect and correct errors. These Illumina reads are described in the "Illumina" column of Table 3. In order to evaluate the effect of Illumina read coverage on the accuracy of error correction for PacBio reads, we prepared four different Illumina read sets with different read coverage values (suffixed -10X, -20X, -30X, and -40X). 40X-EF is an error-free version of the 40X read set and was used to evaluate the effects of sequencing errors in Illumina reads on error correction for PacBio reads.

Table 3.

Details of PacBio read sets. ID Reference

PacBio

Illumina

Species

Accession

Number G L (Mbp) Length (bp)

Cov. (X)

Error

Rate (%)

Accession

Number

Length (bp)

Cov. (X)

Error

Rate (%)

P1-10X

E. coli

NC_000913.3 SRR922409 P1-20X

SRR922409 P1-30X

SRR922409 P1-40X

SRR922409 P2-10X

Human Chr19

10 Mbp

NC_000019.10 N/A P2-20X

N/A P2-30X

N/A P2-40X

N/A P2-40X-EF

N/A [ G L ] genome length; [Cov.] read coverage; [Error Rate] ( + + ) / . P1 is E . coli K12 M1665 strain, and both the PacBio reads and the Illumina reads are real reads. The PacBio reads were downloaded from Pacific Biosciences DevNet (https://github.com/PacificBiosciences/DevNet/wiki/E%20coli%20K12%20MG1655%20Hybrid% P2 is the first 10 Mbp region of human chromosome 19, which was used for evaluating the scalability of the PacBio error correction tools. We first tried using the entire human chromosome 19. However, only LoRDEC could be finished within 70 hours, which is the maximum allocated run time in our cluster; as a result, we had to use a portion of the chromosome. The PacBio reads and the Illumina reads for P2 were simulated using PBSIM and pIRS, respectively. PacBio reads with lengths shorter than 500 bp were filtered out because short PacBio reads have a lot of errors (due to the high error rates in TGS sequencing) compared to the corresponding Illumina reads that have similar lengths. Hence, short PacBio reads would not have any major benefit compared to paired-end Illumina reads and are ignored for correction and evaluation.

Preparing Oxford Nanopore (ONT) Read Sets

Table 4 shows the details of the ONT datasets that have been used for the evaluation of ONT error-correction tools. ONT is a relatively newer technology, and ONT read simulation and error-correction techniques are currently maturing. As a result, are fewer ONT datasets that are publicly available for testing and evaluation. ONT error-correction tools also use short Illumina reads of high quality for error correction and the details of the Illumina reads used have been mentioned in Table 4. Both O1 and O2 are real read sets. O1 is

E. Coli K12

M1665 strain. The raw reads were downloaded from GigaDB (http://gigadb.org/dataset/view/id/100102/token/S30Hp9ZurcARyhov). O2 is

Saccharomyces cerevisiae W303 Table 4.

Details of ONT (MinION) read sets. ID Reference

MinION reads

Illumina

Species

Accession

Number G L (Mbp) Length (bp)

Error

Rate (%)

Accession

Number

Length (bp)

Cov. (X)

Error

Rate (%)

O1-10X

E. coli

NC_000913.3

SRR922409 O1-20X

SRR922409 O1-30X

SRR922409 O1-30x-EF

SRR922409 O2-10X

Saccharomyces cerevisiae W303

SRP055987 7.5

SRR567755 O2-20X

SRR567755 O2-30X

SRR567755 O2-30X-EF

SRR567755 [ G L ] genome length; [Error Rate] ( + + ) / . RUNNING ERROR CORRECTION TOOLS

Running Illumina Read Error-Correction Tools

The input read sets were corrected using the 17 error-correction tools that had shown good accuracy in the previous evaluations or had been newly published at the time of running the evaluations. Among these, the stand-alone error correction tools are BFC [49], BLESS [6], Blue [7], Coral [8], HiTEC [9], Fiona[5], Lighter [10], Musket [11], Quake [12], QuorUM [13], RACER [14], Reptile [15],Trowel [16] and ECHO [17]. The remaining three tools are parts of DNA assemblers, ALLPATHS-LG [20], SGA [21], and SOAPdenovo [22]. For each error-correction method, we applied successive numbers to the key parameters of the tools, and generated multiple corrected output read sets corresponding to each parameter. The output read sets were assessed using SPECTACLE and the one that had the highest gain for substitutions, insertions, and deletions was chosen. The maximum k -mer length for Quake was limited to 18 beyond which the memory capacity of our server was exhausted. ALLPATHS-LG, BFC, BLESS, Blue, Musket, Quake, QuorUM, RACER, Reptile, SGA, and SOAPec succeeded in generating outputs for all the input read sets. Coral, HiTEC, Fiona, and Trowel failed to correct errors in large genomes because of insufficient memory. ECHO had not finished after 70 hours for the I4 and I5 read sets. Lighter finished correcting all the read sets but it made no correction for the read sets with 10 X coverage. Even though SPECTACLE can assess the outputs from error correction tools for RNA sequencing reads, the evaluation results for such tools have been excluded from the main document. SEECER [24] is the only software available for correcting RNA sequencing reads; for some input parameters, however, SEECER would occasionally terminate abnormally. So we ran SPECTACLE with SEECER with the parameters for which the tool had completed execution, and did not perform a parameter-space exploration. The results are in the supplementary document. Running TGS Read Error-Correction Tools

Widely used PacBio read error correction tools LoRDEC [28], LSC [29], PBcR [30], and Proovread [31] were evaluated using P1 and P2. No parameter tuning was needed for LSC, PBcR, and Proovread. For LoRDEC, we generated multiple output sets by applying successive values for k -mer length and solid k -mer occurrence threshold, and chose the result that gave the highest percentage similarity explained in the method section. We could not assess LSC using P2 because it had not finished after 70 hours. Since ONT is a relatively new technology, ONT read error correction technologies are just being explored and studied in detail. We evaluated two of the most recent ONT read error correction technologies NanoCorr [33] and NaS [32] using O1 and O2. Default parameters were used for each of the error-correction methods.

EVALUATION RESULTS FOR ILLUMINA ERROR CORRECTION TOOLS

Accuracy of Illumina Error Correction Tools Table 5.

Sensitivity and gain of substitution errors for the 40 X Illumina read sets.

Software

I1-40X

I2-40X

I3-40X

I4-40X

I5-40X I6 Sens.

Gain

Sens.

Gain

Sens.

Gain

Sens.

Gain

Sens.

Gain

Sens.

Gain

ALLPATHS-LG

BFC

BLESS

Blue

Coral

N/A

ECHO

N/A

Fiona

N/A

HiTEC

N/A

Lighter

Musket

Quake

QuorUM

RACER -0.097

Reptile

SGA

SOAPec

Trowel

N/A

N/A [Sens.] sensitivity.

Sensitivity and gain for substitution errors for the 40 X input read sets are summarized in Table 5. For all the bacterium genomes I1, I2, and I3, ALLPATHS-LG, BLESS, Lighter, Musket, Quake, QuorUM, and SGA generated outputs with gain above 0.95. For the highly repetitive genome I4, BLESS and Quake outperformed the others, and only these two tools obtained gain above 0.8. For I5, the largest input genome, ALLPATHS-LG, BFC, BLESS, Lighter, Musket, Quake, QuorUM, and SGA showed gain above 0.9. Other than BFC, these are the same tools that worked well for I1-I3. In the evaluation using I6, most tools showed similar performance as they did for I2 since both I2 and I6 were generated from B . cereus . However, Coral, Quake, Reptile, SOAPec, and Trowel showed a degradation of above 0.1 for the gain value in I6 when compared with I2. The difference between sensitivity and gain shows how many false corrections were made by each tool. In general, BFC, BLESS, Quake, SGA, and SOAPec generated fewer false corrections than the others. Table 6.

Sensitivity and gain of substitution errors for the I5 read sets with different coverage values.

Software

I5-10X

I5-20X

I5-30X

I5-40X

Sensitivity

Gain

Sensitivity

Gain

Sensitivity

Gain

Sensitivity

Gain

ALLPATHS-LG

BFC

BLESS

Blue

Fiona

N/A

Lighter

N/A

Musket

Quake

QuorUM

RACER -2.287 -0.164

Reptile

SGA

SOAPec

Table 6 shows the variation in gain with different read coverage values for I5. Only BLESS, Musket, and Quake generated gain above 0.85 for all the read sets. Lighter showed good results for 20-40 X reads, but it could not correct the errors in I5-10X. BFC, BLESS, Musket, Quake, SGA, and SOAPec made a small number of false corrections for low coverage read sets. Gain was saturated in most tools when read coverage became 30 X. Figure 3.

The percentage of corrected errors in I5-40X for various supporting read coverage of correct bases.

The percentage of corrected bases as a function of supporting read coverage for I5-40X is shown in Figure 3. ALLPATHS-LG, Quake, and QuorUM corrected more errors than the others when supporting read coverage of correct bases was close to 1. Even though ALLPATHS-LG and QuorUM have the capability to correct errors with low supporting read coverage, gain for I5-10X of the tools in Table 6 was not as impressive as this result. This is because they also generated many false positives for this input set. The effect of differential supporting read coverage on error correction was significant only when read coverage was low. These results can be found in Figure S4 of the supplementary document. As shown in Figure 4, tools can correct different percentages of errors in different locations in reads. The plots for ALLPATHS-LG, BFC, BLESS, and Lighter show relatively flat lines, which means that they corrected almost the same proportion of errors in all the positions in reads. On the other hand, plots for QuorUM and SGA have deep valley points, and the positions of these regions with little correction match with the k -mer length used with these tools for generating the respective outputs. In addition, Quake could only correct a relatively small number of errors at both ends of reads compared to the others. Figure 4.

Point sensitivity of the I5-40X reads.

We also analyzed gain with respect to insertions and deletions and the results can be found in the supplementary document.

Alignment Results for Illumina Error Correction Tools

Table 7 shows how many corrected reads can be exactly aligned to the reference sequences. Reads were aligned using the paired-end alignment feature of Bowtie [50] without allowing any mismatches or indels. The genomes I1-I5 have two reference sequences, and corrected read sets were aligned to the reference sequence from which they originated. The alignment results are well matched with the results in Table 5, and the tools that showed high sensitivity also had more reads aligned correctly to the reference sequences. Table 7.

Alignment results of 40 X Illumina read sets.

Software

I1-40X

I3-40X

I4-40X

I5-40X I6 Aligned

Correct

Aligned

Correct

Aligned

Correct

Aligned

Correct

Aligned

Correct

Aligned

Correct

Original

ALLPATHS-LG

BFC

BLESS

Blue

Coral

N/A

ECHO

N/A

Fiona

N/A

HiTEC

N/A

Lighter

Musket

Quake

QuorUM

RACER

Reptile

SGA [Aligned] the percentage of aligned reads to the total number of reads; [Correct]: the ratio of reads that were aligned to correct positions as a percentage of the total number of aligned reads; Original: pre-correction results.

In almost all the cases, the ratio of correctly aligned reads to the total number of aligned reads was over 99 percent with the exception of I4. For I4, only the corrected reads from BLESS, Lighter, and Racer showed the accuracy of over 99 percent.

Runtime and Memory Usage of Illumina Error Correction Tools

To compare how the run time and memory usage of various tools scale with the size of the input, we compared each Illumina error-correction method for two cases, I5-20X and I5-40X which has twice the number of reads as I5-20X. These results are summarized in Table 8. Except Reptile, all the evaluated Illumina error-correction tools support parallelization, and 12 threads were used for the tools. In addition to running parallel threads on a single node, BLESS can also be parallelized across multiple nodes using MPI. BLESS ’s results on two computing nodes are reported separately. For I5-40X, BLESS, Lighter, and SGA could correct the read set using under 4 GB of memory. BFC, BLESS, Blue, Lighter, QuorUM, and RACER used almost the same memory for both 20 X and 40 X coverage reads. The fastest tools were BLESS and Lighter and they were over 13 times faster than ALLPATHS-LG. ALLPATHS-LG required 3.6 times longer time for correcting I5-40X than I5-20X. Table 8.

Memory usage and runtime of Illumina error correction tools for I5-20X and I5-40X.

Software

Memory Usage (MB)

Runtime (min)

I5-20X

I5-40X

I5-20X

I5-40X

ALLPATHS-LG

BFC BLESS (1 node) BLESS (2 nodes) Blue Lighter Musket Quake QuorUM RACER Reptile

SGA SOAPec Effect of Using Different Alignment Tools on the Evaluation of Real Reads

For real reads, we compare the errors corrected by an error-correction tool against mismatches and indels obtained in aligning the reads to a reference sequence. Therefore, the numbers and the locations of errors could vary according to alignment tools. We generated two F L files from I6 using BWA [43] and Bowtie 2 [51] with default options, and the two files were compared. While BWA found 473,090 substitution errors in D6, Bowtie 2 found 632,705. About 97 percent of substitutions in the BWA set were also found in the Bowtie 2 set, which means Bowtie 2 is more aggressive than BWA and it could indicate more errors in reads. When the error-correction results were evaluated using the F L file from Bowtie 2, sensitivity and gain dropped by up to 8 percent compared to the results with the F L file from BWA because some of the new errors found by Bowtie 2 were not corrected in the error correction tools. Detailed results can be found in the supplementary document.

EVALUATION RESULTS FOR TGS ERROR-CORRECTION TOOLS

Due to the high error rate of TGS reads, error correction outputs could have many uncorrected bases. Therefore, most TGS error-correction tools generate two types of reads: 1) trimmed reads that only contain corrected regions in input reads and 2) untrimmed reads that include both corrected and uncorrected regions in input reads. For PacBio reads, PBcR only produces trimmed reads, LSC and Proovread generate both trimmed reads and untrimmed reads, and they were assessed separately. For LoRDEC, trimmed reads were generated from the untrimmed reads using lordec-trim-split that is included in the LoRDEC package. For MinION reads, both NanoCorr and NaS produce trimmed reads.

Accuracy of PacBio Error Correction Tools

In Figure 5A, percentage similarity of the outputs from PacBio read error-correction methods for P1 are compared. Percent similarity of the input reads was 76.6% before error correction, and all the output results were better than this number. Among the four tools, three tools except LSC showed percentage similarity over 95% for the trimmed reads. For the untrimmed reads, LoRDEC and Proovread generated more accurate reads than LSC. Except for the case of untrimmed LoRDEC reads, read coverage of Illumina reads had almost no impact on percentage similarity. Figure 5.

Percentage similarity, read coverage, and NG50 of PacBio read error correction methods for P1. [Original] results for the input PacBio reads before error correction.

Figure 5B and Figure 5C show read coverage and NG50 of the outputs of the compared tools. The two charts had similar shapes. Both values were high where percentage similarity in Figure 5A was low. The trimmed LoRDEC reads and the PBcR outputs were improved a lot by increasing Illumina read coverage. The trimmed reads from Proovread were also improved but the values were saturated at 30 X coverage. Figure 6.

Percentage similarity, read coverage, and NG50 of PacBio read error correction methods for P2. Percentage similarity, read coverage, and NG50 of the input PacBio reads before error correction were 79.4 percent, 20 X, and 12,095 bp, respectively.

Percentage similarity, read coverage, and NG50 are compared for P2-40X and P2-40X-EF that is the error-free version of P2-40X in Figure 6. Both the trimmed Proovread reads and the trimmed LoRDEC reads showed high percentage similarity. Percentage similarity and read coverage of the untrimmed Proovread reads were almost the same compared to those of the trimmed Proovread reads. However, NG50 of the trimmed Proovread reads was shorter than that of the untrimmed Proovread reads. LoRDEC generated the trimmed reads with high percent similarity but it removed too many bases and read coverage and NG50 of the read set became much lower than those of the original input reads. For all these cases, P2-40-EF did not make a meaningful difference when it was compared with P2-40. This means sequencing errors in Illumina reads are not important when Illumina read coverage is about 40 X. F igure 7. Sensitivity and gain of Pacbio reads.

Figure 7A and 7B show the sensitivity and gain results for the different PacBio error-correction tools. These results include both indel and substitution errors. Compared to tools correcting Illumina reads, the PacBio error-correction tools seem to have lower sensitivity and gain. Similar to percentage similarity for the same dataset, gain and sensitivity generally improve upon trimming. For example, it may be seen that the sensitivity of trimmed reads of LORDEC, Proovread and LSC are significantly higher than that of untrimmed versions. For LORDEC trimmed reads, though sensitivity increases with higher Illumina coverage, gain remains largely unchanged indicating that at higher Illumina coverage, more errors are corrected, but also there are more false corrections. Accuracy of ONT Read Error Correction Tools

Figure 8.

Percentage similarity and NG50 lengths of input MinION reads.

Figure 8A shows percentage read similarity values for ONT datasets. For O1, the original read similarity was 57.3%, which is lower compared to the corresponding PacBio reads (comparing P1 and O1 from the E.Coli genome). Both the error-correction tools significantly improved the percentage similarity of reads, and the values did not significantly change with different coverage values of the corresponding Illumina datasets. Figure 8B shows the NG50 values for O1 and O2 datasets. Both tool outputs have a lower NG50 length than the input reads and NaS reads have a noticeably lower NG50 length compared to NanoCorr for the O2 dataset. Using error-free Illumina reads did not bring in a noticeable improvement in error correction. Figures 9A and 9B summarize the sensitivity and gain for the two ONT datasets. These results include both indel and substitution errors. It may be noted that NaS presents slightly higher sensitivity and gain compared to NanoCorr in both cases. Figure 9.

Sensitivity and gain of MinION reads.

CONCLUSIONS AND DISCUSSION

Among the Illumina read error correction methods that were evaluated, ALLPATHS-LG, BFC, BLESS, Lighter, Quake, QuorUM, and SGA generated accurate results for over 30 X read coverage. BLESS and Quake outperformed the others for reads with 10-20 X read coverage, and it is expected that ALLPATHS-LG would work best for the reads with under 10 X read coverage when we see Figure 3. For highly repetitive genomes, it is recommended to use BLESS and Quake for getting the most accurate results. Among the evaluated PacBio error correction tools, there was no apparent winner that could generate both accurate and long reads. While trimming improved the accuracy of correction significantly, it reduced the NG50 length and read coverage appreciably. Proovread may be recommended in cases where the accuracy of corrected reads is more important than their length. If a large read set should be corrected in a short time, LoRDEC might be a good choice. Both tools evaluated for ONT reads provided outputs with good percentage similarity and had comparable gain and sensitivity with respect to the PacBio tools. NanoCorr had longer NG50 length for one of the datasets. ONT sequencing techniques have a higher error rate compared to PacBio sequencing techniques. However, the evaluated tools seem to improve the accuracy of ONT reads significantly. Given that ONT sequencing is cheaper, easy to use and provides higher data throughput, further improvements to existing error-correction methods for ONT reads as well as developing read simulation techniques to support the evaluation of these tools can make it a very attractive alternative. Though some tools have recommendations for choosing input parameters, we tried to tune the parameters independently based on the results for fair comparison. However, in a real situation where the locations of errors are not known in advance, it would not be possible to find the best parameters this way. Therefore, it is recommended to developers of error-correction tools that as many parameters as possible should be automatically determined or clear guidelines for determining them should be given to users. We believe that SPECTACLE will also be compatible with new sequencing technologies and some of its potential is evident from the fact that it can work with NGS and TGS reads with varied characteristics, providing a comprehensive set of evaluation metrics. The fundamental strength of the tool is that the underlying evaluation algorithms are not tied to specific read lengths or error models. Even though the work presents a comprehensive analysis of the accuracy of most of the state-of-the-art error correction methods for the NGS and TGS technologies, the study can be further extended to evaluate TGS reads from larger genomes using more powerful computational resources. It is expected that repeats in a genome would affect the quality of error correction in TGS reads. However, repeat effects will play a significant role in correction only when genome length is sufficiently long. We tried using a PacBio sequence of the 10 Mbp regions of mouse chromosome Y but their results were not worse than those collected for P2 corrected by LoRDEC and Proovread. These results were not included in the main text because the genome was too short for us to reach useful conclusion regarding repeats. The results are in the supplementary document. It is also desirable to study how sequencing errors degrade the quality of downstream analysis pipelines like variant calling. A detailed understanding of the mechanism will yield useful insights into how to correct errors that are detrimental to a specific application and how to make applications less sensitive to sequencing errors, and SPECTACLE can help in categorizing such errors. SOFTWARE AVAILABILITY

SPECTACLE and all the input reads and run scripts used in these experiments are available at https://github.com/gowthami19m/SPECTACLE.

SUPPLEMENTARY DATA

Supplementary data are available online at https://github.com/gowthami19m/SPECTACLE.

KEY POINTS • Correcting errors in high throughput sequencing data can improve the quality of downstream analyses. • We developed a software package that can evaluate the error correction tools for DNA/RNA reads and NGS/TGS reads, and compared 23 state-of-the-art error correction methods. • Recommended tools for Illumina reads are ALLPATHS-LG, BFC, BLESS, Lighter, Quake, QuorUM, and SGA in general, BLESS and Quake for repetitive genomes, and ALLPATHS-LG for reads with under 10 X coverage. • Recommended tools for PacBio reads are Proovread for accurate error correction and LoRDEC for scalability. There is a significant trade-off between output read-length and output read accuracy in the case of PacBio error correction. • Though ONT sequencing reads have a higher error rate than PacBio reads, the two ONT tools improve the accuracy significantly and provide output reads with quality comparable to PacBio error corrected reads. 37

BIOGRAPHY

Yun Heo completed his PhD at the University of Illinois at Urbana-Champaign from the department of Electrical and Computer Engineering. Gowthami Manikandan is a Master’s student in the department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Anand Ramachandran is a PhD student in the department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. Deming Chen is a Professor in the department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign.

REFERENCES

1. Salzberg S, Phillippy A, Zimin A et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms, Genome Research 2012;22:557-567. 2. MacManes MD, Eisen MB. Improving transcriptome assembly through error correction of high-throughput sequence reads, PeerJ 2013;1:e113. 3. Fujimoto MS, Bodily PM, Okuda N et al. Effects of error-correction of heterozygous next-generation sequencing data, BMC Bioinformatics 2014;15:S3. 4. Li H. BFC: correcting Illumina sequencing errors, Bioinformatics 2015:btv290. 5. Schulz MH, Weese D, Holtgrewe M et al. Fiona: a parallel and automatic strategy for read error correction, Bioinformatics 2014;30:i356-363. 6. Heo Y, Wu X-L, Chen D et al. BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads, Bioinformatics 2014;30:1354-1362. 7. Greenfield P, Duesing K, Papanicolaou A et al. Blue: correcting sequencing errors using consensus and context, Bioinformatics 2014:btu368. 8. Salmela L, Schröder J. Correcting errors in short reads by multiple alignments, Bioinformatics 2011;27:1455-1461. 9. Ilie L, Fazayeli F, Ilie S. HiTEC: accurate error correction in high-throughput sequencing data, Bioinformatics 2011;27:295-302. 10. Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting, Genome Biol 2014;15:509. 11. Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data, Bioinformatics 2013;29:308-315. 12. Kelley D, Schatz M, Salzberg S. Quake: quality-aware detection and correction of sequencing errors, Genome Biology 2010;11:R116. 13. Marçais G, Yorke JA, Zimin A. QuorUM: an error corrector for Illumina reads, arXiv preprint arXiv:1307.3515 2013. 38 14. Ilie L, Molnar M. RACER: Rapid and accurate correction of errors in reads, Bioinformatics 2013;29:2490-2493. 15. Yang X, Dorman K, Aluru S. Reptile: representative tiling for short read error correction, Bioinformatics 2010;26:2526-2533. 16. Lim E-C, Müller J, Hagmann J et al. Trowel: a fast and accurate error correction module for Illumina sequencing reads, Bioinformatics 2014;30:3264-3265. 17. Kao W-C, Chan A, Song Y. ECHO: A reference-free short-read error correction algorithm, Genome Research 2011;21:1181-1192. 18. Wirawan A, Harris RS, Liu Y et al. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data, BMC Bioinformatics 2014;15:131. 19. Schröder J, Schröder H, Puglisi SJ et al. SHREC: a short-read error correction method, Bioinformatics 2009;25:2157-2163. 20. Gnerre S, MacCallum I, Przybylski D et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data, Proceedings of the National Academy of Sciences 2011;108:1513-1518. 21. Simpson J, Durbin R. Efficient de novo assembly of large genomes using compressed data structures, Genome Research 2011;22:gr.126953.126111-126556. 22. Luo R, Liu B, Xie Y et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler, GigaScience 2012;1:18. 23. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics 2009;10:57-63. 24. Le H-S, Schulz M, McCauley B et al. Probabilistic error correction for RNA sequencing, Nucleic Acids Research 2013;41:e109-e109. 25. Schadt EE, Turner S, Kasarskis A. A window into third-generation sequencing, Human molecular genetics 2010;19:R227-R240. 26. Quail M, Smith M, Coupland P et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers, BMC Genomics 2012;13:341. 27.

T. Laver, J. Harrison, P.A. O’Neill, K. Moore, A. Farbos, K. Paszkiewicz, et al. Assessing the performance of the Oxford Nanopore Technologies MinION. Biomol Detect Quantif, 3 (2015), pp. 1 –

28. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction, Bioinformatics 2014:btu538. 29. Au KF, Underwood JG, Lee L et al. Improving PacBio long read accuracy by short read alignment, PLoS ONE 2012;7:e46679. 30. Koren S, Schatz M, Walenz B et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads, Nature Biotechnology 2012;30:693-700. 31. Hackl T, Hedrich R, Schultz J et al. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus, Bioinformatics 2014;30:3004-3011. 32.