Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
Can Firtina, Jeremie S. Kim, Mohammed Alser, Damla Senol Cali, A. Ercument Cicek, Can Alkan, Onur Mutlu
Bioinformatics
doi: 10.1093/bioinformatics/xxxxxx
Advance Access Publication Date: Day Month Year
Manuscript Category
Sequence analysis
Department of Computer Science, ETH Zurich, Zurich 8092, Switzerland
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh 15213, PA, USA
Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
∗ To whom correspondence should be addressed (Can Alkan and Onur Mutlu).
Abstract
Motivation:
Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs (bp). These long reads are used to construct an assembly (i.e., the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of bps in these long reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis.
Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads either from a certain sequencing technology or from a small assembly. Such technology dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms and 2) use small chunks of a large genome in order to use all available read sets and to polish large genomes, respectively.
Results:
We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
Contact Authors: [email protected], [email protected]
Supplementary information:
Supplementary data is available at Bioinformatics online.
Availability:
Source code is available at https://github.com/CMU-SAFARI/Apollo
High-Throughput Sequencing (HTS) technologies are being widely used in genomics due to their ability to produce a large amount of sequencing data at a relatively low cost compared to first-generation sequencing methods (Sanger et al., 1977). Despite these advantages, HTS technologies have two significant limitations. The first limitation is that HTS technologies can only sequence fragments of the genome (i.e., reads). This results in the need to reconstruct the original full sequence by either using 1) read alignment, the process of aligning the reads to a reference genome, a genome representative of all individuals within a species, or 2) de novo genome assembly, the process of aligning all reads against each other to construct larger fragments called contigs, by identifying reads that overlap and combining them. The second limitation of HTS technologies is that they introduce non-negligible insertion, deletion, and substitution errors (i.e., ∼ 10 - 15% error rate) into reads. Depending on the method for reconstructing the original sequence, HTS errors often cause either 1) reads aligned to an incorrect location in the reference genome, or 2) erroneously constructed assemblies. These two limitations of HTS technologies are partially mitigated with computationally expensive algorithms such as alignment and assembly construction. Despite the wide availability of these algorithms, imperfect sequencing technologies still affect the reliability of downstream analysis in the genome analysis pipeline (e.g., variant calling).

Based on the average read length and the error profile of their reads, HTS technologies are roughly categorized into two types: (1) second-generation and (2) third-generation sequencing technologies. Second-generation sequencing technologies (e.g., Illumina) generate the most accurate reads ( ∼ ∼ et al. , 2013; Alser et al., 2017; Kim et al., 2018; Alser et al., 2019a,b). Aligners must either deterministically select a matching location, which requires additional computation, or randomly select one of the candidate locations, which results in non-reproducible read alignments (Firtina and Alkan, 2016). In de novo genome assembly, high computational complexity is required to identify overlaps between reads. Even after completing de novo genome assembly, there are often multiple gaps in an assembly (Meltz Steinberg et al., 2017). This means an assembly is composed of many smaller contigs rather than a few long contigs, or in the ideal case, a single genome-sized contig.

Third-generation sequencing technologies (i.e., PacBio’s Single Molecule Real-Time (SMRT) and Oxford Nanopore Technologies (ONT)) are capable of producing long reads ( ∼ ∼ 10 - 15% error rate) (Huddleston et al., 2014; Jain et al., 2018; Payne et al., 2018). Different third-generation sequencing technologies result in different error profiles. For example, PacBio reads tend to have more insertion errors than other error types whereas insertion errors are the least common errors for ONT reads (Weirather et al., 2017). Long reads make it more likely to find longer overlaps between the reads in de novo genome assembly. As a result, there are usually fewer long contigs (Alkan et al., 2011; Chaisson et al., 2015; Meltz Steinberg et al., 2017). Despite this, error-prone reads often result in a highly erroneous assembly, which may not be representative of the subject’s actual genome. As a consequence, any analysis using the erroneous assembly (e.g., identifying variations/mutations in a subject’s genome to determine proclivity for diseases) is often unreliable.

Existing solutions that try to overcome the problem of error-prone assemblies when using de novo genome assembly can be categorized into two types. First, a typical solution is to correct the errors of long reads. Errors are corrected by using high coverage reads (e.g., ∼ × coverage) from the same sequencing technology (i.e., self-correction) or additional reads from more reliable second-generation sequencing technologies (i.e., hybrid correction). There are several available error correction algorithms that use additional reads to locate and correct errors in long reads (e.g., Hercules (Firtina et al., 2018), LoRDEC (Salmela and Rivals, 2014), LSC (Au et al., 2012), and LoRMA (Salmela et al., 2016)). The main disadvantage of error correction algorithms is that they require more sequenced reads from either the same or different sequencing technologies. For example, LoRMA, a self-correction tool, uses reads to build a de Bruijn graph for error correction. The reads corrected using a de Bruijn graph method cannot span even half of the entire genome if the coverage is lower than 100× (Salmela et al., 2016). When the coverage is low, the connections in a de Bruijn graph can be weak. These weak regions can be treated as bulges and tips, and can be removed from the graph (Chaisson et al., 2004), which may fail to create a reliable consensus of the entire genome for error correction. Although hybrid correction tools (e.g., PBcR (Koren et al., 2012)) can use low coverage short reads (e.g., 25×) to correct the long reads that can span 95% of the genome after correction, these hybrid correction tools require additional short reads. Therefore, in both cases (i.e., hybrid and self-correction), generating additional reads (i.e., either additional short reads or high coverage long reads) requires additional cost and time. While a higher-coverage dataset may lead to higher read accuracy (Berlin et al., 2015), the cost of producing a high-coverage dataset for long reads is often prohibitively high (Rhoads and Au, 2015). For example, sequencing the human genome with ONT at only 30× coverage costs around $36,000 (Jain et al., 2018). Unless there exist sufficient resources for multiple sequencing technologies or high coverage, error correction algorithms may not be a viable option to generate accurate assemblies.

The second method for removing errors in an assembly is called assembly polishing. An assembly polishing process attempts to correct the errors of the assembly using the alignments of either long or short reads to the assembly. The read-to-assembly alignment, which is the alignment of the reads to the assembly, allows an assembly polishing algorithm to decide whether the assembly should be polished based on the similarity of the base pairs between the alignments of the reads and their corresponding locations in the assembly. If the assembly polishing algorithm finds a dissimilarity, the algorithm modifies the assembly to make it more similar to the aligned reads as it assumes that the alignment information is a more reliable source.
In other words, the dissimilarity is attributed to errors in the assembly. Assembly polishing algorithms assume that such modifications correct, or polish, the errors of an assembly.

There are various assembly polishing algorithms that use various methods for discovering dissimilarities and modifying the assembly (e.g., Nanopolish (Loman et al., 2015), Racon (Vaser et al., 2017), Quiver (Chin et al., 2013), and Pilon (Walker et al., 2014)). However, the primary limitation of many of these assembly polishing algorithms is that they work only with reads from a limited set of sequencing technologies. For example, Nanopolish can use only ONT long reads (Senol Cali et al., 2019), while Quiver supports only PacBio long reads. Thus, these assembly polishing algorithms are sequencing-technology-dependent. Even though Pilon can use long reads as it does not impose a hard restriction not to use them, Pilon does not suggest using long reads, and it is well tuned for using short reads. Therefore, we consider Pilon as only a partially-sequencing-technology-independent algorithm as it neither prevents nor truly supports using long reads. Even though Racon can use either short or long reads to polish an assembly, it can use only a single set of reads within a single run (e.g., only a set of PacBio reads). This requires an assembly to be polished in multiple runs with Racon to use all the available sets of reads from multiple sequencing technologies (i.e., a hybrid set of reads). There is currently no single assembly polishing algorithm that can polish an assembly with an arbitrary set of reads from various sequencing technologies (e.g., both ONT and PacBio reads) within a single run.

While the technology-dependency problem of such assembly polishing algorithms could be mitigated by consecutively using either different algorithms (e.g., Quiver and Pilon) or the same algorithm multiple times (e.g., running Racon twice to use both PacBio and Illumina reads), there are scalability problems associated with using polishing algorithms to polish a large genome and, therefore, with running assembly polishing algorithms multiple times, for two reasons. First, none of the polishing algorithms can scale well to polish large genomes within a single run as they require large computational resources (e.g., polishing a human genome requires more than 192GB of available memory) unless the coverage of a set of reads is low (e.g., less than 10×). Therefore, these assembly polishing algorithms cannot polish large genomes in a single run if the available computational resources are not tremendous, and they are restricted to polishing smaller parts (e.g., contigs) of a large genome.
Second, dividing a large genome into smaller contigs and running polishing algorithms multiple times requires extra effort to collect and merge the multiple results to produce the polished large genome assembly as a whole.

A universal technology-independent assembly polishing algorithm that can use reads regardless of both 1) the sequencing technology used to produce them and 2) the size of the genome, enables the usage of all available reads for a more accurate assembly compared to using reads from a single sequencing technology. Such a universal assembly polishing algorithm would also not require running assembly polishing multiple times to take advantage of all available reads. Unfortunately, such an assembly polishing algorithm does not exist.

Our goal in this paper is to propose a technology-independent assembly polishing algorithm that enables all available reads to contribute to assembly polishing and that scales well to polish an assembly of any size (e.g., both small and large genome assemblies) within a single run. To this end, we propose a machine learning-based universal technology-independent assembly polishing algorithm, Apollo, that corrects errors in an assembly by using read-to-assembly alignment regardless of the sequencing technology used to generate reads. Apollo is the first universal technology-independent assembly polishing algorithm. Apollo’s machine learning algorithm is based on two key steps: (1) training and (2) decoding the profile hidden Markov model (pHMM) of an assembly. First, Apollo uses the Forward-Backward and Baum-Welch algorithms (Baum, 1972) to train the pHMM by calculating the probability of the errors based on aligned reads. Error probabilities in the pHMM reveal how reads and the assembly that the reads align to are similar to each other without making any assumptions on the sequencing technology used to produce the reads. This is the key feature that makes Apollo sequencing-technology-independent.
Second, Apollo uses the Viterbi algorithm (Viterbi, 1967) to decode the trained pHMM to correct the errors of an assembly.

Apollo employs a recent pHMM design (Firtina et al., 2018), as this design addresses the computational problems that make pHMMs otherwise impractical to use for training in machine learning. The design of the pHMM enables flexibility in adapting the pHMM based on the error profile of the underlying sequencing technology of an assembly. Therefore, Apollo can additionally apply the known error profile of a sequencing technology to improve upon its error probability calculations.

We compare Apollo with Nanopolish, Racon, Quiver, and Pilon using datasets that are sequenced with different technologies: Escherichia coli K-12 MG1655 (MinION and Illumina), Escherichia coli O157 (PacBio and Illumina), Escherichia coli O157:H7 (PacBio and Illumina), Yeast S288C (PacBio and Illumina), and the human Ashkenazim trio sample (HG002, PacBio and Illumina). We compare our polished assemblies against highly accurate and finished genome assemblies of the corresponding samples to determine the accuracy of the various assembly polishing algorithms.

Using the datasets from different sequencing technologies, we first show that Apollo scales better than other polishing algorithms in polishing assemblies of large genomes using moderate and high coverage reads. Second, Apollo is the only algorithm that can use reads from multiple sequencing technologies in a hybrid manner (e.g., using both ONT and Illumina reads in a single run). Because of this, Apollo scales well to polish an assembly of any size within a single run using any set of reads, which makes Apollo a universal, sequencing-technology-independent assembly polishing algorithm. Third, we show that when Apollo uses a hybrid set of reads (i.e., both PacBio and Illumina reads), it polishes assemblies generated by Canu (Koren et al., 2017) (i.e., Canu-generated assemblies) more accurately than any other polishing algorithm.
Fourth, for all other remaining cases, when we compare Apollo to other competing algorithms, our experiments show that Apollo usually produces assemblies of similar accuracy to competing algorithms: Nanopolish, Pilon, Racon, and Quiver. However, when using long read sets to polish Miniasm-generated E. coli O157:H7, E. coli K-12, and Yeast S288C assemblies, Apollo produces assemblies with lower accuracy than those of Racon and Quiver. These experiments are based on 1) a ground truth (i.e., reference-dependent comparison), 2) k-mer similarity calculation (i.e., Jaccard similarity (Niwattanakul et al., 2013)) between an Illumina set of reads and a polished assembly, and 3) the quality assessment of the assembly from mapped short reads (i.e., reference-independent comparison). These comparisons show that Apollo can polish an assembly using reads from any sequencing technology while still generating an assembly with accuracy usually comparable to the competing algorithms. Fifth, we use moderate long read coverage datasets (e.g., 30×) and show that Apollo can produce accurate assemblies even with a moderate read coverage. We conclude that Apollo is the first universal assembly polishing algorithm that 1) scales well to polish assemblies of both large and small genomes, and 2) can use both long and short reads as well as a hybrid set of reads from various sequencing technologies.

This paper makes the following contributions:
• We introduce Apollo, a new assembly polishing algorithm that can make use of reads sequenced by any sequencing technology (e.g., PacBio, ONT, Illumina reads). Apollo is the first assembly polishing algorithm that 1) is scalable such that it can polish assemblies of both large and small genomes, and 2) can polish an assembly with a hybrid set of reads within a single run.
• We show that using both long and short reads in a hybrid manner to polish a Canu-generated assembly enables the construction of assemblies more accurate than those constructed by running other polishing tools multiple times.
• We show that four competing polishing algorithms cannot scale well to polish assemblies of large genomes within a single run due to the large computational resources that they require.
• We provide an open source implementation of Apollo (https://github.com/CMU-SAFARI/Apollo).
Apollo builds, trains, and decodes a profile hidden Markov model graph (pHMM-graph) to polish an assembly (i.e., to correct the errors of an assembly). Apollo performs assembly polishing using two input preparation steps that are external to Apollo (pre-processing) and three internal steps, as shown in Figure 1. The two pre-processing steps involve the use of external tools such as an assembler and an aligner to generate inputs for Apollo. First, an assembler uses reads (e.g., long reads) to generate assembly contigs (i.e., larger sequence fragments of the assembly). Second, an aligner aligns the reads used in the first step and any additional reads (e.g., short reads) of the same sample to the contigs to generate read-to-assembly alignment. Third, Apollo uses the assembly generated in the first step to construct a pHMM-graph per contig. A pHMM-graph is comprised of states, transitions between states, and probabilities that are associated with both states and transitions to account for all possible error types. Examples of errors that a sequencing technology can introduce into a read are insertion, deletion, and substitution errors (which we handle in this work), and chimeric errors (which we do not handle). Insertion, deletion, and substitution errors can be corrected by deleting, inserting, or substituting the corresponding base pair, respectively. Apollo identifies a path in the pHMM-graph such that the states that make the contig erroneous are excluded. Fourth, Apollo uses the read-to-assembly alignment to update, or train, the initial (prior) probabilities of the pHMM-graph with the Forward-Backward and Baum-Welch algorithms. During training, the Forward-Backward algorithm uses each read alignment to change the prior probabilities of the graph based on the similarity between a read and the aligned region in the assembly.
Fifth, Apollo implements the Viterbi algorithm to find the path in the pHMM-graph with the minimum error probability (i.e., decoding), which corresponds to the polished version of the corresponding contig.
An assembler takes a set of reads as input and identifies the overlaps between the reads in order to merge the overlapped regions into larger fragments called contigs. An assembler usually reports contigs in FASTA format (Pearson and Lipman, 1988) where each element is comprised of an ID and the full sequence of the contig. The entire collection of contigs represents the whole assembly. Apollo requires the assembly to be constructed to correct the errors in each contig of the assembly. Thus, assembly generation is an external step to the assembly polishing pipeline of Apollo (Figure 1 Step 1). Apollo supports the use of any assembler that can produce the assembly in FASTA format (Pearson and Lipman, 1988), such as Canu (Koren et al., 2017) and Miniasm (Li, 2016).
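As an illustration of the input described above (each FASTA element being an ID plus the full contig sequence), the following is a minimal sketch of a FASTA reader. It is not Apollo's implementation, which is written in C++ with SeqAn; the function name `read_fasta` and the example contigs are hypothetical, and taking the first token after `>` as the contig ID is an assumption of this sketch.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {contig_id: sequence} mapping."""
    parts: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig ID is the first token after '>'.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            parts[current] = []
        elif current is not None:
            # Sequence lines of one contig may be wrapped; collect them all.
            parts[current].append(line.upper())
    return {contig_id: "".join(chunks) for contig_id, chunks in parts.items()}


# Hypothetical two-contig assembly, used only for illustration.
example = [">contig_1 len=10", "AGCACC", "GCCT", ">contig_2", "acgt"]
assembly = read_fasta(example)
print(assembly)
```

Each value in `assembly` is one contig, which is the unit for which a separate pHMM-graph is later built.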
After assembly construction, the second external step is to generate the read-to-assembly alignment using 1) the reads that the assembler used to construct the assembly and 2) any additional reads sequenced from the same sample (Figure 1 Step 2). It is possible to use any aligner that can produce the read-to-assembly alignment in SAM/BAM format (Li et al., 2009) such as Minimap2 (Li, 2018) or BWA-MEM (Li and Durbin, 2009). In the case where reads from multiple sequencing technologies are available for a given sample, an aligner aligns all reads to the assembly. Apollo assumes that the alignment file is coordinate sorted and indexed.

Apollo uses the assembly and the read-to-assembly alignment generated in the first two pre-processing steps in its assembly polishing steps. The next three steps (Steps 3-5) are the assembly polishing steps and are implemented within Apollo.
The pHMM-graph that Apollo employs includes states that emit certain characters, directed transitions that connect a state to other states, and probabilities associated with character emissions and state transitions. The state transition probability represents the likelihood of following a path from a state to another state using the transitions connecting the states, and the character emission probability represents the likelihood for a state to emit a certain base pair when the state is visited. These pHMM-graph elements enable a pHMM-graph to provide the probability of generating a certain sequence when a certain path of states is followed using the directed transitions between the states.

This probabilistic behavior of pHMM-graphs makes them a good candidate to resolve errors of an assembly. Apollo represents each contig of an assembly as a pHMM-graph. The complete structure of a pHMM-graph allows Apollo to handle three major types of errors: substitution, deletion, and insertion errors. First, Apollo represents each base pair of a contig as a state, called the match state. The pHMM-graph preserves the sequence order of the contig by inserting a directed match transition from the previous match state of a base pair to the next one. The match state of a certain base pair has a predefined (prior) match emission probability for the corresponding base pair, and mismatch emission probability for the three remaining possible base pairs (i.e., a substitution error). A match state handles the cases when there is no error in the corresponding base pair (i.e., emitting the base pair that already exists in the certain position), or when there is a substitution error (i.e., emitting a different base pair for the certain position). Second, there are l many insertion states for each base pair in the contig where l is a parameter to Apollo, which defines the maximum number of additional base pairs that can be inserted between two base pairs (i.e., two match states).
An insertion state inserts a single base pair in the location it corresponds to (e.g., visiting two subsequent insertion states after a match state inserts two base pairs between the two match states) in order to handle a deletion error. Last, each match and insertion state has k many deletion transitions where k is also a parameter to Apollo, which defines the maximum number of contiguous base pairs that can be deleted with a single transition. If there is an insertion error, a deletion transition skips the match states between a state (e.g., an insertion or a match state) and a match state in order to delete the corresponding base pairs of the skipped match states. Further details of the pHMM-graph can be found in Supplementary Materials (Section 1).

Fig. 1. Input preparation and the pipeline of the Apollo algorithm in five steps. The first two steps refer to the use of external tools to generate the input for Apollo and are called input preparation steps (left side). (Step 1) An assembler generates the assembly (dark gray, large rectangles) using erroneous reads (light blue rectangles). Here the errors are labeled with the red bars inside the rectangles. (Step 2) An aligner aligns the reads used in the first step as well as additional reads to the assembly. Here we show the reads sequenced using different sequencing technologies in different colors and sizes (e.g., a short rectangle indicates a short read) since it is possible to use any available read within a single run with Apollo. The remaining three steps constitute the new Apollo algorithm and are called Internal to Apollo (right side). (Step 3) Apollo creates a profile hidden Markov model graph (pHMM-graph) per assembly contig. Here, we show an example for the pHMM-graph generated for the contig that starts with "AGCACC" and ends with "GCCT" as we show the original sequence below the states labeled with a base pair. Each base pair in a contig is represented by a state labeled with the corresponding base pair (i.e., match state). A pHMM graph also consists of insertion states for each base pair labeled with green color as well as start and end states that do not correspond to any base pair in a contig. In this example, the maximum insertion that can be made between each base pair is two as we have two insertion states per match state. Each transition or emission of a base pair from a state has a probability associated with it. For simplicity, we omit deletion transitions from this graph. (Step 4) The Forward-Backward algorithm trains the pHMM-graph and updates the transition and emission probabilities based on read-to-assembly alignments. (Step 5) Using the updated probabilities, the Viterbi algorithm decodes the most likely path in the pHMM-graph and takes the path marked with the red transitions and states, which corresponds to the polished assembly. We also show the corresponding corrections in red text color below the states. For each contig, the output of Apollo is the sequence of base pairs associated with the states in the most likely path.

The pHMM-graph structure that Apollo uses is identical to the one proposed in Hercules (Firtina et al., 2018), a recently proposed error correction algorithm that uses pHMM-graphs. The key difference is that Apollo creates a graph for each contig whereas Hercules creates a graph for each read. As such, the pHMM-graph size in Apollo is usually larger than that in Hercules since contigs are typically longer than reads. Therefore, Apollo uses additional techniques to handle large pHMM-graphs (e.g., dividing pHMM-graphs into smaller graphs without compromising correction accuracy) during both training and decoding steps, which has certain trade-offs with respect to implementation, as we explain in Sections 2.4, 2.5, and 3.1.
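To make the graph structure described in this section concrete, the following sketch builds only the topology (states and directed transitions, without probabilities) of a pHMM-graph for one contig. It is a simplified illustration under stated assumptions, not Apollo's data structure: the function name `build_phmm_topology` and the tuple encoding of states are hypothetical, the start state is given only a match transition, and deletion transitions are assumed to target match states only.

```python
from typing import Dict, List, Tuple

# Hypothetical state encoding: (kind, contig position, insertion index).
State = Tuple[str, int, int]


def build_phmm_topology(contig: str, l: int, k: int) -> Dict[State, List[State]]:
    """Return an adjacency list with l insertion states per match state
    and up to k deletion transitions per match or insertion state."""
    n = len(contig)
    start: State = ("start", 0, 0)
    end: State = ("end", n + 1, 0)

    def match_state(t: int) -> State:
        # Position n + 1 is the end state; positions 1..n are match states.
        return end if t == n + 1 else ("M", t, 0)

    graph: Dict[State, List[State]] = {start: [match_state(1)], end: []}
    for t in range(1, n + 1):
        insertions = [("I", t, i) for i in range(1, l + 1)]
        sources = [match_state(t)] + insertions
        for idx, src in enumerate(sources):
            # Match transition to the next match state (or the end state).
            targets = [match_state(t + 1)]
            # Insertion transition to the next insertion state, if any remain.
            if idx < l:
                targets.append(insertions[idx])
            # Deletion transitions skipping 1..k match states (assumption:
            # only match states, not the end state, are valid targets).
            for skipped in range(1, k + 1):
                if t + 1 + skipped <= n:
                    targets.append(match_state(t + 1 + skipped))
            graph[src] = targets
    return graph


# Illustrative parameters: two insertion states per base, one-base deletions.
graph = build_phmm_topology("AGCACC", l=2, k=1)
print(len(graph), graph[("M", 1, 0)])
```

With `l=2` the sketch mirrors the two insertion states per match state shown in Figure 1; the number of states grows as roughly `n * (l + 1)`, which is why graphs built per contig are much larger than graphs built per read.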
The training step of Apollo uses each read-to-assembly alignment to update transition and emission probabilities of a contig’s pHMM-graph. The purpose of the training step is to make specific transitions and emissions more probable in a sub-graph of the pHMM-graph such that it will be more likely to emit the entire read sequence for the region that the read aligns to. A sub-graph contains a subset of the states of a pHMM-graph and the transitions connecting these states. Each difference between a contig and the aligned read updates the probabilities so that it will be more likely to reflect the difference observed in the read. The calculations during training do not make assumptions about the sequencing technology of the read but only reflect the differences and similarities in the pHMM-graph. Thus, Apollo can update the sub-graph with any read aligned to the contig. This makes Apollo a sequencing-technology-independent algorithm.

For each alignment to a contig, Apollo identifies the sub-graph that the read aligns to in the pHMM graph to update (train) the emission and transition probabilities in the sub-graph. Apollo locates the start and end states of the sub-graph to define its boundaries in the pHMM graph. First, Apollo identifies the start location of a read’s alignment in the contig and marks the match state of the previous base pair as the start state. Second, Apollo estimates the location of the end state such that the number of match states between the start state and the end state is larger than the length of the aligned read (i.e., up to . % longer). This is to account for the case where there are more insertion errors than deletion errors. The Backward calculation uses the end state as the initial point to calculate the probabilities backward, as we explain later in this section. An accurate estimation of the end state is crucial as an inaccurate initial point for the Backward calculation may lead to inaccurate training.
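The boundary selection described above can be sketched as a small helper. This is an illustration only: the function name `subgraph_bounds`, the 1-based coordinate convention, and clamping at the end of the contig are assumptions of this sketch, and `extra_ratio` is left as a required parameter because the actual percentages are not reproduced here.

```python
import math
from typing import Tuple


def subgraph_bounds(align_start: int, read_len: int, contig_len: int,
                    extra_ratio: float) -> Tuple[int, int]:
    """Return (start_pos, end_pos) of a sub-graph in match-state positions.

    align_start is the 1-based contig position of the first aligned base.
    Position 0 stands for the graph's start state and contig_len + 1 for
    its end state.
    """
    # Start state: the match state of the base pair before the alignment.
    start_pos = align_start - 1
    # Number of match states strictly between the start and end states:
    # the read length plus a margin that leaves room for extra bases.
    span = read_len + math.ceil(read_len * extra_ratio)
    # Assumption: clamp to the end state when the margin overruns the contig.
    end_pos = min(start_pos + span + 1, contig_len + 1)
    return start_pos, end_pos


# Hypothetical margin of 50%, chosen only to make the arithmetic visible.
print(subgraph_bounds(align_start=11, read_len=100, contig_len=10_000,
                      extra_ratio=0.5))
```

In this sketch the margin is a single parameter; choosing different margins for different read lengths only changes the value passed as `extra_ratio`.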
The insertion and the match states between the start and the end states as well as the transitions connecting these states constitute the sub-graph of the aligned region.

The sub-graphs that Apollo trains usually vary in size since the length of long reads (i.e., reads sequenced by the third-generation sequencing technologies) can fluctuate dramatically (e.g., from 15bps to 2Mbps) whereas the length of short reads is usually fixed (e.g., 100bps). As Apollo polishes the assembly using both short and long reads, the broad range of read lengths requires Apollo to be flexible in terms of defining the length of the sub-graph (i.e., the number of match states that the sub-graph includes) to train. This is a key difference in requirements between Apollo and Hercules (Firtina et al., 2018). Hercules defines the number of match states to include in a sub-graph with a fixed ratio as the aligned reads are always short reads. However, Apollo is more flexible in the selection of the region that a sub-graph covers since Apollo can use reads of any length. Apollo decides whether the aligned read is short or long based on the read length, of which we set the threshold at 500bps (i.e., if a read is longer than 500bps, it is considered as a long read). If the aligned read length is short (i.e., shorter than 500bps), the sub-graph is . % longer than the length of the short read. Otherwise, the sub-graph is % longer than the length of the aligned long read (empirically chosen).

Apollo uses the Forward-Backward and the Baum-Welch algorithms (Baum, 1972) to train the sub-graph that a read aligns to. The Forward-Backward algorithm takes the aligned read as an observation and updates the emission and transition probabilities of the states in the sub-graph. There are three steps in the Forward-Backward algorithm: 1) Forward calculation, 2) Backward calculation, and 3) training by updating the probabilities (i.e., the expectation-maximization step using the Baum-Welch algorithm).
First, Forward calculation visits each possible path from the start state up to but not including the end state until each visited state emits a single base pair from the read starting from the first (i.e., leftmost) base pair. Therefore, the number of visited states is equal to the length of the aligned read. Second, similar to Forward calculation, Backward calculation visits each possible path in a backward fashion (i.e., from the last base pair to the first base pair) starting with the state that the Forward calculation determines to be the most likely until the start state. Third, the Forward-Backward algorithm updates the transitions and emission probabilities based on how likely it is to take a certain transition or a state to emit a certain character. We refer to the updated probabilities as posterior probabilities. In theory, the training step known as the Baum-Welch algorithm (Baum, 1972) is separated from the Forward-Backward calculations, as described in Section 3 of Supplementary Materials. However, for the sake of simplicity, we assume that the Forward-Backward step includes both the Forward-Backward calculations and the training step when we refer to it in the remaining part of this paper. Apollo trains each sub-graph (i.e., each read alignment) independently even though the states and the transitions may overlap between the aligned reads. For overlaps, Apollo takes the average of the posterior transition and emission probabilities of the overlapping regions.

Once Apollo trains each pHMM sub-graph using all the alignments to a contig, it completes the training phase for that contig. The trained pHMM-graph represents the polished version of the contig. Sections 2 and 3 in the Supplementary Materials describe in detail how Apollo locates a sub-graph per read alignment and the training phase of the Forward-Backward algorithm.
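The three steps above follow the textbook Forward-Backward and Baum-Welch recipe. The sketch below shows that recipe on a generic, fully connected discrete HMM rather than on Apollo's pHMM-graph (whose start/end states, insertion states, and deletion transitions are not modeled here). The function names and the toy parameters are hypothetical, and only the emission update of Baum-Welch is shown.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def forward_backward(obs: List[int], init: List[float], trans: Matrix,
                     emit: Matrix) -> Tuple[Matrix, float]:
    """Return (gamma, likelihood) for a non-empty observation sequence,
    where gamma[t][i] is the posterior probability of state i at step t."""
    n, steps = len(init), len(obs)
    fwd = [[0.0] * n for _ in range(steps)]
    bwd = [[1.0] * n for _ in range(steps)]
    # 1) Forward calculation: probability of the prefix ending in state j.
    for i in range(n):
        fwd[0][i] = init[i] * emit[i][obs[0]]
    for t in range(1, steps):
        for j in range(n):
            reach = sum(fwd[t - 1][i] * trans[i][j] for i in range(n))
            fwd[t][j] = reach * emit[j][obs[t]]
    # 2) Backward calculation: probability of the suffix given state i.
    for t in range(steps - 2, -1, -1):
        for i in range(n):
            bwd[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * bwd[t + 1][j]
                            for j in range(n))
    likelihood = sum(fwd[steps - 1])
    if likelihood == 0.0:
        raise ValueError("observation has zero probability under the model")
    gamma = [[fwd[t][i] * bwd[t][i] / likelihood for i in range(n)]
             for t in range(steps)]
    return gamma, likelihood


def reestimate_emissions(obs: List[int], gamma: Matrix, n_symbols: int) -> Matrix:
    """3) Training: Baum-Welch update of the emission probabilities."""
    n = len(gamma[0])
    counts = [[0.0] * n_symbols for _ in range(n)]
    for t, symbol in enumerate(obs):
        for i in range(n):
            counts[i][symbol] += gamma[t][i]
    # States with no posterior mass keep an all-zero row in this sketch.
    return [[c / sum(row) for c in row] if sum(row) > 0 else row
            for row in counts]


# Toy two-state model with a two-letter alphabet (illustrative values only).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1]
gamma, likelihood = forward_backward(obs, init, trans, emit)
posterior_emit = reestimate_emissions(obs, gamma, n_symbols=2)
print(likelihood, posterior_emit)
```

This sketch multiplies raw probabilities, which underflows for long observations; practical implementations typically use scaling or log-space arithmetic. Averaging posterior probabilities across overlapping reads, as described above, would be an additional step on top of this per-read update.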
The last step in Apollo’s assembly polishing mechanism is the decoding of the trained pHMM-graph in order to extract the path with the highest probability from the start of the graph to the end of the graph. Finding the path with the highest probability reveals the consensus of the aligned reads to correct the contig. To identify this path, Apollo uses the Viterbi algorithm (Viterbi, 1967) on the trained pHMM-graph (Figure 1 Step 5). The Viterbi algorithm is a dynamic programming algorithm that finds the most likely backtrace from a certain state to the start state in a given graph. Each Viterbi value represents how likely it is to be in a certain state at a time t (i.e., position in the contig) and is stored in the corresponding cell in a table called a dynamic programming table (DP table). Thus, a complete DP table reveals the most likely path of the entire pHMM-graph by backtracking the most likely path from the end state to the start state.
The Viterbi algorithm computes each entry of the dynamic programming table using the Viterbi values of the previously visited states. This data dependency makes the Viterbi algorithm less suitable for multi-threading support, as it prevents calculating the Viterbi values of the entire graph in parallel. Apollo overcomes this issue by dividing the pHMM-graph into sub-graphs (i.e., chunks), each of which includes a certain number of states. The Viterbi algorithm decodes each sub-graph (i.e., finds the optimal path in a graph) and merges the decoding results into one piece again. Since the Viterbi algorithm can decode each sub-graph independently, this allows Apollo to parallelize the Viterbi algorithm. We find that our parallelization greatly speeds up the Viterbi algorithm, by ∼ × .
Apollo follows a slightly different approach than the actual Viterbi algorithm when decoding a graph. The actual Viterbi algorithm uses an observation provided as input (i.e., a sequence of base pairs) to calculate the Viterbi values of states in the graph. 
For Apollo, there is no observation provided as input. Apollo uses the base pair with the highest emission probability of a state as observation when calculating the Viterbi value of that state. For each state in the decoded path, Apollo outputs the base pair with the highest probability, which corresponds to the polished contig. Apollo reports each polished contig as a read in FASTA format. Details of the Viterbi algorithm are in Supplementary Materials (Section 4).
Note that Apollo can only polish contigs to which at least a single read aligns. Thus, Apollo reports an unpolished version of a contig if there is no read aligned to it. In such cases, Apollo also reports the issue as output by informing that a certain contig cannot be polished because there is no read aligned to the contig. After raising the issue, Apollo continues polishing the remaining contigs, if any. We expect that such a case happens rarely. For example, a low coverage set of short reads may not be able to align to a very small and erroneous contig constructed using long reads, which would leave the contig with no read aligned to it. Another example would be having very similar regions (i.e., repetitive regions) in multiple contigs such that reads can be assigned to only one of the contigs sharing a similar region. Such a case may leave a contig without any read aligned to it since these reads may already be aligned to the similar regions in other contigs.
We implemented Apollo in C++ using the SeqAn library (Döring et al., 2008). The source code is available at https://github.com/CMU-SAFARI/Apollo. Apollo supports multi-threading.
Our evaluation criteria include three different methods to assess the quality of the assemblies. First, we use the dnadiff tool provided under the MUMmer package (Kurtz et al., 2004) to calculate the accuracy of polished assemblies by comparing them with the highly-accurate reference genomes (i.e., ground truth genomes). We report the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy), and a score value that is the product of accuracy and number of aligned bases (as a fraction), which we call the Polishing Score. The Accuracy value provides the accuracy of only the aligned portions of the polished assembly, not the entire assembly. The polishing score, however, is a more comprehensive measure than accuracy, as it normalizes the accuracy of the aligned portions of the polished assembly to the entire length of the assembly. Second, we use sourmash (Titus Brown and Irber, 2016) to calculate the k-mer similarity between filtered Illumina reads and an assembly. Third, we use QUAST (Gurevich et al., 2013) to report a further quality assessment of assemblies based on the mapping of filtered Illumina reads to assemblies. Both k-mer similarity and QUAST provide a reference-independent evaluation of assemblies.
Based on our evaluation criteria, we compare Apollo to four state-of-the-art assembly polishing algorithms: Nanopolish (Loman et al., 2015), Racon (Vaser et al., 2017), Quiver (Chin et al., 2013), and Pilon (Walker et al., 2014). If an assembly polishing algorithm does not support a certain dataset, we do not run the algorithm on that dataset. For example, we use Nanopolish only for the ONT dataset, Quiver only for PacBio datasets, and Pilon only for the Illumina dataset. We use Pilon with a PacBio dataset only once to show its capability to polish an assembly using long reads, albeit very inefficiently. We include Apollo and Racon in every comparison as they support a set of reads from any sequencing technology. For each dataset, we compare the algorithms that polish an assembly using the same set of reads. 
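To make the relationship between the three dnadiff-based metrics concrete, the short Python sketch below computes the polishing score from the aligned bases percentage and the accuracy. The function name `polishing_score` is hypothetical and chosen only for illustration. The example input uses the rounded values reported in Table 3, so the result agrees with the table only up to rounding.

```python
def polishing_score(aligned_bases_percent: float, accuracy: float) -> float:
    """Accuracy multiplied by the aligned-bases percentage expressed as a fraction."""
    if not 0.0 <= aligned_bases_percent <= 100.0:
        raise ValueError("aligned_bases_percent must be within [0, 100]")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be within [0, 1]")
    return accuracy * (aligned_bases_percent / 100.0)

# Unpolished Canu assembly of E. coli O157 in Table 3:
# 99.94% aligned bases and an accuracy of 0.9998.
print(round(polishing_score(99.94, 0.9998), 4))  # 0.9992
```

Because the score multiplies the two quantities, an assembly that is very accurate in its aligned portions but aligns poorly to the reference still receives a low score.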
We run each assembly polishing algorithm with its default parameters.
We run all the tools (i.e., assemblers, read mappers, and assembly polishing algorithms) on a server with 24 cores (2 threads per core, Intel® Xeon® Gold 5118 CPU @ 2.30GHz) and 192GB of main memory. We assign 45 threads to all the tools we use and collect their runtime and memory usage using the time command in Linux with the -vp options. We report runtime and peak memory usage of the assembly polishing algorithms based on these configurations.
We use state-of-the-art tools to construct an assembly and to generate a read-to-assembly alignment before running Apollo, which correspond to the input preparation steps. We use Canu (Koren et al., 2017) and Miniasm (Li, 2016) tools to construct assemblies of each set of long reads. For read-to-assembly alignment, we use Minimap2 and BWA-MEM to align long and short reads to an assembly. Quiver cannot work with alignment results that Minimap2 and BWA-MEM produce, but requires a certain type of aligner to align PacBio reads to an assembly. Thus, we use the pbalign tool (https://github.com/PacificBiosciences/pbalign) that uses BLASR (Chaisson and Tesler, 2012) to align PacBio reads to an assembly in order to generate a read-to-assembly alignment in the format that Quiver requires. We sort and index the resulting SAM/BAM read-to-assembly alignments using the SAMtools’ sort and index commands (Li et al., 2009), respectively.
After assembly generation, we divide the long reads into smaller chunks of size 1000bps (i.e., we perform chunking). We do this because long reads cause high memory demand during the assembly polishing step, especially for large genomes (e.g., a human genome). This bottleneck exists not only for Apollo but also for other assembly polishing algorithms (e.g., Racon). For Apollo, dividing long reads into chunks prevents possible memory overflows due to the memory-demanding calculation of the Forward-Backward algorithm. 
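The chunking step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the experiments: the function name `chunk_read`, the `_chunk<N>` naming scheme, and the choice to keep a shorter final chunk are all assumptions.

```python
from typing import List, Tuple

def chunk_read(name: str, sequence: str, chunk_size: int = 1000) -> List[Tuple[str, str]]:
    """Split one read into consecutive, non-overlapping chunks.

    Returns (chunk_name, chunk_sequence) pairs. The last chunk may be
    shorter than chunk_size (an assumption of this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (f"{name}_chunk{start // chunk_size}", sequence[start:start + chunk_size])
        for start in range(0, len(sequence), chunk_size)
    ]

# A 2,500bp read yields two full chunks and one shorter chunk.
chunks = chunk_read("read1", "ACGT" * 625)
print([len(seq) for _, seq in chunks])  # [1000, 1000, 500]
```

Shorter chunks bound the size of the sub-graph that each alignment covers, which is what limits memory use during training.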
Even though it is still possible to use long reads without chunking, we suggest using the resulting reads after chunking if the available memory is not sufficient to run Apollo. We show that chunking results in producing more accurate assemblies (Supplementary Table S18).
Default parameters of Apollo are as follows: minimum mapping quality (q = 0), maximum number of states that Forward-Backward (f = 100) and the Viterbi algorithms (v = 5) evaluate for the next time step, the number of insertion states per base pair (i = 3), the number of base pairs decoded per sub-graph by Viterbi (b = 5000), maximum deletions per transition (d = 10), transition probability to a match state ( tm = 0 . ), transition probability to an insertion state ( ti = 0 . ), factor for the polynomial distribution to calculate each deletion transition ( df = 2 . ), and match emission probability ( em = 0 . ).
In our experiments, we use DNA-seq datasets from five different samples sequenced by multiple sequencing technologies, as we show in Table 1. We use a dataset from a large genome (i.e., a human genome) to demonstrate the scalability of polishing algorithms. For this purpose, we use the human genome sample from the Ashkenazim trio (HG002, Son) to compare the computational resources (i.e., time and maximum memory usage) that each polishing algorithm requires. We filter out the PacBio reads that have a length of less than 200bps before calculating coverage and average read length.
We use the E. coli O157 (Strain FDAARGOS_292), E. coli O157:H7, E. coli K-12 MG1655, and Yeast S288C datasets to evaluate the polishing accuracy of Apollo and other state-of-the-art polishing algorithms in four ways. First, we evaluate whether using a hybrid set of reads with Apollo results in more accurate assemblies compared to polishing an assembly twice using a combination of other polishing tools (e.g., Racon + Pilon).
Table 1. Details of our datasets
Dataset | Accession Number | Details
E. coli K-12 - ONT | Loman Lab∗ | × coverage)
E. coli K-12 - Illumina | SRA SRR1030394 | 2,720,956 paired-end reads (avg. 243bps each, 285× coverage)
E. coli K-12 - Ground Truth | GenBank NC_000913 | Strain MG1655 (4,641Kbps)
E. coli O157 - PacBio | SRA SRR5413248 | 177,458 reads (avg. 4,724bps, 151× coverage)
E. coli O157 - Illumina | SRA SRR5413247 | 11,856,506 paired-end reads (150bps each, 643× coverage)
E. coli O157 - Ground Truth | GenBank NJEX02000001 | Strain FDAARGOS_292 (5,566Kbps)
E. coli O157:H7 - PacBio | SRA SRR1509640 | 76,279 reads (avg. 8,270bps, 112× coverage)
E. coli O157:H7 - Illumina | SRA SRR1509643 | 2,978,835 paired-end reads (250bps each, 265× coverage)
E. coli O157:H7 - Ground Truth | GCA_000732965 | Strain EDL933 (5,639Kbps)
Yeast S288C - PacBio | SRA ERR165511(8-9), ERR1655125 | 296,485 reads (avg. 5,735bps, 140× coverage)
Yeast S288C - Illumina | SRA ERR1938683 | 3,318,467 paired-end reads (150bps each, 82× coverage)
Yeast S288C - Ground Truth | GCA_000146055.2 | Strain S288C (12,157Kbps)
Human HG002 - PacBio | SRA SRR2036(394-471), SRR203665(4-9) | 15,892,517 reads (avg. 6,550bps, 35× coverage)
Human HG002 - Illumina | SRA SRR17664(42-59) | 222,925,733 paired-end reads (148bps each, 22× coverage)
Human HG002 - Ground Truth | GCA_001542345.1 | Ashkenazim trio - Son (2.99Gbps)
The datasets we use in our experiments. This data can be accessed through NCBI using the accession number. ∗ The ONT datasets are available at http://lab.loman.net/2016/07/30/nanopore-r9-data-release/
Second, we measure the performance of the polishing algorithms when they polish the assemblies only once. Third, we subsample the E. coli O157 and E. coli K-12 datasets into 30× coverage to compare the performance of algorithms when long read coverage is moderate. Fourth, we additionally use the Human HG002 dataset to measure the k-mer distance and quality assessment of the assemblies using sourmash and QUAST, respectively.
We use the polishing algorithms to polish a large genome assembly (e.g., a human genome) to observe (1) whether the polishing algorithms can polish these large assemblies without exceeding the limitations of the computational resources we use to conduct our experiments and (2) the overall computational resources required to polish a large genome assembly (i.e., alignment and polishing). For this purpose, we use the PacBio and Illumina reads from the human genome sample of the Ashkenazim trio (HG002, Son) to polish a finished assembly of the same Ashkenazim trio sample. The finished assembly was released by the Genome in a Bottle (GIAB) consortium (genomeinabottle.org). GIAB used 1) Celera Assembler with PbCR (v. 8.3rc2) (Koren et al., 2012) to assemble the PacBio reads from the HG002 sample and 2) Quiver to polish the assembly (Wenger et al., 2019).
Based on our experiments that we report in Table 2, we make four key observations. First, Pilon, Quiver, and Racon cannot polish the assembly using the whole sets of PacBio ( ∼ × coverage) and Illumina ( ∼ × coverage) reads due to high computational resources that they require. Racon and Pilon exceed the memory limitations while using either the PacBio or Illumina reads to polish the human genome assembly. Quiver cannot start polishing the assembly as the required aligner (i.e., BLASR from the pbalign tool) cannot produce the alignment result due to memory limitations. Apollo can polish an assembly using both PacBio and Illumina reads using at most nearly half of the available memory. 
Second, we reduce the coverage of the PacBio reads to 8.9× (SRA SRR2036394-SRR2036422) to observe whether Racon and Quiver can polish the large genome using a low coverage set of PacBio reads. We find that Racon is able to polish a human genome assembly using a low coverage set of reads whereas BLASR cannot produce the alignment results that Quiver requires due to memory limitations even when using a low coverage set of reads. Third, we split the read-to-assembly alignment into multiple alignment files such that all reads mapped to each contig are represented in a separate alignment file (i.e., read-to-contig alignment) to evaluate whether Pilon, Quiver, and Racon can polish the entire human genome using read-to-contig alignments. We observe that Pilon, Quiver, and Racon can polish contigs of a large genome, as Table 2 shows. We note that when using pbalign, we align small batches of PacBio datasets (e.g., 1× coverage each) and later merge the alignments of these small batches. We also note that both the size of the longest contig (i.e., 35.2Mbp) and the number of short read alignments to the longest contig (i.e., 5,313,903) are ∼ × smaller than those of the entire assembly. When contigs longer than 35Mbp are available, we expect Pilon and Racon to require more memory for polishing longer contigs since these tools cannot scale well with contig size. Fourth, Apollo requires less memory than any polishing algorithm when polishing the human genome assembly contig by contig. We conclude that Apollo is the only algorithm that scales well (i.e., memory requirements do not increase dramatically as the genome size increases) in polishing large genomes using a set of both PacBio and Illumina reads without reducing the coverage of the read set or splitting the read set or the alignment file into smaller batches. 
Pilon, Quiver, and Racon can polish a large genome assembly without reducing the coverage of a read set only if they polish the entire assembly contig by contig or split the read set into smaller batches before alignment.
We first examine whether the use of a hybrid set of reads (e.g., long and short reads) within a single polishing run provides benefit over polishing an assembly twice using a set of reads from only a single sequencing technology (e.g., only PacBio reads) in each run. Second, we evaluate assembly polishing algorithms and compare them to each other given different options with respect to 1) the sequencing technology that produces long reads, 2) the assembler that constructs an assembly using long reads, 3) the aligner that generates read-to-assembly alignment, and 4) the set of reads that align to an assembly. We report the accuracy of unpolished assemblies as well as the performance of assembly polishing algorithms based on the evaluation criteria we explained in Section 3. We also compare the tools based on their performance given moderate (e.g., ∼ × ) and low (e.g., . × ) long read coverage.
Apollo is either more accurate than or as accurate as running Pilon twice using a hybrid set of reads. Apollo also polishes Canu-generated assemblies more accurately for a species with PacBio reads than running other polishing tools multiple times.
In Table 3 (complete results in Supplementary Table S1) and Supplementary Table S2, we highlight the benefits of using a hybrid set of reads (e.g., PacBio + Illumina) within a single polishing run compared to polishing an assembly in multiple runs by using a set of reads from only a single sequencing technology (e.g., only PacBio or only Illumina) in each run. To this end, we compare the accuracy of polished assemblies using Apollo with that of the polished assemblies using other polishing tools (Nanopolish, Pilon, Quiver, and Racon) that we run multiple times. We use long (PacBio or ONT) and short (Illumina) reads from E. coli O157, E. coli O157:H7, E. coli K-12 MG1655, and Yeast S288C datasets to polish Canu- and Miniasm-generated assemblies. For the first run, we use the polishing algorithms to polish Canu- and Miniasm-generated assemblies. For the second run, we provide Nanopolish, Pilon, Quiver, and Racon with the polished assembly from the first run and run these tools for the second time (i.e., Second Run). Based on Supplementary Tables S1 and S2, we make three key observations. First, Apollo and Pilon are the only algorithms that always polish a Canu-generated assembly with a polishing score either equal to or better than that of the original Canu-generated assembly. Second, running other polishing tools multiple times to polish a Miniasm-generated assembly usually results in assemblies with higher polishing scores (e.g., by at most 3.79% for PacBio and 7.57% for ONT read sets) than using Apollo with a hybrid set of reads. Third, Apollo performs better when it uses PacBio reads in the hybrid set than when it uses ONT reads. We conclude that the use of Apollo once with a hybrid set of reads that includes PacBio reads and a Canu-generated assembly is the best pipeline (i.e., one can construct the most accurate assemblies for a species versus running other polishing tools multiple times).
Apollo performs better than Pilon and comparable to Racon and Quiver when polishing a Canu-generated assembly using only a high coverage set of PacBio or Illumina reads. In Supplementary Tables S3, S6, and S12, we use PacBio and Illumina datasets to compare the performance of Apollo with Racon (Vaser et al., 2017), Quiver (Chin et al., 2013), and Pilon (Walker et al., 2014). Based on these datasets, we make five observations. First, Apollo usually outperforms Pilon (i.e., 4 out of 7, see the Polishing Score column) using a set of short reads. Second, Apollo, Racon, and Quiver show significant improvements over the original Miniasm assembly in terms of accuracy. Third, Quiver and Racon polish the Miniasm-generated assembly more accurately than Apollo (see the Accuracy and the Polishing Score columns). Fourth, Apollo produces more accurate assemblies than the assemblies polished by Racon when we use moderate ( ∼ × ) and high coverage (151×) PacBio read sets to polish Canu-generated assemblies. However, both algorithms generate assemblies with lower accuracy than the accuracy of the original Canu-generated assembly ( . with the polishing score of . ) when we use high coverage read sets. Based on this observation, we suspect that the use of the original set of long reads (i.e., the set of reads that we use to construct an assembly) is not helpful as Canu corrects long reads before constructing an assembly. Thus, we also tried using the Canu-corrected long reads to polish a Canu-generated assembly. However, the use of corrected long reads did not consistently result in generating more accurate assemblies than the assemblies polished using the original set of long reads, as we report in Supplementary Tables S3 and S9. We find that the alignment of Canu-corrected long reads to an erroneous assembly generates a smaller number of alignments than the alignment of the original long reads to the same erroneous assembly, as we show in Supplementary Table S17. We believe that the decrease in the number of alignments results in loss of information that assembly polishing algorithms use to polish an assembly, which subsequently leads to either similar or worse assembly polishing accuracy than using the original set of long reads. Fifth, even though Pilon is not optimized to use long reads, we use Pilon to polish an assembly using long reads to observe if it polishes the assembly with comparable accuracy to the other polishing algorithms. We observe that Pilon significantly falls behind the other polishing algorithms in terms of our evaluation criteria. Thus, we do not use Pilon with long reads. We conclude that 1) Apollo usually performs better than Pilon when using short reads and 2) Apollo’s performance is comparable to Racon and Quiver when using long PacBio reads to polish an assembly. 
Apollo performs better than Pilon and Nanopolish when polishing a Miniasm-generated assembly using only a set of Illumina and ONT reads, respectively. We also investigate the performance of Apollo given the ONT dataset (E. coli K-12 MG1655), compared to Nanopolish and Racon. We make two key observations based on the results we show in Supplementary Table S9. First, Racon provides the best performance in terms of the accuracy of contigs when the coverage is high (319×) and the accuracy of the original assembly is low (e.g., a Miniasm-generated assembly). In the same setup, Apollo produces a more accurate assembly than Nanopolish. Second, even though Nanopolish produces the most accurate results with Canu using either high coverage (319×) or moderate coverage ( ∼ × ) data, Apollo’s polishing score differs only by at most ∼ . %. We conclude that Racon performs better than the competing state-of-the-art polishing algorithms if the coverage of a set of reads is high (e.g., 319×). Apollo outperforms Nanopolish when polishing a Miniasm-generated assembly but Nanopolish outperforms Racon and Apollo when polishing a Canu-generated assembly. Thus, we also conclude that the accuracy of the original assembly dramatically affects the overall performance of Nanopolish as there is a significant performance difference between polishing Miniasm and polishing Canu assemblies. We suspect that the default parameter settings of Apollo may be a better fit for PacBio reads than for ONT reads, which would explain why Apollo performs worse with ONT datasets compared to PacBio datasets.
Apollo is robust to different parameter choices. In Supplementary Tables S19 - S21, we use the E. coli O157 dataset to examine if Apollo is robust to using different parameter settings. To study the change in the performance of Apollo, we change the following parameters: maximum number of states that the Forward-Backward and the Viterbi algorithms evaluate for the next time step ( f ), number of insertion states per base pair ( i ), maximum deletion length per transition ( d ), transition probability to a match state ( tm ), and transition probability to an insertion state ( ti ). We conclude that Apollo’s performance is robust to different parameter choices as the accuracies of the Apollo-polished assemblies differ by at most 2%.
We report both 1) the k-mer distance (i.e., Jaccard similarity (Niwattanakul et al., 2013) or k-mer similarity) between filtered Illumina reads and assemblies, and 2) quality assessment based on mapping these
Table 2. Applicability, runtime, and memory requirements of four assembly polishing tools on a complete human genome assembly
Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Runtime | Memory (GB)
Minimap2 | PacBio (35×) | Apollo | 228h 43m 13s | 62.91
BWA-MEM | PacBio (35×) | Apollo | 200h 13m 06s | 58.60
Minimap2 | PacBio (35×) | Racon | N/A | N/A
BWA-MEM | PacBio (35×) | Racon | N/A | N/A
pbalign | PacBio (35×) | Quiver | N/A | N/A
Minimap2 | PacBio (8.9×) | Apollo | 56h 21m 56s | 44.99
BWA-MEM | PacBio (8.9×) | Apollo | 42h 19m 09s | 45.00
Minimap2 | PacBio (8.9×) | Racon | 3h 31m 37s | 54.13
BWA-MEM | PacBio (8.9×) | Racon | 2h 17m 21s | 51.55
pbalign | PacBio (8.9×) | Quiver | N/A | N/A
Minimap2 | Illumina (22×) | Apollo | 98h 07m 05s | 101.12
BWA-MEM | Illumina (22×) | Apollo | 105h 15m 05s | 107.06
Minimap2 | Illumina (22×) | Racon | N/A | N/A
BWA-MEM | Illumina (22×) | Racon | N/A | N/A
Minimap2 | Illumina (22×) | Pilon | N/A | N/A
Minimap2 | Illumina (22×) | Pilon | N/A | N/A
Minimap2 PacBio (35 × ) Apollo ∗ × ) Quiver ∗ × ) Racon ∗ 6h 48m 17s 132.51
Minimap2 Illumina (22 × ) Apollo ∗ × ) Apollo ∗ × ) Pilon ∗ 13h 59m 32s 66.67
BWA-MEM | Illumina (22×) | Pilon∗ | 21h 15m 57s | 49.93
We polished the assembly of the Ashkenazim trio sample (HG002, Son) for different combinations of sequencing technology, aligner, and polishing algorithm. We report the runtime and the memory requirements of the assembly polishing tools (i.e., Aligner+Polishing). We report Runtime and Memory as N/A if a polishing algorithm fails to polish the assembly. ∗ denotes that we polish the assembly contig by contig in these runs and collect the results once all of the contigs are polished separately.
filtered Illumina reads to assemblies to provide a reference-independent comparison between the polishing tools. We filter Illumina reads in three steps to get rid of erroneous short reads before using them. First, we remove the adapter sequences (i.e., adapter trimming). Second, we apply contaminant filtering for synthetic molecules. Third, we map the reads generated after the first two steps to the reference and filter out the reads that do not map to the reference. We use BBTools (sourceforge.net/projects/bbmap/) in these steps of filtering. To calculate k-mer similarity, we also use trim-low-abund (Zhang et al., 2015), which applies k-mer abundance trimming to remove k-mers with abundance lower than 10 for E. coli and Yeast datasets, and 3 for the human genome.
In k-mer similarity calculations, Jaccard similarity measures how similar the set of k-mers of the Illumina reads and the set of k-mers of an assembly are to each other. We compare the filtered Illumina reads with both polished and original (i.e., unpolished) assemblies of the small genomes (i.e., Yeast and E. coli) and the large genomes (i.e., human); the results are in Supplementary Tables S4, S7, S10, S13, and S15. We show the percentage of both the k-mers of Illumina reads present in the assembly and the k-mers of the assembly present in Illumina reads. The latter helps us to identify how accurate the assembly is whereas the former shows the completeness of the assembly.
Based on our experiments on small genomes, we make three key observations. First, the tool with the highest assembly accuracy, estimated with k-mer similarity (shown in Supplementary Tables S4, S7, S10, S13), typically provides the highest polishing score in its category (shown in Supplementary Tables S3, S6, S9, S12, respectively). Second, Quiver usually produces more accurate assemblies than the assemblies generated by other polishing tools. Third, all polishing algorithms we evaluate dramatically increase the accuracy of the unpolished assembly generated by Miniasm. We conclude that the k-mer similarity results correlate with our findings in Section 3.4 and support our claims regarding how polished assemblies compare with the ground truth.
Based on the k-mer similarity results between the Illumina reads and the human genome assemblies, we make five key observations. First, we observe a reduction in the accuracy when polishing algorithms use raw PacBio reads as the finished assembly was generated using corrected PacBio reads and already polished by Quiver. Second, the polishing algorithms produce more accurate assemblies than the finished assembly only when they use short reads to polish an assembly. This is because 1) Illumina reads are more accurate than raw PacBio reads and 2) Illumina reads have not been used when polishing the HG002 assembly, which leaves room to improve the accuracy. Third, Apollo performs better than Racon in terms of both the completeness and the accuracy of the polished assemblies and better than Quiver in terms of accuracy (based on 51-mer results). Fourth, Apollo performs better than Pilon when it polishes the assembly using short reads. Fifth, using a low coverage read set to polish a human genome assembly dramatically reduces both the completeness of
Table 3. Comparison between using a hybrid set of reads with Apollo and running other polishing tools twice to polish a Canu-generated assembly
Dataset | First Run | Second Run | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB)
E. coli O157 | — | — | 99.94 | 0.9998 | 0.9992 | 43m 53s | 3.79
E. coli O157 | Apollo (Hybrid) | — | 99.94 | 0.9999 | | 8h 16m 08s | 13.85
E. coli O157 | Racon (PacBio) | Racon (Illumina) | 99.94 | 0.9994 | 0.9988 | 21m 44s | 22.65
E. coli O157 | Pilon (Illumina) | Racon (PacBio) | 99.94 | 0.9986 | 0.9980 | 4m 58s |
E. coli O157 | Quiver (PacBio) | Pilon (Illumina) | 99.94 | 0.9998 | 0.9992 | 5m 01s |
E. coli O157:H7 | — | — | 100.00 | 0.9998 | 0.9998 | 43m 19s | 3.39
E. coli O157:H7 | Apollo (Hybrid) | — | 100.00 | 0.9999 | | 5h 58m 05s | 8.86
E. coli O157:H7 | Racon (PacBio) | Racon (Illumina) | 100.00 | 0.9995 | 0.9995 | 9m 43s |
E. coli O157:H7 | Pilon (Illumina) | Racon (PacBio) | 100.00 | 0.9996 | 0.9996 | 6m 04s |
E. coli K-12 | — | — | 99.98 | 0.9794 | 0.9792 | 34h 21m 46s | 5.06
E. coli K-12 | Apollo (Hybrid) | — | 99.99 | 0.9953 | 0.9952 | 9h 09m 50s | 9.35
E. coli K-12 | Racon (ONT) | Racon (Illumina) | 100.00 | 0.9996 | | |
E. coli K-12 | Pilon (Illumina) | Racon (ONT) | 99.99 | 0.9997 | | 15m 51s | 8.84
E. coli K-12 | Nanopolish (ONT) | Pilon (Illumina) | 99.99 | 0.9992 | 0.9991 | 9h 45m 01s | 18.10
Yeast S288C | — | — | 99.89 | 0.9998 | 0.9987 | 1h 20m 39s | 6.24
Yeast S288C | Apollo (Hybrid) | — | 99.89 | 0.9998 | | 11h 08m 41s |
Yeast S288C | Racon (PacBio) | Racon (Illumina) | 99.89 | 0.9994 | 0.9983 | 38m 21s | 6.93
Yeast S288C | Pilon (Illumina) | Racon (PacBio) | 99.89 | 0.9960 | 0.9949 | 21m 42s | 11.85
Yeast S288C | Quiver (PacBio) | Pilon (Illumina) | 98.95 | 0.9998 | 0.9893 | 12m 47s |
We use the long reads of E. coli O157, E. coli O157:H7, E. coli K-12, and Yeast S288C datasets that are sequenced from PacBio and ONT (151×, 112×, 319×, and 140× coverage, respectively) to generate their assemblies with Canu. Here, the polishing tools specified under First Run and Second Run polish the assembly using the set of reads specified in parentheses. The set of reads used in the second run is aligned to the assembly polished in the first run using Minimap2. PacBio and Illumina set of reads together constitute the hybrid set of reads (i.e., Hybrid). We report the performance of the polishing tools in terms of the percentage of bases of an assembly that aligns to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and the Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. We show the best result among assembly polishing algorithms for each performance metric in bold text.
the assembly and the accuracy of the assembly. We conclude that 1) Apollo outperforms Pilon on Illumina data, and 2) it is not advisable to use raw PacBio reads to polish the large genome assemblies that have already been polished using more accurate reads than the raw PacBio reads (e.g., corrected PacBio reads).
We use QUAST (Gurevich et al., 2013), a quality assessment tool for genome assemblies, to provide a different reference-independent assessment of the assemblies. QUAST takes paired-end filtered Illumina reads to generate several metrics such as 1) the percentage of mapped reads, 2) the percentage of properly paired reads, 3) the average depth of coverage, and 4) the percentage of bases with at least 10× coverage. It also calculates the GC content (i.e., the ratio of bases that are either G or C) of the assembly. Based on the quality assessment results that we show in Supplementary Tables S5, S8, S11, S14, and S16, we make two key observations. First, for human genome assemblies, Apollo performs better than Racon and comparable to Pilon in terms of the percentage of the mapped reads, properly paired reads, and the bases with at least 10× read coverage. Second, for small genomes (i.e., Yeast and E. coli), Quiver usually performs best in all of the metrics. We conclude that Apollo provides better performance when polishing large genomes than Racon, and Quiver usually performs better than any other polishing algorithm for small genomes.
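The reference-independent quantities discussed above (the k-mer overlap between reads and an assembly, and the GC content) can be illustrated with exact set arithmetic in Python. This sketch is only meant to show the definitions: the function names are hypothetical, it does not call sourmash or QUAST, and as simplifying assumptions it ignores reverse complements and compares complete k-mer sets rather than any compressed representation.

```python
from typing import Dict, Iterable, Set

def kmers(sequence: str, k: int) -> Set[str]:
    """All distinct substrings of length k (empty if the sequence is shorter than k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def kmer_overlap(reads: Iterable[str], assembly: str, k: int) -> Dict[str, float]:
    """Jaccard similarity and the two containment fractions between k-mer sets."""
    read_kmers: Set[str] = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    assembly_kmers = kmers(assembly, k)
    shared = read_kmers & assembly_kmers
    union = read_kmers | assembly_kmers
    return {
        # |A ∩ B| / |A ∪ B|
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Fraction of read k-mers found in the assembly (completeness).
        "reads_in_assembly": len(shared) / len(read_kmers) if read_kmers else 0.0,
        # Fraction of assembly k-mers found in the reads (accuracy).
        "assembly_in_reads": len(shared) / len(assembly_kmers) if assembly_kmers else 0.0,
    }

def gc_content(sequence: str) -> float:
    """Ratio of bases that are either G or C."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)

# Toy example with 3-mers: the sets share ACG and CGT out of four distinct k-mers.
print(kmer_overlap(["ACGT", "CGTA"], "ACGTT", k=3)["jaccard"])  # 0.5
print(gc_content("ACGTT"))  # 0.4
```

The two containment fractions correspond to the two percentages reported for the k-mer comparison: read k-mers present in the assembly indicate completeness, and assembly k-mers present in the reads indicate accuracy.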
We report the runtimes and the maximum memory requirements of both assemblers and assembly polishing algorithms in Supplementary Tables S1, S2, S3, S6, S9, and S12. Based on the runtimes of only assembly polishing algorithms (i.e., Apollo, Nanopolish, Pilon, Quiver, and Racon), we make three observations. First, the machine learning-based assembly polishing tools, Apollo and Nanopolish, are the most time-consuming algorithms due to their computationally expensive calculations. For example, Racon is ∼ × and ∼ × faster than Apollo when polishing Miniasm-generated assemblies using PacBio and ONT read sets, respectively. Second, Racon becomes more memory-bound as the overall number of long reads in a read set increases (shown in Table 2). This shows that Racon's memory requirements are directly proportional to the size of the read set (i.e., the overall number of base pairs in a read set). Third, Quiver always requires the least amount of memory for E. coli and Yeast genomes compared to the competing algorithms.

In Supplementary Tables S1 and S2, we evaluate the overall runtime and memory requirements of 1) polishing an assembly within a single run by using a hybrid set of reads with Apollo and 2) polishing an assembly multiple times. We observe that the overall runtime of running polishing tools multiple times is still at least an order of magnitude lower than that of running Apollo once with a hybrid set of reads. However, Apollo can provide a more accurate assembly for a species when a Canu-generated assembly is polished, as discussed in Section 3.4.

We report the runtimes, maximum memory requirements, and the parameters of the aligners we evaluated in Supplementary Tables S17 and S22, respectively, to observe how the aligner affects the overall runtime of both the aligner and the assembly polishing tool. Based on the runtimes of aligners, we make two observations. First, pbalign is the most time-consuming and memory-demanding alignment tool. Overall, this makes Quiver require more time and memory than Racon, since Quiver can only work with BLASR, a part of the pbalign tool. Second, all evaluated polishing tools except Quiver allow using any aligner; therefore, we only compare the runtime of the polishing tools, rather than comparing the runtime of the full pipeline (i.e., aligner plus polishing tool), for the non-human genome datasets. We conclude that Quiver is the only algorithm whose runtime must be considered in conjunction with the aligner, as it can only use one aligner, pbalign, which we show in Table 2.
We show that there is a dramatic difference between non-machine learning-based algorithms and the machine learning-based ones in terms of runtime. Apollo and Nanopolish usually require several hours to complete the polishing. Racon, Quiver, and Pilon usually require less than an hour (Supplementary Tables S1, S2, S3, S6, S9, and S12), which may suggest that Racon and Pilon can use a hybrid set of reads to polish an assembly in multiple runs instead of using Apollo in a single run. Indeed, we confirm that running Racon, Pilon, or Quiver multiple times still takes a much shorter time than running Apollo once using a hybrid set of reads within a single run. However, assembly polishing is a one-time task performed for an assembly that is usually used many times and even made publicly available to the community. Therefore, we believe that long runtimes could still be acceptable given that genomic data produced by Apollo will probably be used many times after it is generated. Hence, Apollo's runtime cost is paid only once but benefits are reaped many times. Note that this observation is not restricted to Apollo and applies to any polishing tool that has a long runtime. In addition, it is possible to accelerate the calculation of the Forward-Backward algorithm and the Viterbi algorithm using Tensor cores, SIMD, and GPUs (Murakami, 2017; Eddy, 2011; Liu, 2009; Yu et al., 2014), which we leave to future work.

Despite these slower runtimes of Apollo compared to other polishing tools, Apollo is new, unique, and useful because it provides two major functionalities that are not possible with prior tools. First, Apollo is the only algorithm that scales well to polish a large genome assembly using a set of reads with moderate coverage (e.g., up to ∼ ×). Therefore, it is possible to polish a large genome with a relatively small amount of memory (i.e., less than 110GB) only with Apollo. Second, Apollo can construct more reliable Canu-generated assemblies compared to running other polishing tools multiple times when both PacBio and Illumina reads are used (i.e., a hybrid set of reads). These two advantages are only possible if Apollo is used for assembly polishing.

In this paper, we present a universal, sequencing-technology-independent assembly polishing algorithm, Apollo. Apollo uses all available reads to polish an assembly and removes the dependency of the polishing tool on sequencing technology. Apollo is the first polishing algorithm that scales well to use any arbitrary hybrid set of reads within a single run to polish both large and small genomes. Apollo also removes the requirement of using assembly polishing algorithms multiple times to polish an assembly as it allows using a hybrid set of reads.

We show three key results. First, three state-of-the-art polishing algorithms, Quiver, Racon, and Pilon, cannot scale well to polish large genome assemblies without splitting the assembly into its contigs or read sets into smaller batches, whereas Apollo scales well to polish large genomes. Second, using a hybrid set of reads with Apollo usually results in Canu-generated assemblies that are more accurate than those generated when running other polishing tools multiple times. Third, Apollo usually polishes assemblies with comparable accuracy to state-of-the-art assembly polishing algorithms, with a few exceptions that occur when long reads are used to polish Miniasm-generated assemblies. We conclude that Apollo is the first universal, sequencing-technology-independent assembly polishing algorithm that can use a hybrid set of reads within a single run to polish both large and small assemblies, while achieving high accuracy.
Funding
This work was supported by gifts from Intel [to O.M.]; VMware [to O.M.]; and TÜBİTAK [TÜBİTAK-1001-215E172 to C.A.].
References
Alkan, C., Sajjadian, S., and Eichler, E. E. (2011). Limitations of next-generation genome sequence assembly. Nature Methods, (1), 61–65.
Alser, M., Hassan, H., Xin, H., Ergin, O., Mutlu, O., and Alkan, C. (2017). GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics, (21), 3355–3363.
Alser, M., Hassan, H., Kumar, A., Mutlu, O., and Alkan, C. (2019a). Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics, (21), 4255–4263.
Alser, M., Shahroodi, T., Gomez-Luna, J., Alkan, C., and Mutlu, O. (2019b). SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs.
Au, K. F., Underwood, J. G., Lee, L., and Wong, W. H. (2012). Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS One, (10), e46679.
Baum, L. E. (1972). An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 1–8.
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M., and Phillippy, A. M. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, (6), 623–630.
Chaisson, M., Pevzner, P., and Tang, H. (2004). Fragment assembly with short reads. Bioinformatics, (13), 2067–2074.
Chaisson, M. J. and Tesler, G. (2012). Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics, (1), 238.
Chaisson, M. J. P., Wilson, R. K., and Eichler, E. E. (2015). Genetic variation and the de novo assembly of human genomes. Nature Reviews Genetics, (11), 627–640.
Chin, C.-S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E., Turner, S. W., and Korlach, J. (2013). Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods, (6), 563–569.
Döring, A., Weese, D., Rausch, T., and Reinert, K. (2008). SeqAn An efficient, generic C++ library for sequence analysis. BMC Bioinformatics, (1), 11.
Eddy, S. R. (2011). Accelerated Profile HMM Searches. PLoS Computational Biology, (10), e1002195.
Firtina, C. and Alkan, C. (2016). On genomic repeats and reproducibility. Bioinformatics, (15), 2243–2247.
Firtina, C., Bar-Joseph, Z., Alkan, C., and Cicek, A. E. (2018). Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Research, (21), e125–e125.
Glenn, T. C. (2011). Field guide to next-generation DNA sequencers. Molecular Ecology Resources, (5), 759–769.
Gurevich, A., Saveliev, V., Vyahhi, N., and Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, (8), 1072–1075.
Huddleston, J., Ranade, S., Malig, M., Antonacci, F., Chaisson, M., Hon, L., Sudmant, P. H., Graves, T. A., Alkan, C., Dennis, M. Y., Wilson, R. K., Turner, S. W., Korlach, J., and Eichler, E. E. (2014). Reconstructing complex regions of genomes using long-read sequencing technology. Genome Research, (4), 688–696.
Jain, M., Koren, S., Miga, K. H., Quick, J., Rand, A. C., Sasani, T. A., Tyson, J. R., Beggs, A. D., Dilthey, A. T., Fiddes, I. T., Malla, S., Marriott, H., Nieto, T., O'Grady, J., Olsen, H. E., Pedersen, B. S., Rhie, A., Richardson, H., Quinlan, A. R., Snutch, T. P., Tee, L., Paten, B., Phillippy, A. M., Simpson, J. T., Loman, N. J., and Loose, M. (2018). Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology, (4), 338–345.
Kim, J. S., Senol Cali, D., Xin, H., Lee, D., Ghose, S., Alser, M., Hassan, H., Ergin, O., Alkan, C., and Mutlu, O. (2018). GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics, (S2), 89.
Koren, S., Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. (2012). Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotechnology, (7), 693–700.
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H., and Phillippy, A. M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, (5), 722–736.
Kurtz, S., Phillippy, A., Delcher, A. L., Smoot, M., Shumway, M., Antonescu, C., and Salzberg, S. L. (2004). Versatile and open software for comparing large genomes. Genome Biology, (2), R12.
Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, (14), 2103–2110.
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, (18), 3094–3100.
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, (14), 1754–1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, (16), 2078–2079.
Liu, C. (2009). cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification. The Chronicle of Higher Education, pages 1–13.
Loman, N. J., Quick, J., and Simpson, J. T. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods, (8), 733–735.
Meltz Steinberg, K., Schneider, V. A., Alkan, C., Montague, M. J., Warren, W. C., Church, D. M., and Wilson, R. K. (2017). Building and Improving Reference Genome Assemblies. Proceedings of the IEEE, (3), 1–14.
Murakami, T. (2017). Expectation-Maximization Tensor Factorization for Practical Location Privacy Attacks. Proceedings on Privacy Enhancing Technologies, (4), 138–155.
Niwattanakul, S., Singthongchai, J., Naenudorn, E., and Wanapu, S. (2013). Using of Jaccard Coefficient for Keywords Similarity. In Proceedings of The International MultiConference of Engineers and Computer Scientists, volume 1, pages 380–384.
Payne, A., Holmes, N., Rakyan, V., and Loose, M. (2018). BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, (8), 2444–2448.
Rhoads, A. and Au, K. F. (2015). PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics, (5), 278–289.
Salmela, L. and Rivals, E. (2014). LoRDEC: accurate and efficient long read error correction. Bioinformatics, (24), 3506–3514.
Salmela, L., Walve, R., Rivals, E., and Ukkonen, E. (2016). Accurate self-correction of errors in long reads using de Bruijn graphs. Bioinformatics, (6), 799–806.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, (12), 5463–5467.
Senol Cali, D., Kim, J. S., Ghose, S., Alkan, C., and Mutlu, O. (2019). Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Briefings in Bioinformatics, (4), 1542–1559.
Titus Brown, C. and Irber, L. (2016). sourmash: a library for MinHash sketching of DNA. The Journal of Open Source Software, (5), 27.
Vaser, R., Sović, I., Nagarajan, N., and Šikić, M. (2017). Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research, (5), 737–746.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, (2), 260–269.
Walker, B. J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., Cuomo, C. A., Zeng, Q., Wortman, J., Young, S. K., and Earl, A. M. (2014). Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS One, (11), e112963.
Weirather, J. L., de Cesare, M., Wang, Y., Piazza, P., Sebastiano, V., Wang, X.-J., Buck, D., and Au, K. F. (2017). Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Research, (100), 100.
Wenger, A. M., Peluso, P., Rowell, W. J., Chang, P.-C., Hall, R. J., Concepcion, G. T., Ebler, J., Fungtammasan, A., Kolesnikov, A., Olson, N. D., Töpfer, A., Alonge, M., Mahmoud, M., Qian, Y., Chin, C.-S., Phillippy, A. M., Schatz, M. C., Myers, G., DePristo, M. A., Ruan, J., Marschall, T., Sedlazeck, F. J., Zook, J. M., Li, H., Koren, S., Carroll, A., Rank, D. R., and Hunkapiller, M. W. (2019). Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nature Biotechnology, (10), 1155–1162.
Xin, H., Lee, D., Hormozdiari, F., Yedkar, S., Mutlu, O., and Alkan, C. (2013). Accelerating read mapping with FastHASH. BMC Genomics, (1), S13.
Yu, L., Ukidave, Y., and Kaeli, D. (2014). GPU-Accelerated HMM for Speech Recognition. In , pages 395–402. IEEE.
Zhang, Q., Awad, S., and Brown, C. T. (2015). Crossing the streams: a framework for streaming analysis of short DNA sequencing reads. PeerJ PrePrints, e890v1.

Supplementary Material for
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
Can Firtina, Jeremie S. Kim, Mohammed Alser, Damla Senol Cali, A. Ercument Cicek, Can Alkan, and Onur Mutlu
Apollo constructs a profile hidden Markov model graph (pHMM-graph) to represent the sequences of a contig as well as the errors that a contig may have. A pHMM-graph includes states and directed transitions from one state to another. The graph contains two types of probabilities: (1) emission and (2) transition probabilities. First, each state has emission probabilities for emitting certain characters, where each character is associated with a probability value in the range $[0, 1]$. Each emission probability reveals how likely it is to emit (e.g., consume or output) a certain character when a certain state is visited. Second, each transition is associated with a probability value in the range $[0, 1]$. A transition probability shows the probability of visiting a state from a certain state. Thus, one can calculate the likelihood of emitting all the characters in a given sequence by traversing a certain path in the graph.

The structure of the pHMM-graph allows us to handle insertion, deletion, and substitution errors by following certain states and transitions. Now, we explain the structure of the graph in detail. For an assembly contig $C$, let us define the pHMM-graph that represents the contig $C$ as $G(V, E)$. Let us also define the length of the contig $C$ as $n = |C|$. A base $C[t]$ has one of the letters in the alphabet set $\Sigma = \{A, C, G, T\}$. Thus, a state emits one of the characters in $\Sigma$ with a certain probability. For a state $i$, we denote the emission probability of a base $c \in \Sigma$ as $e_i(c) \in [0, 1]$, where $\sum_{c \in \Sigma} e_i(c) = 1$. We denote the transition probability from a state $i$ to another state $j$ as $\alpha_{ij} \in [0, 1]$. For the set of states that the state $i$ has an outgoing transition to, $V_i$, we have $\sum_{j \in V_i} \alpha_{ij} = 1$. Now let us define in four steps how Apollo constructs the states and the transitions of the graph $G(V, E)$:

First, Apollo constructs a start state, $v_{start} \in V$, and an end state, $v_{end} \in V$.

Second, for each base $C[t]$ where $1 \le t \le n$, Apollo constructs a match state as follows (Figure S1):
• A match state that we denote as $M_t$ for the base $C[t]$, where $M = C[t]$ s.t. $C[t] \in \Sigma$ and $M_t \in V$ (i.e., if the $t$-th base of the contig $C$ is G, then the corresponding match state is $G_t$). For the following steps, let us assume $i = M_t$.
• A match emission with the probability $\beta$ for the base $C[t]$ s.t. $e_i(C[t]) = \beta$. $\beta$ is a parameter to Apollo.
• A substitution emission with the probability $\delta$ for each base $c \in \Sigma$ and $c \ne C[t]$ s.t. $e_i(c) = \delta$ (note that $\beta + 3\delta = 1$). $\delta$ is a parameter to Apollo.
• A match transition with the probability $\alpha_M$ from the match state $M_t = i$ to the next match state $M_{t+1} = j$ s.t. $\alpha_{ij} = \alpha_M$. $\alpha_M$ is a parameter to Apollo.

Figure S1: Two match states. Here, the contig includes the bases G and A at the locations $t$ and $t+1$, respectively. The corresponding match states are labeled with the bases that they correspond to (i.e., the match state $G_t$ represents the base G at the location $t$). Each match state has a match transition to the next match state with the initial probability $\alpha_M$. A match state has a match emission probability, $\beta$, for the base it is labeled with. The remaining three bases have equal substitution emission probability $\delta$. The figure is taken from Hercules [1].

Third, for each base $C[t]$ where $1 \le t \le n$, Apollo constructs the insertion states as follows (Figure S2):
• There are $l$ many insertion states, $I_t^1, I_t^2, \ldots, I_t^l$, where $I_t^i \in V$, $1 \le i \le l$, and $l$ is a parameter to Apollo.
• The match state, $M_t = i$, has an insertion transition to $I_t^1 = j$ with the probability $\alpha_I$ s.t. $\alpha_{ij} = \alpha_I$.
• For each $i$ where $1 \le i < l$, the insertion state $I_t^i = k$ has an insertion transition to the next insertion state $I_t^{i+1} = j$ with the probability $\alpha_I$ s.t. $\alpha_{kj} = \alpha_I$.
• For each $i$ where $1 \le i < l$, the insertion state $I_t^i = k$ has a match transition to the match state of the next base $M_{t+1} = j$ with the probability $\alpha_M$ s.t. $\alpha_{kj} = \alpha_M$.
• The last insertion state, $I_t^l = k$, has no further insertion transitions. Instead, it has a transition to the match state of the next base $M_{t+1} = j$ with the probability $\alpha_M + \alpha_I$ s.t. $\alpha_{kj} = \alpha_M + \alpha_I$.
• For each $i$ where $1 \le i \le l$, each base $c \in \Sigma$ and $c \ne C[t+1]$ has an insertion emission probability $1/3 \approx 0.33$ for the insertion state $I_t^i = k$ s.t. $e_k(c) = 0.33$ and $e_k(C[t+1]) = 0$. Note that $\sum_{c \in \Sigma} e_k(c) = 1$ (i.e., if the base at the location $t+1$ is T, then $e_k(A) = 0.33$, $e_k(T) = 0$, $e_k(G) = 0.33$, and $e_k(C) = 0.33$).

Figure S2: $l$ many insertion states for the base at location $t$. Here, the contig includes the bases C and T at the locations $t$ and $t+1$, respectively. The corresponding match states are labeled with the bases that they correspond to. Each insertion state has an insertion transition to the next insertion state with the initial probability $\alpha_I$ and a match transition to the next match state at the location $t+1$ with the initial probability $\alpha_M$. However, the last insertion state, $I_t^l$, does not have a transition to the next insertion state as it is the last one. Instead, it has a match transition to the next match state $T_{t+1}$ with the probability $\alpha_M + \alpha_I$. The emission probability of the base T is 0 as it appears in the next position ($t+1$) of the contig. The figure is taken from Hercules [1].

Fourth, to finalize the complete structure of the pHMM-graph, for each state $i \in V$, Apollo constructs the deletion transitions as follows (Figure S3):
• Let us define $\alpha_{del} = 1 - (\alpha_M + \alpha_I)$, which is the overall deletion transition probability, so that the outgoing transition probabilities of a state sum to 1.
• There are $k$ many deletion transitions from the state $i$ to the further match states. $k$ is a parameter to Apollo.
• We assume that a transition deletes the bases if it skips the corresponding match states of the bases. We denote the transition probability of a deletion transition as $\alpha_D^x$ s.t. $1 \le x \le k$, if it deletes $x$ many bases in a row in one transition. Apollo calculates the deletion transition probability $\alpha_D^x$ using the normalized version of a polynomial distribution, where $f \in [0, \infty)$ is a factor value for the equation:

$$\alpha_D^x = \frac{f^{\,k-x}\, \alpha_{del}}{\sum_{j=0}^{k-1} f^{\,j}}, \qquad 1 \le x \le k \tag{S1}$$

• If the $f$ value is set to 1, then each deletion transition is equally likely (i.e., $\alpha_D^1 = \alpha_D^2$, if $k \ge 2$). As the $f$ value increases, the probability of deleting more bases in one transition decreases accordingly (i.e., $\alpha_D^1 \gg \alpha_D^2$, if $k \ge 2$). $f$ is a parameter to Apollo.

Figure S3: Deletion transitions of the match and each insertion states at location $t$. For the match and insertion states at location $t$, we show only the deletion transitions (red). Note that a deletion transition from the position $t$ to the match state of the position $t+x+1$ removes $x$ many bases with the probability $\alpha_D^x$ as it skips $x$ many match states where $1 \le x \le k$. The figure is taken from Hercules [1].

We note that the start state $v_{start}$ also has a match transition to $M_1$ and deletion transitions as defined previously. There are also $l$ many insertion states, $I_0^1, I_0^2, \ldots, I_0^l$, between the start state and the first match state $M_1$. The transitions of these insertion states are also identical to what we described before. We would also like to note that the end state $v_{end}$ has no outgoing transition. The prior states consider $v_{end}$ as a match state and connect to it accordingly. The start and end states have no emission probabilities.

Note that the design of the pHMM-graph described here and proposed in Hercules [1] is different from the conventional pHMM-graphs [2]. One significant difference is that the conventional pHMM-graphs have deletion states for each match state, whereas the pHMM-graph model of Apollo uses deletion transitions instead of states. In the conventional model, visiting deletion states does not consume (i.e., emit) a character from a given sequence (i.e., observation). Therefore, this requires storing extra "position" information that tells which character should be consumed given a state at iteration $i$ (i.e., in each transition from a state to another). We want to make sure that each state consumes only one character (and no more) when visited to prevent storing the extra position information. In Apollo's design, the iteration number $i$ equals the position of the character that is being consumed, because each of Apollo's states consumes exactly one character. This allows us to remove an entire dimension, the iteration number $i$, which greatly helps us to reduce both memory requirements and runtime while calculating the Forward-Backward values.

Apollo uses the region of a pHMM-graph (i.e., sub-graph) that a read (i.e., observation or a sequence) is aligned to in order to calculate the likelihood of each state emitting a certain base at position $t$ in the aligned read. However, this does not mean that position $t$ is known, since we need to consider the fact that an unknown number of insertion and deletion errors may have occurred when $k$ number of transitions is followed from the start state to a certain state. Therefore, states should be measuring the likelihood of emitting a character at position $t$, where $t$ is a number in the range $[1 \ldots k]$ and $k$ is the number of transitions that were taken so far. In the no-error case, we have $k = t$. Apollo uses reads as observations for the Forward-Backward algorithm [3] in order to calculate the likelihoods per state. These likelihoods are calculated based on the initial transition and emission probabilities of a pHMM-graph and the read itself. Apollo uses these likelihoods to make the contig similar to the aligned read.
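To make the graph parameters of the construction steps concrete, the following is a minimal sketch of the initial match emissions, insertion emissions, and the deletion transition probabilities of Equation S1. The function names and the example parameter values are hypothetical and chosen only for illustration; this is not Apollo's implementation. The sketch assumes $\alpha_{del} = 1 - (\alpha_M + \alpha_I)$, so that the outgoing transition probabilities of a state sum to 1.

```python
from typing import Dict, List

ALPHABET = "ACGT"


def match_emissions(base: str, beta: float) -> Dict[str, float]:
    """Emissions of the match state for `base`: beta for the base itself and
    delta = (1 - beta) / 3 for each of the other three bases (beta + 3*delta = 1)."""
    delta = (1.0 - beta) / 3.0
    return {c: (beta if c == base else delta) for c in ALPHABET}


def insertion_emissions(next_base: str) -> Dict[str, float]:
    """Emissions of an insertion state: 0 for the base at the next contig
    position and 1/3 for each of the remaining three bases."""
    return {c: (0.0 if c == next_base else 1.0 / 3.0) for c in ALPHABET}


def deletion_transitions(alpha_m: float, alpha_i: float, k: int, f: float) -> List[float]:
    """Deletion transition probabilities following Equation S1.

    Element x-1 of the returned list is the probability of deleting x bases
    in one transition, for 1 <= x <= k. The values sum to alpha_del.
    """
    alpha_del = 1.0 - (alpha_m + alpha_i)      # assumed overall deletion probability
    norm = sum(f ** j for j in range(k))       # sum_{j=0}^{k-1} f^j
    return [(f ** (k - x)) * alpha_del / norm for x in range(1, k + 1)]
```

With `f = 1.0` every deletion length receives the same probability, and with `f > 1` shorter deletions receive more probability mass, which matches the behavior described for the factor $f$.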
Apollo then trains the pHMM-graph of a contig with each read that aligns to the contig using the Baum-Welch algorithm [3]. We describe the details of both the Forward-Backward and the Baum-Welch algorithms in the following paragraphs.

For each read aligning to a contig, Apollo uses the alignment location and the sequence of the read in order to train the pHMM-graph. First, for each aligned read sequence $r$, Apollo extracts the sub-graph $G_s(V_s, E_s)$ that corresponds to the aligned region of the contig, where we have $v_{start}$, $v_{end}$, match and insertion states, and the transitions as described in Supplementary Section 1. Each transition from state $i \in V_s$ to state $j \in V_s$, $E_{ij} \in E_s$, is associated with a transition probability $\alpha_{ij}$. For every pair of states, $i \in V_s$ and $j \in V_s$, the transition probability $\alpha_{ij} = 0$ if $E_{ij} \notin E_s$. Let us define the length of the aligned read, $r$, as $m = |r|$. Second, Apollo calculates the forward and backward probabilities of each state based on the aligned read, $r$.

Let us assume that the forward probability of a state $j$ that observes the $t$-th base of the aligned read, $r[t]$, is $F_t(j)$. For the forward probability, observing the $t$-th base at the state $j$ means that all the previous bases ($r[1] \ldots r[t-1]$ and $1 < t \le m$) have been observed by following a path starting from the start state to the state $j$, and $j$ observes the next base, $r[t]$. All possible transitions that lead to state $j$ to observe the base $r[t]$ contribute to the probability with (1) the forward probability of the origin state $i$ calculated with the $(t-1)$-th base of $r$, $F_{t-1}(i)$, (2) multiplied by the probability of the transition from $i$ to $j$, $\alpha_{ij}$, (3) multiplied by the probability of emitting the base $r[t]$ at state $j$, $e_j(r[t])$.

Let us denote the start state $v_{start}$ with the index value of 0 (i.e., $v_{start} = 0$). For each state $j \in V_s$, we calculate the forward probability, $F_t(j)$, as follows, where $F_1(j)$ is the initialization step:

$$F_1(j) = \alpha_{0j}\, e_j(r[1]) \qquad \text{s.t. } j \in V_s,\ E_{0j} \in E_s \tag{S2.1}$$

$$F_t(j) = \sum_{i \in V_s} F_{t-1}(i)\, \alpha_{ij}\, e_j(r[t]) \qquad j \in V_s,\ 1 < t \le m \tag{S2.2}$$

Let us assume that the backward probability of a state $i$ that observes the $t$-th base of the aligned read, $r[t]$, is $B_t(i)$. For the backward probability, observing the $t$-th base at the state $i$ means that all the further bases ($r[t+1] \ldots r[m]$ and $1 \le t < m$) have been observed by following a path starting from the end state to the state $i$ (backwards), and $i$ observes the previous base, $r[t]$. All possible transitions that lead to state $i$ to observe the base $r[t]$ contribute to the probability with (1) the backward probability of the next state $j$ calculated with the $(t+1)$-th base of $r$, $B_{t+1}(j)$, (2) multiplied by the probability of the transition from $i$ to $j$, $\alpha_{ij}$, (3) multiplied by the probability of emitting the base $r[t+1]$ at state $j$, $e_j(r[t+1])$.

Let us denote the end state $v_{end}$ with the index value of $m+1$ (i.e., $v_{end} = m+1$). For each state $i \in V_s$, we calculate the backward probability, $B_t(i)$, as follows, where $B_m(i)$ is the initialization step:

$$B_m(i) = \alpha_{i(m+1)} \qquad i \in V_s,\ E_{i(m+1)} \in E_s \tag{S3.1}$$

$$B_t(i) = \sum_{j \in V_s} \alpha_{ij}\, e_j(r[t+1])\, B_{t+1}(j) \qquad i \in V_s,\ 1 \le t < m \tag{S3.2}$$

The calculations of the forward and backward probabilities are referred to as the Forward-Backward algorithm. After the calculation of the forward and backward probabilities, Apollo uses the Baum-Welch algorithm to train the pHMM-graph by calculating the posterior emission and transition probabilities of the sub-graph, $G_s$, as shown in Equations S4 and S5, respectively. In Equation S4, we use Iverson brackets [4] to denote that $[r[t] = X]$ is 1 if the $t$-th character of $r$ is the same character as $X$. Otherwise, $[r[t] = X]$ is 0.
This structure helps us to perform the summation in the numerator only when the character at a position equals the character given in the function $e_i^*(X)$ (i.e., $X$). We then normalize this summation to make sure that the emission probabilities of state $i$ sum to 1.

$$e_i^*(X) = \frac{\sum_{t=1}^{m} F_t(i)\, B_t(i)\, [r[t] = X]}{\sum_{t=1}^{m} F_t(i)\, B_t(i)} \qquad \forall X \in \{A, C, G, T\},\ \forall i \in V_s \tag{S4}$$

$$\alpha_{ij}^* = \frac{\sum_{t=1}^{m-1} \alpha_{ij}\, e_j(r[t+1])\, F_t(i)\, B_{t+1}(j)}{\sum_{t=1}^{m-1} \sum_{x \in V_s} \alpha_{ix}\, e_x(r[t+1])\, F_t(i)\, B_{t+1}(x)} \qquad \forall E_{ij} \in E_s \tag{S5}$$

As we explain in Supplementary Section 2, for each read that aligns to the contig, Apollo extracts a sub-graph $G_s$ and uses the Forward-Backward algorithm to train the sub-graph. When using high-coverage reads, it is highly likely that two or more sub-graphs overlap, such that the sub-graphs include the same states and transitions. However, the updates on the overlapping states and transitions are exclusive between the sub-graphs, such that no two updates in separate sub-graphs affect each other while calculating the forward or the backward probabilities. Each sub-graph uses the initial probabilities to calculate the posterior probabilities. In order to handle the training of the overlapping states and transitions, Apollo takes the average of the posterior probabilities and reports the average probability as the final posterior probability for the entire pHMM-graph.

Let us assume that the set of sub-graphs $S$ includes the same state $i \in V$. For each $G_s$ in $S$, we obtain $e_i^*(X)$ for all $X \in \Sigma$, which denotes the posterior emission probability as we explain in Supplementary Section 2. We denote the $e_i^*(X)$ that belongs to $G_s$ as $e_i^{*,G_s}(X)$. Then, Apollo finds the final emission value $\hat{e}_i(X)$ as follows:

$$\hat{e}_i(X) = \frac{\sum_{G_s \in S} e_i^{*,G_s}(X)}{|S|} \qquad \forall X \in \Sigma \tag{S6}$$

Similarly, let us assume that the set of sub-graphs $S$ includes the same transition edge $E_{ij} \in E$. For each $G_s$ in $S$, we obtain an $\alpha_{ij}^*$ that denotes the posterior transition value. We define the $\alpha_{ij}^*$ that belongs to $G_s$ as $\alpha_{ij}^{*,G_s}$. Apollo finds the final transition value $\hat{\alpha}_{ij}$ as follows:

$$\hat{\alpha}_{ij} = \frac{\sum_{G_s \in S} \alpha_{ij}^{*,G_s}}{|S|} \tag{S7}$$

If a state in $V$ or an edge in $E$ is not covered by a read, then Apollo retains the initial emission and transition probabilities and uses them as the posterior probabilities, respectively.

We would like to note that the Baum-Welch algorithm is also used to train conventional hidden Markov models (HMMs). In each observation, the Baum-Welch algorithm updates the transition and emission probabilities of an HMM accordingly. The initial probabilities of such HMMs may even be assigned randomly. This means that the order of the observations (i.e., training data) and the initial probabilities used to train an HMM also affect the overall accuracy, as the following observations usually use the HMM that is trained based on earlier observations. Therefore, after using all the training data, an HMM may still have room to converge to a local optimal point due to the biases caused by the initial probabilities and the order of the training data. The usual approach to mitigate such biases is to train HMMs multiple times until the overall accuracy of an HMM converges to a certain point. We do not follow this strategy for three reasons. First, Apollo does not set the initial transition and emission probabilities randomly. Instead, the probabilities are usually set according to the error profile of an assembly. Second, we use the initial probabilities each time a read is used to train the pHMM-graph, so the order of the training data does not matter. Third, Apollo is a very time-consuming tool, and taking multiple iterations until convergence would significantly increase the overall runtime, which we want to avoid.
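The forward and backward recurrences (Equations S2.1 to S3.2), the posterior emission update (Equation S4), and the averaging step (Equation S6) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not Apollo's implementation: states are indexed 0 to N-1, transitions are stored as dense lists, probabilities are not scaled or kept in log-space, and all function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

ALPHABET = "ACGT"
Matrix = List[List[float]]


def forward_backward(read: str, start_trans: List[float], trans: Matrix,
                     end_trans: List[float],
                     emit: List[Dict[str, float]]) -> Tuple[Matrix, Matrix]:
    """Return (F, B), where F[t][j] and B[t][j] are the forward and backward
    values of state j for the 0-based read position t."""
    n, m = len(start_trans), len(read)
    F = [[0.0] * n for _ in range(m)]
    B = [[0.0] * n for _ in range(m)]
    for j in range(n):                                   # initialization (S2.1)
        F[0][j] = start_trans[j] * emit[j][read[0]]
    for t in range(1, m):                                # recursion (S2.2)
        for j in range(n):
            incoming = sum(F[t - 1][i] * trans[i][j] for i in range(n))
            F[t][j] = incoming * emit[j][read[t]]
    for i in range(n):                                   # initialization (S3.1)
        B[m - 1][i] = end_trans[i]
    for t in range(m - 2, -1, -1):                       # recursion (S3.2)
        for i in range(n):
            B[t][i] = sum(trans[i][j] * emit[j][read[t + 1]] * B[t + 1][j]
                          for j in range(n))
    return F, B


def posterior_emissions(read: str, F: Matrix, B: Matrix,
                        state: int) -> Optional[Dict[str, float]]:
    """Posterior emission probabilities of `state` (S4); None if the state
    carries no probability mass for this read."""
    weights = [F[t][state] * B[t][state] for t in range(len(read))]
    denom = sum(weights)
    if denom == 0.0:
        return None
    return {x: sum(w for t, w in enumerate(weights) if read[t] == x) / denom
            for x in ALPHABET}


def average_posteriors(posteriors: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the posterior emissions of one state over all sub-graphs (S6)."""
    return {x: sum(p[x] for p in posteriors) / len(posteriors) for x in ALPHABET}
```

A dense representation keeps the sketch short. Because each state of the pHMM-graph has only a few outgoing transitions, a practical implementation would more plausibly iterate over the existing edges of the sub-graph and guard against numerical underflow on long reads.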
Apollo uses the Viterbi algorithm [5] to reveal the polished assembly by finding the most likely path from the start state, $v_{start}$, of the trained graph $G$ to the end state, $v_{end}$. For each state $j$, the Viterbi algorithm calculates $v_t(j)$, which is the maximum marginal forward probability of state $j$ obtained by following a path that starts from the start state when decoding the $t$-th base of the polished contig. Let $X_j \in \Sigma$ be the base that has the greatest emission probability for the state $j$, i.e., $\hat{e}_j(X_j) \geq \hat{e}_j(x),\ \forall x \in \Sigma$. Then, the value of $v_t(j)$ depends on 1) the transition probability from state $i$ to state $j$, $\hat{\alpha}_{ij}$, 2) the Viterbi value of the state $i$ when decoding the $(t-1)$-th base of the polished contig, $v_{t-1}(i)$, and 3) the emission probability of the base $X_j$, $\hat{e}_j(X_j)$. The Viterbi algorithm also keeps a back pointer, $b_t(j)$, which keeps track of the predecessor state $i$ that yields the $v_t(j)$ value.

Let $T$ be the length of the decoded sequence, which is initially unknown. The algorithm recursively calculates $v$ values for each position $t$ of a decoded sequence as described in Equations S8.1 and S8.3. The algorithm stops at iteration $T^*$ such that for the last $iter$ iterations, the maximum value we have observed for $v(end)$ cannot be improved. $iter$ is set to 50 by default (empirically chosen). $T$ is then set to $t^*$ such that $v_{t^*}(end)$ is the maximum among all iterations $1 \leq t \leq T^*$.

1. Initialization

$$v_1(j) = \hat{\alpha}_{start\text{-}j}\, \hat{e}_j(X_j) \quad \forall j \in V \tag{S8.1}$$

$$b_1(j) = start \quad \forall j \in V \tag{S8.2}$$

2. Recursion

$$v_t(j) = \max_{i \in V} v_{t-1}(i)\, \hat{\alpha}_{ij}\, \hat{e}_j(X_j) \quad \forall j \in V,\ 1 < t \leq T \tag{S8.3}$$

$$b_t(j) = \operatorname*{argmax}_{i \in V} v_{t-1}(i)\, \hat{\alpha}_{ij}\, \hat{e}_j(X_j) \quad \forall j \in V,\ 1 < t \leq T \tag{S8.4}$$

3. Termination

$$v_T(end) = \max_{i \in V} v_T(i)\, \hat{\alpha}_{i\text{-}end} \tag{S8.5}$$

$$b_T(end) = \operatorname*{argmax}_{i \in V} v_T(i)\, \hat{\alpha}_{i\text{-}end} \tag{S8.6}$$

The polished contig is generated by recursively following states from the end state, $v_{end}$, at time $T$ until the back pointer points back to the start state, $v_{start}$, at time $t = 1$ for the state $j$ as follows:

Algorithm 1: Calculate contig
    t ← T − 1
    j ← end
    while t ≥ 0 do
        j ← b_{t+1}(j)
        contig[t] ← X_j
        t ← t − 1
    end while
    print contig

Performance of the Assembly Polishing Algorithms
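Returning briefly to the decoding step described above (Equations S8.1 to S8.6 and Algorithm 1), the procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Apollo's implementation: the function name `viterbi_polish`, the dictionary-based graph encoding, the `patience` parameter (standing in for $iter$), the early exit when all Viterbi values are zero, and the toy three-state graph are all hypothetical. The sketch multiplies probabilities directly, as in the equations; a practical implementation would likely work in log space to avoid underflow.

```python
def viterbi_polish(states, trans, emit, patience=50):
    """Decode the most likely base sequence from a trained graph.

    states: list of state ids (excluding "start" and "end").
    trans:  dict mapping (i, j) to a transition probability, including
            ("start", j) and (i, "end") edges; missing edges count as 0.
    emit:   dict mapping state id -> {base: emission probability}.
    """
    # X_j: the base with the greatest emission probability in state j.
    base = {j: max(emit[j], key=emit[j].get) for j in states}

    # Initialization (S8.1, S8.2).
    v = {j: trans.get(("start", j), 0.0) * emit[j][base[j]] for j in states}
    back = [{j: "start" for j in states}]  # back[t - 1][j] holds b_t(j)
    end_ptr = []                           # end_ptr[t - 1] holds b_t(end)
    best_score, best_t, since_best = 0.0, None, 0

    t = 1
    while True:
        # Termination candidate for the current length t (S8.5, S8.6).
        last = max(states, key=lambda i: v[i] * trans.get((i, "end"), 0.0))
        score = v[last] * trans.get((last, "end"), 0.0)
        end_ptr.append(last)
        if score > best_score:
            best_score, best_t, since_best = score, t, 0
        else:
            since_best += 1
        # Stop when v(end) has not improved for `patience` iterations, or
        # (an added assumption) when no path has non-zero probability left.
        if since_best >= patience or not any(v.values()):
            break

        # Recursion (S8.3, S8.4).
        new_v, ptr = {}, {}
        for j in states:
            i_best = max(states, key=lambda i: v[i] * trans.get((i, j), 0.0))
            new_v[j] = v[i_best] * trans.get((i_best, j), 0.0) * emit[j][base[j]]
            ptr[j] = i_best
        v = new_v
        back.append(ptr)
        t += 1

    if best_t is None:
        return ""

    # Backtrace from the end state at t* down to the start state.
    contig = []
    j = end_ptr[best_t - 1]
    for step in range(best_t, 0, -1):
        contig.append(base[j])
        j = back[step - 1][j]
    return "".join(reversed(contig))


# Toy graph: three states in a chain, plus an edge m1 -> m3 that skips m2.
STATES = ["m1", "m2", "m3"]
EMIT = {
    "m1": {"A": 0.9, "C": 0.1},
    "m2": {"C": 0.8, "G": 0.2},
    "m3": {"G": 0.7, "T": 0.3},
}
TRANS = {("start", "m1"): 1.0, ("m1", "m2"): 0.9, ("m1", "m3"): 0.1,
         ("m2", "m3"): 1.0, ("m3", "end"): 1.0}
TRANS_SKIP = dict(TRANS)
TRANS_SKIP.update({("m1", "m2"): 0.1, ("m1", "m3"): 0.9})

POLISHED = viterbi_polish(STATES, TRANS, EMIT)
POLISHED_SKIP = viterbi_polish(STATES, TRANS_SKIP, EMIT)
```

In the toy graph, the path through all three states wins when the transition from `m1` to `m2` is likely, while raising the probability of the skipping edge makes the shorter path the best one, so the decoded length $T$ is an outcome of the search rather than an input.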
In Tables S1, S2, S3, S6, S9, and S12, we compare the assembly polishing performance of Apollo to the competing algorithms based on the difference between the assemblies and their reference genomes (i.e., ground truth). In Tables S4, S7, S10, S13, and S15, we show the k-mer similarities between Illumina reads and the assemblies to provide an alignment-free comparison between the tools. We also use QUAST [6] to make a more detailed quality assessment of the assemblies in Tables S5, S8, S11, S14, and S16.

Table S1: Comparison between using a hybrid set of reads with Apollo and running other polishing tools multiple times to polish a Canu-generated assembly
| Dataset | First Run | Second Run | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|
| E. coli O157 | — | — | 99.94 | 0.9998 | 0.9992 | 43m 53s | 3.79 |
| E. coli O157 | Apollo (Hybrid) | — | 99.94 | 0.9999 | | 8h 16m 08s | 13.85 |
| E. coli O157 | Racon (PacBio) | Racon (Illumina) | 99.94 | 0.9994 | 0.9988 | 21m 44s | 22.65 |
| E. coli O157 | Racon (PacBio) | Racon (PacBio) | 99.94 | 0.9984 | 0.9978 | 4m 58s | 2.43 |
| E. coli O157 | Pilon (Illumina) | Pilon (Illumina) | 99.94 | 0.9999 | | | |
| E. coli O157 | Pilon (Illumina) | Racon (PacBio) | 99.94 | 0.9986 | 0.9980 | 4m 58s | 11.40 |
| E. coli O157 | Quiver (PacBio) | Quiver (PacBio) | 99.94 | 0.9998 | 0.9992 | 13m 06s | |
| E. coli O157 | Quiver (PacBio) | Pilon (Illumina) | 99.94 | 0.9998 | 0.9992 | 5m 01s | 7.50 |
| E. coli O157 | Quiver (PacBio) | Racon (PacBio) | 99.94 | 0.9986 | 0.9980 | 5m 13s | 2.48 |
| E. coli O157:H7 | — | — | 100 | 0.9998 | 0.9998 | 43m 19s | 3.39 |
| E. coli O157:H7 | Apollo (Hybrid) | — | 100 | 0.9999 | | 5h 58m 05s | 8.86 |
| E. coli O157:H7 | Racon (PacBio) | Racon (Illumina) | 100 | 0.9995 | 0.9995 | 9m 43s | 6.56 |
| E. coli O157:H7 | Racon (PacBio) | Racon (PacBio) | 100 | 0.9970 | 0.9970 | 5m 36s | 2.24 |
| E. coli O157:H7 | Pilon (Illumina) | Pilon (Illumina) | 100 | 0.9998 | 0.9998 | 35m 12s | 10.79 |
| E. coli O157:H7 | Pilon (Illumina) | Racon (PacBio) | 100 | 0.9996 | 0.9996 | 6m 04s | 10.75 |
| E. coli K-12 | — | — | 99.98 | 0.9794 | 0.9792 | 34h 21m 46s | 5.06 |
| E. coli K-12 | Apollo (Hybrid) | — | 99.99 | 0.9953 | 0.9952 | 9h 09m 50s | 9.35 |
| E. coli K-12 | Racon (ONT) | Racon (Illumina) | 100 | 0.9996 | | | |
| E. coli K-12 | Racon (ONT) | Racon (ONT) | 100 | 0.9851 | 0.9851 | 14m 45s | |
| E. coli K-12 | Pilon (Illumina) | Pilon (Illumina) | 99.99 | 0.9993 | 0.9992 | 18m 55s | 8.84 |
| E. coli K-12 | Pilon (Illumina) | Racon (ONT) | 99.99 | 0.9997 | | 15m 51s | 8.84 |
| E. coli K-12 | Nanopolish (ONT) | Nanopolish (ONT) | 99.98 | 0.9929 | 0.9927 | 25h 39m 17s | 4.84 |
| E. coli K-12 | Nanopolish (ONT) | Pilon (Illumina) | 99.99 | 0.9992 | 0.9991 | 9h 45m 01s | 18.10 |
| E. coli K-12 | Nanopolish (ONT) | Racon (ONT) | 100 | 0.9866 | 0.9866 | 9h 42m 24s | 4.54 |
| Yeast S288C | — | — | 99.89 | 0.9998 | 0.9987 | 1h 20m 39s | 6.24 |
| Yeast S288C | Apollo (Hybrid) | — | 99.89 | 0.9998 | | 11h 08m 41s | 6.38 |
| Yeast S288C | Racon (PacBio) | Racon (Illumina) | 99.89 | 0.9994 | 0.9983 | 38m 21s | 6.93 |
| Yeast S288C | Racon (PacBio) | Racon (PacBio) | 99.89 | 0.9949 | 0.9938 | 49m 52s | 6.93 |
| Yeast S288C | Pilon (Illumina) | Pilon (Illumina) | 99.89 | 0.9998 | | | |
| Yeast S288C | Quiver (PacBio) | Pilon (Illumina) | 98.95 | 0.9998 | 0.9893 | 12m 47s | 13.28 |
| Yeast S288C | Quiver (PacBio) | Racon (PacBio) | 98.93 | 0.9968 | 0.9861 | 40m 04s | 6.69 |

We use the long reads of the E. coli O157, E. coli O157:H7, E. coli K-12, and Yeast S288C datasets to generate their assemblies with Canu. Here, the polishing tools specified under First Run and Second Run polish the assembly using the set of reads specified in parentheses. The set of reads used in the second run is aligned to the assembly polished in the first run using Minimap2. The PacBio and Illumina sets of reads together constitute the hybrid set of reads (i.e., Hybrid). We report the performance of the polishing tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and the Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. We show the best result among assembly polishing algorithms for each performance metric in bold text.
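The Polishing Score defined above is a plain product, which the following minimal Python sketch makes explicit. The function name `polishing_score` is hypothetical. The sample inputs are the Aligned Bases and Accuracy values reported for the unpolished Canu assembly of E. coli O157 (99.94 and 0.9998), for which the table reports a Polishing Score of 0.9992. Since the reported inputs are themselves rounded, a recomputed score may differ from a reported one in the last digit for other rows.

```python
def polishing_score(aligned_bases_percent, accuracy):
    """Product of Accuracy and Aligned Bases expressed as a fraction."""
    return accuracy * (aligned_bases_percent / 100.0)


# Unpolished Canu assembly of E. coli O157: 99.94% aligned, accuracy 0.9998.
SCORE = round(polishing_score(99.94, 0.9998), 4)
```

The score penalizes an assembly both for bases that fail to align to the reference and for mismatches within the aligned portion, so a high Accuracy alone does not yield a high score when few bases align.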
Table S2: Comparison between using a hybrid set of reads with Apollo and running other polishing tools multiple times to polish a Miniasm-generated assembly

| Dataset | First Run | Second Run | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|
| E. coli O157 | — | — | 94.93 | 0.9000 | 0.8544 | 1m 48s | 10.03 |
| E. coli O157 | Apollo (Hybrid) | — | 98.70 | 0.9866 | 0.9738 | 3h 51m 51s | 12.08 |
| E. coli O157 | Racon (PacBio) | Racon (Illumina) | 99.37 | 0.9992 | 0.9929 | 21m 19s | 22.66 |
| E. coli O157 | Racon (PacBio) | Racon (PacBio) | 99.51 | 0.9980 | 0.9931 | 5m 00s | 2.46 |
| E. coli O157 | Pilon (Illumina) | Pilon (Illumina) | 96.88 | 0.9872 | 0.9564 | 34m 53s | 18.60 |
| E. coli O157 | Pilon (Illumina) | Racon (PacBio) | 98.87 | 0.9970 | 0.9857 | 35m 26s | 18.60 |
| E. coli O157 | Quiver (PacBio) | Quiver (PacBio) | 99.85 | 0.9994 | | 13m 45s | 5.05 |
| E. coli O157 | Quiver (PacBio) | Pilon (Illumina) | 99.80 | 0.9994 | 0.9974 | 9m 42s | 4.76 |
| E. coli O157 | Quiver (PacBio) | Racon (PacBio) | 99.81 | 0.9984 | 0.9965 | 10m 29s | 2.49 |
| E. coli O157:H7 | — | — | 88.56 | 0.8798 | 0.7792 | 2m 57s | 6.27 |
| E. coli O157:H7 | Apollo (Hybrid) | — | 97.53 | 0.9804 | 0.9562 | 2h 54m 55s | 8.34 |
| E. coli O157:H7 | Racon (PacBio) | Racon (Illumina) | 99.02 | 0.9991 | | 9m 24s | 6.56 |
| E. coli O157:H7 | Racon (PacBio) | Racon (PacBio) | 99.22 | 0.9954 | 0.9876 | 5m 31s | 2.24 |
| E. coli O157:H7 | Racon (PacBio) | Pilon (Illumina) | 99.12 | 0.9981 | | 20m 37s | 12.57 |
| E. coli O157:H7 | Pilon (Illumina) | Pilon (Illumina) | 96.32 | 0.9896 | 0.9532 | 35m 12s | 15.84 |
| E. coli K-12 | — | — | 86.68 | 0.8503 | 0.7370 | 4m 04s | 16.47 |
| E. coli K-12 | Apollo (Hybrid) | — | 97.53 | 0.9419 | 0.9186 | 2h 18m 33s | 9.12 |
| E. coli K-12 | Racon (ONT) | Racon (Illumina) | 99.51 | 0.9992 | | | |
| E. coli K-12 | Racon (ONT) | Racon (ONT) | 99.78 | 0.9840 | 0.9818 | 11m 43s | |
| E. coli K-12 | Pilon (Illumina) | Pilon (Illumina) | 89.61 | 0.9622 | 0.8622 | 32m 03s | 17.78 |
| E. coli K-12 | Pilon (Illumina) | Racon (ONT) | 99.43 | 0.9979 | 0.9922 | 25m 15s | 32.15 |
| E. coli K-12 | Nanopolish (ONT) | Nanopolish (ONT) | 97.35 | 0.9488 | 0.9236 | 241h 56m 10s | 8.49 |
| E. coli K-12 | Nanopolish (ONT) | Pilon (Illumina) | 96.48 | 0.9769 | 0.9425 | 117h 29m 47s | 32.15 |
| E. coli K-12 | Nanopolish (ONT) | Racon (ONT) | 99.62 | 0.9814 | 0.9776 | 117h 08m 16s | 8.49 |
| Yeast S288C | — | — | 95.05 | 0.8923 | 0.8481 | 2m 20s | 16.59 |
| Yeast S288C | Apollo (Hybrid) | — | 98.49 | 0.9709 | 0.9562 | 6h 37m 46s | 5.96 |
| Yeast S288C | Racon (PacBio) | Racon (Illumina) | 99.26 | 0.9986 | 0.9912 | 23m 51s | 6.75 |
| Yeast S288C | Racon (PacBio) | Racon (PacBio) | 99.33 | 0.9937 | 0.9879 | 43m 00s | 6.75 |
| Yeast S288C | Racon (PacBio) | Pilon (Illumina) | 99.23 | 0.9977 | 0.9900 | 22m 07s | 14.86 |
| Yeast S288C | Pilon (Illumina) | Pilon (Illumina) | 95.80 | 0.9595 | 0.9192 | 2m 35s | |
| Yeast S288C | Quiver (PacBio) | Pilon (Illumina) | 99.45 | 0.9996 | | 12m 23s | 13.40 |
| Yeast S288C | Quiver (PacBio) | Racon (PacBio) | 99.50 | 0.9965 | 0.9915 | 29m 31s | 6.39 |

We use the long reads of the E. coli O157, E. coli O157:H7, E. coli K-12, and Yeast S288C datasets to generate their assemblies with Miniasm. The polishing tools specified under First Run and Second Run polish the assembly using the set of reads specified in parentheses. The set of reads used in the second run is aligned to the assembly polished in the first run using Minimap2. The PacBio and Illumina sets of reads together constitute the hybrid set of reads (i.e., Hybrid). We report the performance of the polishing tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and the Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. We show the best result among assembly polishing algorithms for each performance metric in bold text.
Table S3: Assembly polishing performance of the tools for E. coli O157 dataset

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 94.93 | 0.9000 | 0.8544 | 1m 48s | 10.03 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 98.49 | 0.9798 | 0.9650 | 2h 27m 49s | 7.07 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 96.43 | 0.9528 | 0.9188 | 1h 31m 32s | 17.68 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.35 | 0.9951 | | | |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 99.80 | 0.9993 | | | |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.61 | 0.9816 | | 4h 25m 17s | |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 96.52 | 0.9775 | 0.9435 | 32m 48s | 18.60 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 96.45 | 0.9876 | 0.9525 | 14m 09s | |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 96.13 | 0.9693 | 0.9318 | 31m 21s | 18.45 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 96.90 | 0.9813 | | | |
| | | | | | | | | 3h 42m 03s | 8.82 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 99.94 | 0.9986 | 0.9980 | 2m 17s | 2.34 |
| PacBio | Canu | pbalign | PacBio | Quiver | 99.94 | 0.9998 | | | |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 99.94 | 0.9999 | | 4h 49m 15s | |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 99.94 | 0.9998 | 0.9992 | 2m 05s | |
| | | | | | | | | 14m 58s | 21.04 |
| PacBio (30×) | Miniasm∗ | — | — | — | — | — | — | — | — |
| PacBio (30×) | Canu | — | — | — | 99.98 | 0.9981 | 0.9979 | 21m 03s | 3.70 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Apollo | 99.98 | 0.9982 | | 43m 32s | 8.00 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Racon | 99.98 | 0.9980 | 0.9978 | 15s | 0.59 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Apollo | 99.97 | 0.9976 | 0.9973 | 46m 10s | 7.99 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Racon | 99.98 | 0.9983 | | | |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Apollo | 99.98 | 0.9997 | 0.9995 | 4h 48m 31s | 10.35 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Pilon | 99.98 | 0.9998 | | | |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Racon | 99.98 | 0.9997 | 0.9995 | 14m 42s | 21.04 |

We polish the PacBio assemblies of E. coli O157 for different combinations of sequencing technology, assembler, aligner, and polishing algorithm. Canu-corrected long reads are labeled as "Corr.". We report the performance of the tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and a Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. For the rows that do not specify assembly polishing algorithms, we only report the runtime and the memory requirements of the assemblers as well as the accuracy of the unpolished assembly that they construct. We show the best result among assembly polishing algorithms for each performance metric in bold text. ∗ denotes that Miniasm cannot produce an assembly given the specified set of reads.

Table S4: K-mer similarity between Illumina reads and the E. coli O157 assemblies
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| PacBio | Reference | — | — | — | 100 / 100 | 99.89 / 99.98 | 99.92 / 99.96 | 99.66 / 99.96 |
| PacBio | Miniasm | — | — | — | 90.67 / 83.48 | 14.31 / 13.53 | 5.61 / 5.21 | 1.12 / 1.04 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 96.19 / 94.94 | 76.20 / 74.70 | 66.76 / 64.01 | 54.77 / 52.38 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 93.63 / 89.91 | 46.18 / 44.24 | 31.07 / 28.92 | 14.57 / 13.70 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.47 / 98.70 | 94.89 / 94.11 | 91.11 / 89.05 | 85.22 / 84.67 |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 100 / 99.61 | 99.81 / 99.06 | 99.65 / 98.41 | 99.16 / 98.31 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.11 / 95.42 | 83.33 / 82.33 | 78.23 / 76.56 | 71.05 / 69.02 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 96.52 / 93.93 | 83.74 / 80.15 | 82.25 / 77.44 | 79.02 / 74.49 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 97.31 / 96.42 | 90.35 / 90.02 | 88.61 / 87.88 | 87.98 / 87.34 |
| PacBio | Miniasm | BWA-MEM | Illumina | Apollo | 96.98 / 94.19 | 80.06 / 77.20 | 75.18 / 72.08 | 67.71 / 64.42 |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 96.32 / 93.20 | 79.65 / 75.30 | 76.75 / 72.32 | 72.92 / 67.16 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 96.91 / 95.10 | 85.89 / 85.27 | 84.00 / 83.88 | 82.36 / 81.06 |
| PacBio | Canu | — | — | — | 100 / 99.93 | 99.63 / 99.78 | 99.46 / 99.42 | 98.93 / 99.00 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 100 / 99.93 | 99.50 / 99.74 | 99.17 / 99.50 | 98.50 / 99.11 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 99.87 / 99.74 | 98.44 / 98.52 | 97.37 / 97.39 | 95.63 / 95.78 |
| PacBio | Canu | pbalign | PacBio | Quiver | 100 / 100 | 99.80 / 99.72 | 99.67 / 99.44 | 99.40 / 99.25 |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 100 / 100 | 99.83 / 99.91 | 99.73 / 99.77 | 99.59 / 99.61 |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 100 / 100 | 99.83 / 99.93 | 99.73 / 99.77 | 99.59 / 99.62 |
| PacBio | Canu | BWA-MEM | Illumina | Racon | 100 / 100 | 99.81 / 99.91 | 99.71 / 99.75 | 99.57 / 99.53 |
| PacBio (30×) | Canu | — | — | — | 99.47 / 99.41 | 96.74 / 96.88 | 95.20 / 94.92 | 92.31 / 91.39 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Apollo | 99.61 / 99.41 | 97.04 / 97.40 | 95.41 / 95.63 | 92.67 / 92.48 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Racon | 99.80 / 99.61 | 97.00 / 97.34 | 95.12 / 95.16 | 92.63 / 92.98 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Apollo | 99.41 / 99.41 | 97.00 / 97.47 | 95.31 / 95.72 | 92.44 / 92.93 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Racon | 99.67 / 99.48 | 97.48 / 98.19 | 96.00 / 96.56 | 93.12 / 94.07 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Apollo | 100 / 99.93 | 99.83 / 99.54 | 99.69 / 99.52 | 99.55 / 99.23 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Pilon | 100 / 99.93 | 99.83 / 99.70 | 99.69 / 99.58 | 99.51 / 99.31 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Racon | 100 / 99.93 | 99.89 / 99.63 | 99.73 / 99.62 | 99.55 / 99.29 |

We report the pairs of the percentage of 1) k-mers of Illumina reads present in the assembly and 2) k-mers of the assembly present in the Illumina reads (separated by "/") in k-mer Sim. using a fixed k-mer size (i.e., k ∈ {11, 21, 31, 51}). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the rows that do not specify assembly polishing algorithms, we only report the k-mer similarity between the Illumina set of reads and either the unpolished assembly or the reference.
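The pair of percentages reported in the k-mer Sim. columns can be illustrated with a minimal Python sketch. The text above does not state which tool computes these values, so the details here are assumptions: the sketch compares sets of distinct k-mers by exact string matching, ignores reverse complements and k-mer multiplicities, and uses the hypothetical function names `kmers` and `kmer_similarity` with toy sequences.

```python
def kmers(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_similarity(reads, assembly, k):
    """Return (percent of read k-mers found in the assembly,
               percent of assembly k-mers found in the reads)."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    assembly_kmers = kmers(assembly, k)
    if not read_kmers or not assembly_kmers:
        return 0.0, 0.0  # assumption: report 0 when either side has no k-mers
    shared = read_kmers & assembly_kmers
    return (100 * len(shared) / len(read_kmers),
            100 * len(shared) / len(assembly_kmers))


# Toy example with k = 3.
READS = ["ACGT", "CGTA"]
ASSEMBLY = "ACGTAC"
SIMILARITY = kmer_similarity(READS, ASSEMBLY, 3)
```

In the toy example, all three distinct read 3-mers (ACG, CGT, GTA) occur in the assembly, while the assembly has a fourth 3-mer (TAC) that no read contains, so the two percentages differ. This asymmetry is why each table cell reports a pair of values.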
Table S5: Quality assessment of the E. coli O157 assemblies

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage ≥ 10× (%) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Reference | — | — | — | 50.48 | 99.92 | 99.49 | 564 | 99.94 |
| PacBio | Miniasm | — | — | — | 49.88 | 92.08 | 87.50 | 434 | 87.90 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 50.28 | 98.74 | 97.43 | 531 | 96.19 |
| PacBio | Miniasm | Minimap2 | PacBio | Pilon | 50.14 | 99.17 | 97.20 | 526 | 93.78 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 50.52 | 99.63 | 99.03 | 542 | 98.35 |
| PacBio | Miniasm | pbalign | PacBio | Quiver | 50.56 | 99.83 | 99.40 | 545 | 98.56 |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 50.37 | 96.49 | 94.60 | 513 | 93.74 |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 50.36 | 95.58 | 92.04 | 499 | 89.57 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 50.45 | 96.48 | 94.73 | 514 | 94.11 |
| PacBio | Miniasm | BWA-MEM | Illumina | Apollo | 50.30 | 95.55 | 92.22 | 498 | 89.58 |
| PacBio | Miniasm | BWA-MEM | Illumina | Pilon | 50.30 | 94.48 | 89.64 | 478 | 86.54 |
| PacBio | Miniasm | BWA-MEM | Illumina | Racon | 50.37 | 94.63 | 90.69 | 508 | 90.76 |
| PacBio | Canu | — | — | — | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 50.36 | 99.90 | 99.46 | 547 | 99.92 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 50.35 | 99.89 | 99.44 | 547 | 99.89 |
| PacBio | Canu | pbalign | PacBio | Quiver | 50.36 | 99.90 | 99.46 | 547 | 99.38 |
| PacBio | Canu | BWA-MEM | Illumina | Apollo | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | BWA-MEM | Illumina | Pilon | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio | Canu | BWA-MEM | Illumina | Racon | 50.36 | 99.90 | 99.46 | 547 | 99.73 |
| PacBio (30×) | Canu | — | — | — | 50.44 | 99.89 | 99.42 | 560 | 99.61 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Apollo | 50.46 | 99.89 | 99.44 | 560 | 99.91 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×) | Racon | 50.44 | 99.89 | 99.43 | 560 | 99.94 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Apollo | 50.46 | 99.89 | 99.42 | 560 | 99.92 |
| PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr.) | Racon | 50.46 | 99.89 | 99.42 | 560 | 99.97 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Apollo | 50.47 | 99.89 | 99.44 | 560 | 99.70 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Pilon | 50.47 | 99.89 | 99.44 | 560 | 99.71 |
| PacBio (30×) | Canu | BWA-MEM | Illumina | Racon | 50.47 | 99.89 | 99.43 | 560 | 99.69 |

We report the quality assessment of the assemblies as reported by QUAST [6]. QUAST reports the GC content and uses the filtered Illumina reads to measure 1) the percentage of the short reads that mapped to the assembly (Mapped Reads), 2) the percentage of Properly Paired reads that mapped within the expected range of each other to the assembly, 3) the average depth of coverage (Avg. Coverage), and 4) the percentage of the bases with at least 10× coverage (Coverage ≥ 10×). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner to polish the assembly. For the rows that do not specify assembly polishing algorithms, we only report the quality assessment of either the unpolished assembly or the reference.
Table S6: Assembly polishing performance of the tools for E. coli O157:H7 dataset

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 88.56 | 0.8798 | 0.7792 | 2m 57s | 6.27 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 96.99 | 0.9636 | 0.9346 | 1h 10m 23s | 7.07 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 98.94 | 0.9899 | | | |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 96.06 | 0.9781 | 0.9396 | 2h 17m 28s | |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 95.09 | 0.9791 | 0.9310 | 28m 54s | 15.84 |
| PacBio | Miniasm | Minimap2 | Illumina | Racon | 96.17 | 0.9883 | | | |
| | | | | | | | | 2h 57m 18s | 7.58 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 100 | 0.9975 | 0.9975 | 2m 50s | 2.23 |
| PacBio | Canu | Minimap2 | Illumina | Apollo | 100 | 0.9997 | 0.9997 | 3h 10m 16s | |
| PacBio | Canu | Minimap2 | Illumina | Pilon | 100 | 0.9999 | | | |

We polish the PacBio assemblies of E. coli O157:H7 for different combinations of sequencing technology, assembler, aligner, and polishing algorithm. Canu-corrected long reads are labeled as "Corr.". We report the performance of the tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and a Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. For the rows that do not specify assembly polishing algorithms, we only report the runtime and the memory requirements of the assemblers as well as the accuracy of the unpolished assembly that they construct. We show the best result among assembly polishing algorithms for each performance metric in bold text. ∗ denotes that Miniasm cannot produce an assembly given the specified set of reads.

Table S7: K-mer similarity between Illumina reads and the E. coli O157:H7 assemblies
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| E. coli O157:H7 | Reference | — | — | — | 99.93 / 100 | 99.78 / 99.94 | 99.73 / 99.96 | 99.70 / 99.92 |
| E. coli O157:H7 | Miniasm | — | — | — | 91.14 / 81.04 | 9.01 / 7.94 | 3.25 / 2.74 | 0.37 / 0.33 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Apollo | 96.46 / 91.36 | 61.52 / 57.92 | 52.73 / 48.27 | 35.22 / 32.38 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Racon | 98.10 / 96.95 | 88.45 / 85.70 | 84.37 / 80.22 | 74.61 / 70.87 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Apollo | 97.97 / 93.43 | 81.92 / 78.79 | 77.05 / 72.69 | 66.69 / 63.21 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Pilon | 97.64 / 92.25 | 85.57 / 79.87 | 84.74 / 78.02 | 80.92 / 75.85 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Racon | 98.36 / 94.57 | 91.28 / 89.04 | 90.77 / 87.49 | 88.78 / 87.23 |
| E. coli O157:H7 | Canu | — | — | — | 99.80 / 99.93 | 99.41 / 99.57 | 99.13 / 99.46 | 99.06 / 98.99 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Apollo | 99.80 / 99.93 | 99.35 / 99.57 | 99.08 / 99.44 | 98.82 / 98.88 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Racon | 99.54 / 99.61 | 96.81 / 96.67 | 95.27 / 95.22 | 91.84 / 91.72 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Apollo | 99.93 / 99.93 | 99.48 / 99.85 | 99.17 / 99.71 | 98.95 / 99.70 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Pilon | 99.87 / 100 | 99.78 / 99.91 | 99.73 / 99.88 | 99.70 / 99.79 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Racon | 99.80 / 100 | 99.31 / 99.83 | 99.00 / 99.88 | 98.63 / 99.47 |

We report the pairs of the percentage of 1) k-mers of Illumina reads present in the assembly and 2) k-mers of the assembly present in the Illumina reads (separated by "/") in k-mer Sim. using a fixed k-mer size (i.e., k ∈ {11, 21, 31, 51}). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the rows that do not specify assembly polishing algorithms, we only report the k-mer similarity between the Illumina set of reads and either the unpolished assembly or the reference.
Table S8: Quality assessment of the E. coli O157:H7 assemblies

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage ≥ 10× (%) |
|---|---|---|---|---|---|---|---|---|---|
| E. coli O157:H7 | Reference | — | — | — | 50.43 | 97.42 | 94.3 | 183 | 99.93 |
| E. coli O157:H7 | Miniasm | — | — | — | 49.61 | 80.51 | 68.24 | 108 | 76.01 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Apollo | 50.09 | 95.0 | 88.69 | 163 | 91.74 |
| E. coli O157:H7 | Miniasm | Minimap2 | PacBio | Racon | 50.55 | 97.03 | 93.06 | 173 | 96.59 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Apollo | 50.39 | 93.6 | 87.69 | 162 | 90.65 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Pilon | 50.36 | 93.01 | 85.66 | 159 | 86.75 |
| E. coli O157:H7 | Miniasm | Minimap2 | Illumina | Racon | 50.48 | 93.84 | 88.52 | 163 | 91.67 |
| E. coli O157:H7 | Canu | — | — | — | 50.43 | 97.42 | 94.32 | 182 | 99.71 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Apollo | 50.44 | 97.42 | 94.32 | 182 | 99.87 |
| E. coli O157:H7 | Canu | Minimap2 | PacBio | Racon | 50.41 | 97.4 | 94.22 | 182 | 99.73 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Apollo | 50.45 | 97.42 | 94.31 | 182 | 99.95 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Pilon | 50.44 | 97.42 | 94.33 | 182 | 99.71 |
| E. coli O157:H7 | Canu | Minimap2 | Illumina | Racon | 50.45 | 97.42 | 94.29 | 182 | 99.98 |

We report the quality assessment of the assemblies as reported by QUAST [6]. QUAST reports the GC content and uses the filtered Illumina reads to measure 1) the percentage of the short reads that mapped to the assembly (Mapped Reads), 2) the percentage of Properly Paired reads that mapped within the expected range of each other to the assembly, 3) the average depth of coverage (Avg. Coverage), and 4) the percentage of the bases with at least 10× coverage (Coverage ≥ 10×). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner to polish the assembly. For the rows that do not specify assembly polishing algorithms, we only report the quality assessment of either the unpolished assembly or the reference.
Table S9: Assembly polishing performance of the tools for E. coli K-12 MG1655 dataset

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| ONT | Miniasm | — | — | — | 86.68 | 0.8503 | 0.7370 | 4m 04s | 16.47 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 97.50 | 0.9209 | 0.8979 | 1h 40m 08s | 7.96 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 96.01 | 0.9182 | 0.8816 | 117h 02m 10s | 8.49 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 99.41 | 0.9769 | | | |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 89.41 | 0.9291 | | 54m 46s | |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 89.22 | 0.9310 | 0.8306 | 17m 28s | |
| | | | | | | | | 9h 35m 26s | 4.54 |
| ONT | Canu | Minimap2 | ONT | Racon | 100 | 0.9840 | 0.9840 | 7m 22s | 4.20 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 99.96 | 0.9982 | 0.9978 | 2h 09m 47s | |
| ONT | Canu | Minimap2 | Illumina | Pilon | 99.99 | 0.9987 | | | |
| ONT (30×) | Miniasm∗ | — | — | — | — | — | — | — | — |
| ONT (30×) | Canu | — | — | — | 99.98 | 0.9744 | 0.9742 | 3h 17m 47s | 4.54 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Apollo | 99.98 | 0.9752 | 0.9750 | 40m 37s | 7.74 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Nanopolish | 99.99 | 0.9857 | | 4h 07m 06s | 2.15 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Racon | 100 | 0.9825 | 0.9825 | 20s | 0.59 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Apollo | 99.96 | 0.9755 | 0.9751 | 46m 40s | 7.75 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Racon | 100 | 0.9799 | | | |

We polish the ONT assemblies of E. coli K-12 MG1655 for different combinations of assembler and polishing algorithm. Canu-corrected long reads are labeled as "Corr.". We report the performance of the tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and a Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. For the rows that do not specify assembly polishing algorithms, we only report the runtime and the memory requirements of the assemblers as well as the accuracy of the unpolished assembly that they construct. We show the best result among assembly polishing algorithms for each performance metric in bold text. ∗ denotes that Miniasm cannot produce an assembly given the specified set of reads.

Table S10: K-mer similarity between Illumina reads and the E. coli K-12 assemblies
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| ONT | Reference | — | — | — | 99.79 / 100 | 99.37 / 99.70 | 99.35 / 99.51 | 99.22 / 99.65 |
| ONT | Miniasm | — | — | — | 82.92 / 80.97 | 13.49 / 14.57 | 5.40 / 5.59 | 1.22 / 1.29 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 88.09 / 87.01 | 39.46 / 41.06 | 26.10 / 27.06 | 12.11 / 12.20 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 89.67 / 87.09 | 47.47 / 48.79 | 38.19 / 37.81 | 25.04 / 25.30 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 93.25 / 95.02 | 75.24 / 74.16 | 63.69 / 63.36 | 48.72 / 47.87 |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 91.25 / 87.17 | 50.96 / 53.20 | 44.37 / 44.28 | 32.54 / 32.75 |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 89.60 / 86.45 | 56.27 / 58.38 | 51.30 / 52.20 | 44.28 / 45.29 |
| ONT | Canu | — | — | — | 92.08 / 95.91 | 76.08 / 76.15 | 66.05 / 66.09 | 49.94 / 49.87 |
| ONT | Canu | Minimap2 | ONT | Apollo | 92.15 / 95.91 | 76.93 / 77.04 | 67.52 / 67.37 | 51.35 / 50.94 |
| ONT | Canu | Minimap2 | ONT | Nanopolish | 97.04 / 98.60 | 90.74 / 91.32 | 86.95 / 86.20 | 79.33 / 78.49 |
| ONT | Canu | Minimap2 | ONT | Racon | 94.49 / 96.89 | 80.33 / 80.42 | 72.03 / 71.58 | 57.39 / 56.79 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 99.24 / 99.65 | 97.72 / 97.88 | 97.35 / 96.94 | 96.26 / 95.82 |
| ONT | Canu | Minimap2 | Illumina | Pilon | 99.59 / 99.59 | 98.10 / 98.39 | 98.37 / 97.70 | 97.06 / 96.46 |
| ONT (30×) | Canu | — | — | — | 90.36 / 94.87 | 71.60 / 72.15 | 59.89 / 59.83 | 41.94 / 42.42 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Apollo | 91.05 / 94.84 | 72.62 / 73.06 | 60.96 / 61.17 | 43.50 / 43.84 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Nanopolish | 95.94 / 96.80 | 83.00 / 82.30 | 75.37 / 73.76 | 61.85 / 60.96 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Racon | 93.73 / 96.46 | 79.13 / 78.91 | 68.55 / 68.62 | 53.17 / 53.00 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Apollo | 91.05 / 94.97 | 72.64 / 73.56 | 61.21 / 62.11 | 42.89 / 43.54 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Racon | 92.08 / 95.91 | 74.79 / 76.08 | 64.47 / 65.09 | 47.06 / 47.38 |

We report the pairs of the percentage of 1) k-mers of Illumina reads present in the assembly and 2) k-mers of the assembly present in the Illumina reads (separated by "/") in k-mer Sim. using a fixed k-mer size (i.e., k ∈ {11, 21, 31, 51}). We generate the assemblies for the Datasets using the reads sequenced from ONT. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the rows that do not specify assembly polishing algorithms, we only report the k-mer similarity between the Illumina set of reads and either the unpolished assembly or the reference.
Table S11: Quality assessment of the E. coli K-12 assemblies

| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage ≥ 10× (%) |
|---|---|---|---|---|---|---|---|---|---|
| ONT | Reference | — | — | — | 50.79 | 99.70 | 98.96 | 237 | 99.55 |
| ONT | Miniasm | — | — | — | 52.62 | 90.85 | 82.50 | 147 | 75.72 |
| ONT | Miniasm | Minimap2 | ONT | Apollo | 52.23 | 97.44 | 94.28 | 216 | 94.84 |
| ONT | Miniasm | Minimap2 | ONT | Nanopolish | 52.10 | 96.97 | 90.32 | 200 | 90.35 |
| ONT | Miniasm | Minimap2 | ONT | Racon | 51.12 | 99.09 | 97.71 | 234 | 98.51 |
| ONT | Miniasm | Minimap2 | Illumina | Apollo | 51.89 | 92.90 | 86.52 | 181 | 80.33 |
| ONT | Miniasm | Minimap2 | Illumina | Pilon | 52.11 | 92.59 | 85.77 | 175 | 78.64 |
| ONT | Canu | — | — | — | 51.05 | 99.61 | 98.71 | 233 | 98.75 |
| ONT | Canu | Minimap2 | ONT | Apollo | 50.90 | 99.67 | 98.57 | 234 | 98.31 |
| ONT | Canu | Minimap2 | ONT | Nanopolish | 51.04 | 99.66 | 98.83 | 234 | 98.77 |
| ONT | Canu | Minimap2 | ONT | Racon | 51.01 | 99.65 | 98.75 | 234 | 99.24 |
| ONT | Canu | Minimap2 | Illumina | Apollo | 50.81 | 99.68 | 98.80 | 235 | 98.58 |
| ONT | Canu | Minimap2 | Illumina | Pilon | 50.80 | 99.68 | 98.77 | 235 | 98.76 |
| ONT (30×) | Canu | — | — | — | 51.11 | 99.60 | 98.57 | 234 | 99.04 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Apollo | 51.14 | 99.60 | 98.59 | 234 | 99.19 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Nanopolish | 51.12 | 99.65 | 98.72 | 235 | 98.92 |
| ONT (30×) | Canu | Minimap2 | ONT (30×) | Racon | 51.05 | 99.64 | 98.78 | 234 | 99.35 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Apollo | 51.14 | 99.60 | 98.65 | 234 | 99.28 |
| ONT (30×) | Canu | Minimap2 | ONT (30×, Corr.) | Racon | 51.08 | 99.63 | 98.80 | 234 | 99.40 |

We report the quality assessment of the assemblies as reported by QUAST [6]. QUAST reports the GC content and uses the filtered Illumina reads to measure 1) the percentage of the short reads that mapped to the assembly (Mapped Reads), 2) the percentage of Properly Paired reads that mapped within the expected range of each other to the assembly, 3) the average depth of coverage (Avg. Coverage), and 4) the percentage of the bases with at least 10× coverage (Coverage ≥ 10×). We generate the assemblies for the Datasets using the reads sequenced from ONT. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner to polish the assembly. For the rows that do not specify assembly polishing algorithms, we only report the quality assessment of either the unpolished assembly or the reference.
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | Aligned Bases (%) | Accuracy | Polishing Score | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| PacBio | Miniasm | — | — | — | 95.05 | 0.8923 | 0.8481 | 2m 23s | 16.59 |
| PacBio | Miniasm | Minimap2 | PacBio | Apollo | 98.44 | 0.9706 | 0.9555 | 6h 53m 51s | 4.62 |
| PacBio | Miniasm | Minimap2 | PacBio | Racon | 99.15 | 0.9895 | 0.9811 | 18m 55s | 6.63 |
| PacBio | Miniasm | Minimap2 | PacBio | Quiver | 99.44 | 0.9995 | | | |
| PacBio | Miniasm | Minimap2 | Illumina | Apollo | 97.26 | 0.9733 | 0.9466 | 2h 05m 58s | |
| PacBio | Miniasm | Minimap2 | Illumina | Pilon | 97.06 | 0.9761 | 0.9474 | | |
| | | | | | | | | 4m 00s | |
| | | | | | | | | 5m 00s | 7.34 |
| PacBio | Canu | — | — | — | 99.89 | 0.9998 | 0.9987 | 1h 20m 39s | 6.24 |
| PacBio | Canu | Minimap2 | PacBio | Apollo | 98.95 | 0.9997 | 0.9892 | 10h 59m 10s | 5.05 |
| PacBio | Canu | Minimap2 | PacBio | Racon | 98.93 | 0.9964 | 0.9857 | 19m 16s | 6.82 |
| PacBio | Canu | Minimap2 | PacBio | Quiver | 98.95 | 0.9998 | | | |
| PacBio | Canu | Minimap2 | Illumina | Apollo | 98.95 | 0.9998 | | | |
| | | | | | | | | 1h 22m 24s | |
| PacBio | Canu | Minimap2 | Illumina | Pilon | 98.95 | 0.9998 | | | |
| | | | | | | | | 2m 55s | 5.15 |
We polish the PacBio assemblies of Yeast S288C for different combinations of sequencing technology, assembler, aligner, and polishing algorithm. Canu-corrected long reads are labeled as "Corr.". We report the performance of the tools in terms of the percentage of bases of an assembly that align to its reference (i.e., Aligned Bases), the fraction of identical portions between the aligned bases of an assembly and the reference (i.e., Accuracy) as calculated by dnadiff, and a Polishing Score value that is the product of Accuracy and Aligned Bases (as a fraction). We report the runtime and the memory requirements of the assembly polishing tools. For the rows that do not specify assembly polishing algorithms, we only report the runtime and the memory requirements of the assemblers as well as the accuracy of the unpolished assembly that they construct. We show the best result among assembly polishing algorithms for each performance metric in bold text. ∗ denotes that Miniasm cannot produce an assembly given the specified set of reads.

Table S13: K-mer similarity between Illumina reads and the Yeast S288C assemblies
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 11-mer Sim. (%) | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|---|---|
| Yeast S288C | Reference | — | — | — | 100 / 100 | 99.96 / 99.87 | 99.87 / 99.71 | 99.73 / 99.59 |
| Yeast S288C | Miniasm | — | — | — | 95.49 / 91.36 | 12.06 / 10.85 | 4.38 / 3.84 | 0.62 / 0.55 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Apollo | 98.79 / 96.71 | 65.93 / 62.88 | 53.80 / 50.13 | 35.83 / 33.02 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Racon | 99.39 / 98.63 | 88.15 / 86.21 | 82.35 / 79.89 | 72.60 / 69.48 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Quiver | 99.89 / 99.34 | 99.38 / 98.42 | 99.07 / 98.19 | 98.98 / 97.63 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Apollo | 98.35 / 96.65 | 77.96 / 74.13 | 69.85 / 66.35 | 59.06 / 55.89 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Pilon | 98.84 / 96.25 | 84.87 / 79.60 | 82.25 / 77.24 | 80.12 / 74.60 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Racon | 98.51 / 97.18 | 89.53 / 87.02 | 87.49 / 84.96 | 87.02 / 83.89 |
| Yeast S288C | Canu | — | — | — | 100 / 99.45 | 99.91 / 99.09 | 99.86 / 98.97 | 99.60 / 98.56 |
| Yeast S288C | Canu | Minimap2 | PacBio | Apollo | 99.94 / 99.45 | 99.87 / 99.11 | 99.74 / 98.95 | 99.46 / 98.58 |
| Yeast S288C | Canu | Minimap2 | PacBio | Racon | 99.94 / 99.40 | 96.37 / 94.96 | 94.20 / 92.48 | 89.17 / 87.70 |
| Yeast S288C | Canu | Minimap2 | PacBio | Quiver | 100 / 99.62 | 99.93 / 99.19 | 99.89 / 98.95 | 99.76 / 98.69 |
| Yeast S288C | Canu | Minimap2 | Illumina | Apollo | 100 / 99.45 | 99.92 / 99.10 | 99.88 / 98.93 | 99.68 / 98.58 |
| Yeast S288C | Canu | Minimap2 | Illumina | Pilon | 100 / 99.45 | 99.94 / 99.13 | 99.89 / 98.95 | 99.74 / 98.69 |
| Yeast S288C | Canu | Minimap2 | Illumina | Racon | 100 / 99.45 | 99.94 / 99.15 | 99.89 / 98.95 | 99.75 / 98.67 |
We report the pairs of the percentage of 1) k-mers of Illumina reads present in the assembly and 2) k-mers of the assembly present in the Illumina reads (separated by "/") in k-mer Sim. using a fixed k-mer size (i.e., k ∈ {11, 21, 31, 51}). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the rows that do not specify assembly polishing algorithms, we only report the k-mer similarity between the Illumina set of reads and either the unpolished assembly or the reference.
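The pair of percentages reported as k-mer Sim. can be sketched as follows. This is only an illustration under stated assumptions: it compares sets of distinct k-mers by exact string match on the forward strand, ignores k-mer multiplicity and reverse complements, and uses hypothetical function names (`kmer_set`, `kmer_similarity`) and toy sequences. The k-mer counting tool and settings actually used for the tables are not specified in this section.

```python
from typing import Iterable, Set, Tuple


def kmer_set(sequences: Iterable[str], k: int) -> Set[str]:
    """Collect the distinct k-mers that occur in any of the given sequences."""
    kmers: Set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmers.add(seq[start:start + k])
    return kmers


def kmer_similarity(reads: Iterable[str], assembly: Iterable[str], k: int) -> Tuple[float, float]:
    """Return (% of read k-mers found in the assembly,
               % of assembly k-mers found in the reads)."""
    read_kmers = kmer_set(reads, k)
    assembly_kmers = kmer_set(assembly, k)
    if not read_kmers or not assembly_kmers:
        raise ValueError("both inputs must contain at least one k-mer")
    shared = len(read_kmers & assembly_kmers)
    return 100.0 * shared / len(read_kmers), 100.0 * shared / len(assembly_kmers)


# Toy example with k = 3: the reads hold 5 distinct 3-mers, the contig holds 6,
# and 4 of them are shared.
toy_reads = ["ACGTAC", "GTACGG"]
toy_contigs = ["ACGTACGAT"]
reads_in_asm, asm_in_reads = kmer_similarity(toy_reads, toy_contigs, k=3)
print(f"{reads_in_asm:.2f} / {asm_in_reads:.2f}")  # prints: 80.00 / 66.67
```

The two values differ because each is normalized by a different set size, which is why the table reports them as a pair.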
| Dataset | Assembler | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage ≥ 10× (%) |
|---|---|---|---|---|---|---|---|---|---|
| Yeast S288C | Reference | — | — | — | 38.30 | 99.94 | 99.71 | 73 | 99.95 |
| Yeast S288C | Miniasm | — | — | — | 38.42 | 93.88 | 83.94 | 57 | 82.63 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Apollo | 38.00 | 99.11 | 97.45 | 69 | 94.38 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Racon | 38.26 | 99.51 | 98.64 | 70 | 96.34 |
| Yeast S288C | Miniasm | Minimap2 | PacBio | Quiver | 38.39 | 99.61 | 99.29 | 71 | 98.04 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Apollo | 38.22 | 97.10 | 94.98 | 66 | 90.53 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Pilon | 38.41 | 96.86 | 88.65 | 66 | 87.78 |
| Yeast S288C | Miniasm | Minimap2 | Illumina | Racon | 38.42 | 97.03 | 95.33 | 66 | 91.35 |
| Yeast S288C | Canu | — | — | — | 38.17 | 99.94 | 99.73 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | PacBio | Apollo | 38.17 | 99.94 | 99.73 | 71 | 98.83 |
| Yeast S288C | Canu | Minimap2 | PacBio | Racon | 38.09 | 99.94 | 99.23 | 71 | 98.21 |
| Yeast S288C | Canu | Minimap2 | PacBio | Quiver | 38.17 | 99.94 | 99.74 | 71 | 98.74 |
| Yeast S288C | Canu | Minimap2 | Illumina | Apollo | 38.17 | 99.94 | 99.73 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | Illumina | Pilon | 38.17 | 99.94 | 99.74 | 71 | 98.81 |
| Yeast S288C | Canu | Minimap2 | Illumina | Racon | 38.17 | 99.94 | 99.73 | 71 | 98.81 |

We report the quality assessment of the assemblies as reported by QUAST [6]. QUAST reports the GC content and uses the filtered Illumina reads to measure 1) percentage of the short reads that mapped to the assembly (Mapped Reads), 2) percentage of Properly Paired reads that mapped within the expected range of each other to the assembly, 3) average depth of coverage (Avg. Coverage), and 4) percentage of the bases with at least 10× coverage (Coverage ≥ 10×). We generate the assemblies for the Datasets using the reads sequenced from PacBio. We use Canu and Miniasm assemblers as specified in Assembler. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner to polish the assembly. For the rows that do not specify assembly polishing algorithms, we only report the quality assessment of either the unpolished assembly or the reference.
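As a worked check of the Polishing Score defined for the Yeast S288C performance results (the product of Accuracy and Aligned Bases expressed as a fraction), the short sketch below recomputes the score for three rows whose values are fully reported. The function name `polishing_score` and the row labels are hypothetical; the numbers are copied from the reported results.

```python
def polishing_score(aligned_bases_pct: float, accuracy: float) -> float:
    """Accuracy multiplied by Aligned Bases, with Aligned Bases given in percent."""
    return (aligned_bases_pct / 100.0) * accuracy


# (label, Aligned Bases (%), Accuracy, reported Polishing Score)
reported_rows = [
    ("Miniasm, unpolished", 95.05, 0.8923, 0.8481),
    ("Miniasm + Apollo (PacBio reads)", 98.44, 0.9706, 0.9555),
    ("Miniasm + Racon (PacBio reads)", 99.15, 0.9895, 0.9811),
]

for label, aligned, accuracy, reported in reported_rows:
    computed = polishing_score(aligned, accuracy)
    print(f"{label}: computed {computed:.4f}, reported {reported:.4f}")
```

For example, 0.9844 × 0.9706 ≈ 0.9555, which matches the reported score for the Miniasm assembly polished by Apollo with PacBio reads. Because the score multiplies the two metrics, a polisher has to keep both the aligned fraction and the per-base accuracy high to score well.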
Table S15: K-mer similarity between Illumina reads and the human genome assemblies
| Dataset | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | 21-mer Sim. (%) | 31-mer Sim. (%) | 51-mer Sim. (%) |
|---|---|---|---|---|---|---|
| Human HG002 | Reference | — | — | 98.05 / 87.02 | 96.98 / 84.73 | 93.56 / 80.14 |
| Human HG002 | Minimap2 | PacBio | Apollo | 93.74 / 82.62 | 91.05 / 79.18 | 85.26 / 73.11 |
| Human HG002 | Minimap2 | PacBio | Quiver | ∗ ∗ ∗ … | | |
| … | … | … ×) | Apollo | 54.00 / 43.72 | 45.59 / 36.91 | 36.82 / 30.24 |
| Human HG002 | BWA-MEM | PacBio (9×) | Apollo | 53.97 / 42.76 | 45.61 / 36.10 | 36.95 / 29.66 |
| Human HG002 | Minimap2 | PacBio (9×) | Racon | 48.93 / 37.77 | 39.97 / 31.08 | 31.04 / 24.62 |
| Human HG002 | BWA-MEM | PacBio (9×) | Racon | 46.83 / 34.91 | 37.69 / 28.35 | 28.67 / 22.07 |

We report the pairs of the percentage of 1) k-mers of Illumina reads present in the assembly and 2) k-mers of the assembly present in the Illumina reads (separated by "/") in k-mer Sim. using a fixed k-mer size (i.e., k ∈ {21, 31, 51}). We polish the human genome assembly in Dataset using PacBio or Illumina reads. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the row that does not specify any assembly polishing algorithm, we only report the k-mer similarity between the Illumina set of reads and the unpolished assembly that is already constructed and that we use as reference. ∗ denotes that we polish the assembly contig by contig in these runs and collect the results once all of the contigs are polished separately.

| Dataset | Aligner | Sequencing Tech. of the Reads | Polishing Algorithm | GC (%) | Mapped Reads (%) | Properly Paired (%) | Avg. Coverage | Coverage ≥ 10× (%) |
|---|---|---|---|---|---|---|---|---|
| Human HG002 | — | — | — | 40.86 | 99.92 | 98.35 | 10 | 44.82 |
| Human HG002 | Minimap2 | PacBio | Apollo | 40.81 | 99.91 | 97.75 | 10 | 44.81 |
| Human HG002 | Minimap2 | PacBio | Quiver | 40.84 | 99.92 | 98.21 | 10 | 44.55 |
| Human HG002 | Minimap2 | PacBio | Racon | ∗ ∗ … | | | | |
| … | … | … ×) | Apollo | 40.62 | 99.36 | 83.34 | 10 | 37.17 |
| Human HG002 | BWA-MEM | PacBio (9×) | Apollo | 40.62 | 99.29 | 82.54 | 10 | 36.04 |
| Human HG002 | Minimap2 | PacBio (9×) | Racon | 40.95 | 98.00 | 78.70 | 9 | 33.82 |
| Human HG002 | BWA-MEM | PacBio (9×) | Racon | 40.94 | 97.27 | 76.30 | 9 | 32.07 |

We report the quality assessment of the assemblies as reported by QUAST [6]. QUAST reports the GC content and uses the filtered Illumina reads to measure 1) percentage of the short reads that mapped to the assembly (Mapped Reads), 2) percentage of Properly Paired reads that mapped within the expected range of each other to the assembly, 3) average depth of coverage (Avg. Coverage), and 4) percentage of the bases with at least 10× coverage (Coverage ≥ 10×). We polish the human genome assembly in Dataset using PacBio or Illumina reads. The reads specified under Sequencing Tech. of the Reads are sequenced by the specified sequencing technology and are aligned to the assembly using the Aligner. For the rows that do not specify assembly polishing algorithms, we only report the quality assessment of the reference. ∗ denotes that we polish the assembly contig by contig in these runs and collect the results once all of the contigs are polished separately.

Performance of the Aligners
Table S17 shows the performance of the aligners in terms of the number of alignments that each aligner generates given the assembly and the reads to align, the runtime (wall clock), and the memory requirement.

Table S17: Performance of the aligners

| Dataset for the Assembly | Assembler | Aligner | Platform of the Aligned Reads | Number of Alignments | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|
| E. coli K-12 - ONT | Miniasm | Minimap2 | ONT | 8,095,856 | 3m 30s | 4.88 |
| E. coli K-12 - ONT | Canu | Minimap2 | ONT | 1,662,306 | 39s | 2.10 |
| E. coli K-12 - ONT (30×) | Canu | Minimap2 | ONT (30×) | 170,910 | 6s | 0.60 |
| E. coli O157 - PacBio | Miniasm | Minimap2 | PacBio | 732,397 | 25s | 1.79 |
| E. coli O157 - PacBio | Miniasm | Minimap2 | Illumina | 21,933,051 | 1m 35s | 3.16 |
| E. coli O157 - PacBio | Canu | Minimap2 | PacBio | 741,343 | 22s | 1.80 |
| E. coli O157 - PacBio (30×) | Canu | Minimap2 | PacBio (30×) | 148,241 | 5s | 0.67 |
| E. coli O157 - PacBio (30×) | Canu | Minimap2 | PacBio (30×, Corr) | 137,620 | 3s | 0.47 |
| E. coli O157 - PacBio | Miniasm | BWA-MEM | Illumina | 19,799,002 | 2m 34s | 3.17 |
| E. coli O157 - PacBio | Canu | BWA-MEM | Illumina | 23,328,379 | 1m 16s | 2.89 |
| E. coli O157 - PacBio (30×) | Canu | BWA-MEM | Illumina | 23,326,202 | 1m 20s | 2.96 |
| E. coli O157 - PacBio | Miniasm | pbalign | PacBio | 49,561 | 12m 55s | 6.36 |
| E. coli O157 - PacBio | Canu | pbalign | PacBio | 51,994 | 11m 29s | 6.28 |
We generate the assembly using the reads specified under Dataset for the Assembly. We use Canu [7] and Miniasm [8] assemblers as specified in Assembler. The reads specified under Platform of the Aligned Reads are aligned to the assembly using the Aligner. We use the Minimap2 [9] aligner to align both long and short reads to the assembly and the BWA-MEM [10] aligner to align the short reads to the assembly. We report the performance of the aligners in terms of the number of alignments (Number of Alignments), the runtime (Runtime), and the maximum memory requirement (Memory).

Robustness of Apollo
Tables S18, S19, S20, and S21 show the robustness of Apollo with respect to the parameters that have a direct effect on the machine learning algorithm. In each of the tables, we show that Apollo is robust to different sets of parameters.

Table S18: Apollo's robustness based on the chunk size of the long read and the contig

| Long Read Chunk Size | Contig Chunk Size | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|
Here we divide the long reads and the assembly into smaller chunks. We use E. coli O157 dataset, assembled with Miniasm. We divide long reads into smaller reads with lengths 1000, 5000, and 10000. Similarly, we divide the assembly contigs into smaller contigs with lengths 25000, 50000, and 100000. We align each chunked read to each chunked contig. We report the performance of Apollo given the chunked assembly and chunked reads.
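The chunking step described in the caption above can be sketched as follows. This is a minimal illustration, assuming non-overlapping fixed-length chunks where the final chunk keeps whatever bases remain; the function name `chunk_sequence` and the `_chunk<index>` naming scheme are hypothetical and are not taken from Apollo.

```python
from typing import List, Tuple


def chunk_sequence(name: str, seq: str, chunk_len: int) -> List[Tuple[str, str]]:
    """Split one sequence into consecutive, non-overlapping chunks.

    Returns (chunk name, chunk sequence) pairs. The last chunk may be
    shorter than chunk_len (an assumption of this sketch).
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for index, start in enumerate(range(0, len(seq), chunk_len)):
        chunks.append((f"{name}_chunk{index}", seq[start:start + chunk_len]))
    return chunks


# Toy example: a 2,500-base read split with the smallest read chunk size (1000).
toy_read = "ACGT" * 625
read_chunks = chunk_sequence("read1", toy_read, 1000)
print([(chunk_name, len(chunk)) for chunk_name, chunk in read_chunks])
# prints: [('read1_chunk0', 1000), ('read1_chunk1', 1000), ('read1_chunk2', 500)]
```

The same function would apply to contigs with chunk lengths of 25000, 50000, or 100000; the chunked reads are then aligned to the chunked contigs as described.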
Table S19: Apollo’s robustness based on the maximum deletion and filter size parameters
| Max Deletion (-d) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|

Performance of Apollo with respect to the parameter that defines the maximum number of deletions in one transition (d = 3, d = 5, d = 15). We also adjust the filter size (f = 100, f = 200).

| Max Insertion (-i) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|

Performance of Apollo with respect to the parameter that defines the maximum number of insertion states for each base (i = 1, i = 5, i = 10). We also adjust the filter size (f = 100, f = 200).

Table S21: Apollo's robustness based on the match transition and insertion transition probabilities and the filter size parameters

| Match Transition Probability (-tm) | Insertion Transition Probability (-ti) | Filter Size (-f) | Aligned Bases | Aligned Bases (%) | Accuracy |
|---|---|---|---|---|---|

Performance of Apollo with respect to the parameters that define the match and insertion transition probabilities (tm = 0.… & ti = 0.…, tm = 0.… & ti = 0.…, tm = 0.… & ti = 0.…, tm = 0.… & ti = 0.…). We also adjust the filter size (f = 100, f = 200).

Parameters
We show the parameter settings of the aligners that we used to align the reads to the assembly in Table S22.

Table S22: List of the parameters that are used to align the reads to the assemblies

| Aligner | Parameters |
|---|---|
| BWA-MEM | -t 45 |
| Minimap2 (for PacBio) | -x map-pb -a -t 45 |
| Minimap2 (for ONT) | -x map-ont -a -t 45 |
| Minimap2 (for Illumina) | -a -x sr -t 45 |
| pbalign | --nproc 45 |

References

[1] Can Firtina, Ziv Bar-Joseph, Can Alkan, and A Ercument Cicek. Hercules: a profile HMM-based hybrid error correction algorithm for long reads. Nucleic Acids Research, 46(21):e125–e125, August 2018.
[2] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, October 1998.
[3] L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1–8, 1972.
[4] Donald E. Knuth. Two Notes on Notation. The American Mathematical Monthly, 99(5):403, May 1992.
[5] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, April 1967.
[6] Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8):1072–1075, April 2013.
[7] Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27(5):722–736, May 2017.
[8] Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14):2103–2110, July 2016.
[9] Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18):3094–3100, September 2018.
[10] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform.