Haplotype-resolved de novo assembly with phased assembly graphs
Haoyu Cheng, Gregory T Concepcion, Xiaowen Feng, Haowen Zhang, Heng Li
HHaplotype-resolved de novo assembly with phasedassembly graphs
Haoyu Cheng , Gregory T Concepcion , Xiaowen Feng , Haowen Zhang , and HengLi Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA Pacific Biosciences, Menlo Park, CA, USA College of Computing, Georgia Institute of Technology, Atlanta, GA, USA * To whom correspondence should be addressed: [email protected]
ABSTRACT
Haplotype-resolved de novo assembly is the ultimate solution to the study of sequence variations in a genome. However,existing algorithms either collapse heterozygous alleles into one consensus copy or fail to cleanly separate the haplotypesto produce high-quality phased assemblies. Here we describe hifiasm, a new de novo assembler that takes advantage oflong high-fidelity sequence reads to faithfully represent the haplotype information in a phased assembly graph. Unlike othergraph-based assemblers that only aim to maintain the contiguity of one haplotype, hifiasm strives to preserve the contiguity ofall haplotypes. This feature enables the development of a graph trio binning algorithm that greatly advances over standardtrio binning. On three human and five non-human datasets, including California redwood with a ∼ Introduction
De novo genome assembly is the most comprehensive method that provides unbiased insight to DNA sequences. With therapid advances in long-read sequencing technologies such as Pacfic Biosciences (PacBio) and Oxford Nanopore (ONT), manylong-read assemblers have been developed to tackle this essential computational problem. Most of them collapse differenthomologous haplotypes into a consensus representation with heterozygous alleles frequently switching in the consensus. Thisapproach works well for inbred samples that are nearly homozygous but necessarily misses half of the genetic informationin a diploid genome. To solve this problem, Falcon-Unzip recovers heterozygous alleles by “unzipping” them in an initialcollapsed assembly. It produces a pair of assemblies, one primary assembly representing a mosaic of homologous haplotypes,and one alternate assembly composed of short haplotype-specific contigs (haplotigs) for alleles absent from the primaryassembly. The alternate assembly is often fragmented and does not represent a complete haplotype, making it less usefulin practice. In addition, starting from a collapsed assembly, Falcon-Unzip may not recover highly heterozygous regions notproperly collapsed in the initial assembly. Trio binning addresses these issues by globally partitioning long reads upfront withparental short reads and then performing two separate assemblies on the partitioned reads. This strategy works well for sampleswith high heterozygosity, but for a human sample sequenced with noisy long reads, it only produces fragmented assemblieswith ∼ produced by PacBio has changed the equation. Generated from the consensusof multiple sequences of the same DNA molecule, HiFi reads have a much lower error rate of <1%. With HiFi, standard triobinning can produce contigs of 17 Mb . Recent work relying on Hi-C or Strand-seq read binning can achieve bettercontiguity and phasing accuracy. These pre-binning algorithms all use short k-mers or short reads to partition HiFi reads. Theymay not identify haplotype-specific markers in complex regions and result in wrong read partitions which will negatively affectthe assembly as we will show later. In addition, both Hi-C and Strand-seq binning start with a collapsed assembly and have thesame issues as Falcon-Unzip.In 2012, we reasoned that a perfectly constructed unitig graph with read information is a lossless representation ofsingle-end reads. Because this graph is lossless, we can compress input reads into a unitig graph and perform phasing later.This should maximize the power of long HiFi reads. Developed in parallel to our work, HiCanu follows a similar rationale a r X i v : . [ q - b i o . GN ] A ug CCAACTTT CCCC G
Heterozygous alleleSequencing error Sequencing error1 2 3 4 Input HiFi reads
Phased string graph construction
Phased string graph Haplotype-resolved assembly construction
Complementary data
Primary assembly constructionPrimary assembly
Alternate assembly
Haplotype 1 assembly
12 11 10 9
Haplotype 2 assembly
CAA C C A AC Corrected reads CA Haplotype-aware error correction
Heterozygous allele
Figure 1.
Outline of the hifiasm algorithm.
Reads in orange and blue represent the reads with heterozygous alleles carryinglocal phasing information, while reads in green come from the homozygous regions without any heterozygous alleles. Inphased string graph, a vertex corresponds to the HiFi read with same ID, and an edge between two vertices indicates that theircorresponding reads are overlapped with each other. Hifiasm first performs haplotype-aware error correction to correctsequence errors but keep heterozygous alleles, and then builds phased assembly graph with local phasing information from thecorrected reads. Only the reads coming from the same haplotype are connected in the phased assembly graph. Withcomplementary data providing global phasing information, hifiasm generates a completely phased assembly for each haplotypefrom the graph. Hifiasm also can generate unphased primary assembly only with HiFi reads. This unphased primary assemblyrepresents phased blocks (regions) that are resolvable with HiFi reads, but does not preserve phasing information between twophased blocks.and can produce Falcon-Unzip-style primary/alternate assemblies better than other assemblers especially around segmentalduplications. However, HiCanu only tries to keep the contiguity of one parental haplotype and often breaks the contiguity ofthe other haplotype. When we separate parental haplotypes, these break points will lead to fragmented haplotype-resolvedassemblies. HiCanu is not making use of the full power of HiFi reads.In this article we present hifiasm, a new assembler for HiFi reads that generates a well connected assembly graph andproduces better assemblies in practice. We will first give an overview of the hifiasm algorithm, compare it to other assemblersfor partially phased assemblies and then explain and evaluate the haplotype-resolved assembly algorithm used by hifiasm.
Results
Overview of the hifiasm algorithm
The first few steps of hifiasm broadly resemble the workflow of early long-read assemblers (Fig. 1). Hifiasm performsall-vs-all read overlap alignment and then corrects sequencing errors. Given a target read to be corrected, hifiasm inspects thealignment of reads overlapping with the target read. A position on the target read is said to be informative if there are two typesof A/C/G/T bases (gaps ignored) at the position in the alignment and each type is supported by at least three reads. A read able 1.
Statistics of non-human assemblies
Dataset Assembler Size N50 NG50 Alternate Completeness (asmgene or BUSCO)(Gb) (Mb) (Mb) size (Gb) Complete (%) Duplicated (%)
M. musculus (25 × ) hifiasm 2.606 21.1 20.6 0.087 99.72 0.23HiCanu 2.604 16.4 15.9 0.033 99.70 0.23Peregrine 2.578 17.9 17.0 0.029 99.56 0.21Falcon 2.559 19.3 16.7 0.025 99.49 0.14 Z. mays (22 × ) hifiasm 2.190 36.7 36.7 0.106 99.84 0.18HiCanu 2.183 14.5 14.5 0.284 99.85 0.48Peregrine 2.206 1.8 1.9 0.117 99.79 0.27Falcon 2.132 9.5 9.3 0.016 99.77 0.17 F. × ananassa (36 × ) hifiasm (purge) 0.826 17.8 17.8 0.473 98.51 93.06HiCanu 1.214 8.7 14.8 0.098 97.96 92.87HiCanu (purge) 0.381 10.9 0.931 93.56 41.26Peregrine 0.930 5.5 6.7 0.260 98.33 91.70Falcon 0.971 5.4 7.3 0.213 98.27 92.81 R. muscosa ( ∼ × ) hifiasm (purge) 9.535 9.3 7.588 66.81 1.59Peregrine 9.415 0.9 2.936 66.84 1.72 S. sempervirens ( ∼ × ) hifiasm (purge) 35.624 5.4 16.306 61.82 39.85Peregrine 35.662 0.8 63.20 35.93HiCanu (purge) applies Purge_dups to a HiCanu assembly. Hifiasm (purge) enables the built-in Purge_dups equivalent strategy. The N50/NG50 of an assemblyis defined as the sequence length of the shortest contig at 50% of the total assembly/genome size. To calculate the NG50, a genome size of 2730.9 Mb(AC:GCF_000001635.20), 2182.1 Mb (AC:GCA_902167145.1) and 813.4 Mb is used for M. musculus , Z. mays and F. × ananassa , respectively. Thegenome size is unknown for R. muscosa and
S. sempervirens . The NG50 of HiCanu (Purge) is not available since its size is less than 50% of the genome size.“Alternate size” is the total length of the alternate assembly. The reference-based asmgene method is used to evaluate the gene completeness of M. musculus and
Z. mays which have high-quality reference genomes. For these two samples, “Complete” gives the percentage of single-copy genes in the referencegenome (one unique mapping at ≥
97% identity) that are mapped at ≥
97% identity to the assembly; “Duplicated” gives the percentage of reference single-copygenes that become multi-copy in the assembly. The BUSCO embryophyta dataset is used to evaluate the gene completeness of F. × ananassa and S.sempervirens ; the tetrapoda dataset is used for
R. muscosa . overlapping with the target read is inconsistent with the target if there are informative positions in the overlap and the read isnot identical to the target read across all these positions; accordingly, the overlap between this and the target read is inconsistent.Inconsistent reads theoretically originate from a haplotype different from the target read. Hifiasm only uses consistent reads tocorrect the target read.Hifiasm performs three rounds of error correction by default. It then does overlap alignment again and builds a stringgraph where a vertex is an oriented read and an edge is a consistent overlap. After transitive reduction, a pair of heterozygousalleles will be represented by a “bubble” in the string graph (Fig. 1). No information is lost. If there are no additional data,hifiasm arbitrarily selects one side of each bubble and outputs a primary assembly similar to Falcon-Unzip and HiCanu. For aheterozygous genome, the primary assembly achieved at this step may still contain haplotigs from more than one homologoushaplotypes. With HiCanu, we have to post-process the assembly with purge_dups to remove these redundant haplotigs.Hifiasm natively implements a variant of the purge_dups algorithm to identify and remove such haplotigs. This simplifies theassembly pipeline.If parents of the sample are also sequenced, hifiasm can use k-mer trio binning to label corrected reads in the string graph.In this case, hifiasm effectively discards the maternal unitigs to generate the paternal assembly, and vice versa. This graph-basedtrio binning may go through regions heterozygous in all three samples in the trio and is more robust to the mislabeling of reads.We will explain the advantage of hifiasm binning in a later section. Assembling homozygous non-human genomes
We first evaluated hifiasm v0.7 along with Falcon-Unzip v1.8.1 , Peregrine v0.1.6.1 and HiCanu v2.0 on two inbredsamples including the C57/BL6J strain of M. musculus (mouse) and the B73 strain of
Z. mays (maize). All assemblersproduced long contigs for mouse (Table 1). To evaluate how often assemblers collapse paralogous regions and produce amisassembly, we mapped HiFi reads to each assembly, extracted apparently heterozygous SNPs at high coverage and clusteredthem into longer regions (Online Methods). These regions correspond to collapsed misassemblies. We identified 5 suchmisassemblies in the HiCanu assembly, 17 in hifiasm and more than 100 in both Falcon and Peregrine. HiCanu is the best atthis metric although its contig N50 is the shortest.For the repeat-rich maize genome, hifiasm and HiCanu generated longer contigs and again produced much fewer collapsedmisassemblies (2 for hifiasm and 3 for HiCanu, versus more than 70 misassemblies for Falcon and Peregrine). Hifiasm andHiCanu perform better presumably because they can more effectively resolve repeats by requiring near perfect overlap . able 2. Statistics of human primary assemblies
Dataset Assembly Size NG50 NGA50 QV Multi-copy genes Resolved Gene completeness (asmgene)(Gb) (Mb) (Mb) retained (%) BACs (%) Complete (%) Duplicated (%)CHM13(HiFi 32 × ONT 120 × ) hifiasm 3.043 88.1 65.4 54.3 76.9 95.3 99.14 0.28HiCanu 3.047 76.3 59.4 53.9 76.7 96.5 99.13 0.33Peregrine 2.990 36.5 33.2 43.8 41.4 38.4 98.84 0.26Falcon 2.862 26.3 23.8 50.1 24.6 33.1 98.62 0.11Canu (ONT) 2.992 74.1 60.5 26.6 61.6 92.1 97.79 0.27HG00733(HiFi 33 × ONT 50 × ) hifiasm (purge) 3.039 70.0 56.8 49.8 67.3 83.2 99.09 0.31HiCanu (purge) 2.932 35.2 31.6 50.7 62.4 73.7 97.76 0.33Peregrine 3.035 30.1 30.1 40.5 37.2 38.5 98.70 0.31Falcon 2.861 24.4 23.2 46.3 33.6 38.0 96.51 0.15Canu (ONT) 2.834 40.5 35.1 22.7 22.5 69.3 91.26 0.14HG002(HiFi 36 × ONT 80 × ) hifiasm (purge) 3.063 98.7 65.4 51.4 74.8 99.31 0.35HiCanu (purge) 3.000 44.7 35.9 52.1 67.1 98.97 0.23Peregrine 3.081 33.4 32.5 41.3 42.5 99.14 0.36Falcon 2.955 30.4 29.0 46.7 36.6 99.00 0.20Canu (ONT) 2.831 32.3 30.5 21.9 19.6 88.94 0.21Assemblies of ONT reads were obtained from other publications ; other assemblers use HiFi only. HiCanu and hifiasm were run without duplicationpurging for the homozygous CHM13 cell line, and run with purging for the heterozygous HG00733 and HG002 cell lines. The NGA50 of an assembly isdefined as the length of the correctly aligned block at 50% of the total reference genome size which is assumed to be 3.1 Gb. It is calculated based on theminigraph contig-to-GRCh38 alignment. The “QV” (quality value) of an assembly equals the Phred-scaled contig base error rate measured by comparing31-mers in contigs to 31-mers in short reads from the same sample. Percent “multi-copy genes retained” is reported by asmgene (Online Methods). It is thepercentage of multi-copy genes in GRCh38 (multiple mapping positions at ≥
99% sequence identity) that remain multi-copy in the assembly. A BAC isresolved if 99.5% of its bases can be mapped the assembly. There are 341 BACs for CHM13 and 179 BACs for HG00733. HG002 does not have BAC data.The low gene completeness numbers for ONT assemblies are caused by their low QV.
Assembling heterozygous non-human genomes
Since most practical samples are heterozygous, we next evaluated the assemblers on three heterozygous datasets from F. × ananassa (garden strawberry), R. muscosa (mountain yellow-legged frog) and
S. sempervirens (California redwood). Thesesamples are more challenging to assemble. F. × ananassa has an allopolyploid genome estimated to be 813.4 Mb in size .All assemblers achieve a total assembly of ∼ regards most single-copy genes to be duplicated, consistent with the previous observation . HiCanu assigns most contigs to theprimary. Applying Purge_dups overcompresses the assembly and reduces the BUSCO completeness by 5%. Falcon-Unzipand Peregrine are somewhat between hifiasm and HiCanu. The varying primary assembly sizes highlight the difficulty inassembling polyploid genomes. On the other hand, all HiFi assemblies here have much longer contig N50 than the publishedassembly (>5 Mb vs 580 kb). HiFi enables better assembly. R. muscosa is hard to assemble for its large genome size. We failed to run Falcon-Unzip and HiCanu for this sample usingtheir released versions. Both hifiasm and Peregrine were successful. The N50 of the hifiasm assembly is ten times as long.
S. sempervirens poses an even greater challenge to assembly with a much larger hexaploid genome. Hifiasm took 875 Gbreads as input and produced a 35.6 Gb assembly in 2.5 days over 80 CPU threads using ∼
700 GB memory at the peak. The flowcytometric estimate of the full hexaploid genome is 62.8 Gb in size . Our assembly is about half of that. Peregrine achieved an35.6 Gb assembly as well. Its BUSCO score is 1.4% better than the hifiasm assembly. However, BUSCO completeness may notbe reliable. Peregrine also took 10 days on a computer cluster. It runs slower and its assembly is more fragmented. Hifiasmoverall performs better on large genomes. Primary assembly of human genomes
We next evaluated hifiasm and other assemblers on three human datasets (Table 1). We introduced two new metrics, “multi-copygenes retained” and “resolved BACs” to evaluate how assemblers resolve difficult genomic regions such as long segmentalduplications. If an assembler breaks contigs at such regions or misassemblies the regions, the resulting assembly will lose multi-copy genes and/or leads to fragmented BAC-to-contig alignment. We also used NGA50 to measure misassemblies. However,alignment cannot go through assembly gaps in GRCh38 or long insertions/deletions. NGA50 is often an underestimate.CHM13 is a homozygous cell line, similar to
M. musculus and
Z. mays . Hifiasm delivers a more contiguous assembly.HiCanu resolves a few more BACs, but it produced slightly more collapsed misassemblies (26 for HiCanu vs 20 for hifiasm).The two assemblers are broadly comparable. Both of them are better than Peregrine, Falcon and the Canu ONT assembly onall metrics by a large margin. AC C CA Heterozygous allele3 4 51 2 9 106 HiFi reads A AA CCC
12 13
15 14 17
Mispartitioned read
Haplotype 1 assembly
Haplotype 2 assembly Phased string graph
Mispartitioned read Mispartitioned read
Haplotype 1 assemblyHaplotype 2 assembly
Fixed mispartition
Pre-binning strategy Graph-binning strategy ab c
Figure 2.
Effect of false read binning. (a) a set of reads with global phasing information provided by the complementarydata. Reads in orange and reads in blue are specifically partitioned into haplotype 1 and haplotype 2, respectively. Theremaining reads in green are partitioned into both haplotypes. Read 9 without heterozygous alleles is mispartitioned intohaplotype 2, instead of to both haplotypes. (b)
A pre-binning assembly produced by current methods which independentlyassemble two haplotypes. Haplotype 1 is broken into two contigs due to the mispartition of read 9. (c)
Hifiasm fixes themispartition by the local phasing information in the phased assembly graph. It is able to identify that read 9 does not haveheterozygous alleles, so that read 9 should be partitioned into both haplotypes.For heterozygous human samples HG00733 and HG002, HiCanu produced primary assemblies of >3.5 Gb in size withseveral hundred megabases of heterozygous regions represented twice. We had to run Purge_dups to remove these falselyduplicated regions in the primary assembly. We tried a few Purge_dups settings, including the default, and chose the one thatgave the best primary assembly. Hifiasm can identify and remove falsely duplicated regions by inspecting inconsistent readoverlaps between them. Peregrine, Falcon and Shasta collapse most heterozygous regions during assembly. They do not needadditional tools like Purge_dups, either.For the two heterozygous samples, hifiasm and HiCanu consistently outperform other assemblers. The hifiasm assembly ismore complete and resolves more difficult regions than HiCanu. This difference probably has more to do with the duplicatepurging algorithm than with the capability of the assembler. Nonetheless, this observation suggests it is easier to produce ahigh-quality primary assembly with hifiasm.On running time, hifiasm takes 7–9 wall-clock hours over 48 threads. The peak memory is below 150 GB. Peregrine isabout twice as fast for human assembly but uses more memory. HiCanu is about eight times as slow as hifiasm using the samemachine. Falcon is the slowest. Improving haplotype-resolved assembly
A major issue with trio binning is that a fraction of heterozygous reads cannot be unambiguously partitioned to parentalhaplotypes: if both parents are heterozygous at a locus, a child read will harbor no informative k-mers and cannot be uniquelyassigned to a parental haplotype; if, say, the father is heterozygous at a locus and the mother is homozygous, reads from thematernal haplotype cannot be partitioned, either. With standard trio binning, heterozygous reads that cannot be partitionedwill be used in both parental assemblies. As a result, both alleles may be present in one haplotype assembly and lead to falseduplications. Standard trio binning is unable to cleanly separate the two parental haplotypes.Hifiasm draws power from HiFi read phasing in addition to trio binning. It does not partition reads upfront; it only labelsreads in the string graph. In a long bubble representing a pair of heterozygous alleles, hifiasm may correctly phase it even ifonly a small fraction of reads are correctly labeled. This way hifiasm also rarely puts two alleles in one haplotype assembly.Hi-C or Strand-seq based phasing can unambiguously phase most heterozygous reads and are naturally immune tofalse duplications. They however suffer from another issue shared by standard trio phasing: reads assigned to a wrong parentalhaplotype may break contigs (Fig. 2). By considering HiFi read phasing and the structure of the assembly graph, hifiasm may igure 3.
Example hifiasm and HiCanu assembly graphs.
The graphs were generated from HG002 reads mapped tochr11:19,310,012-21,493,943. Red bars correspond to unitigs matching the maternal haplotype, blue to paternal, grey tohomozygous unitigs present on both parental haplotypes, and pink bars correspond to wrongly phased unitigs that join paternaland maternal haplotypes.
Table 3.
Statistics of haplotype-resolved human assemblies
Dataset Assembly QV NG50 Resolved Switch Hamming FNR FDR Duplicated(Mb) BACs (%) error (%) error (%) (%) (%) genes (%)HG00733 hifiasm (trio) 49.8 35.1 95.5 0.09 0.26 2.74 0.36HiCanu (trio) 47.3 10.2 92.7 0.04 0.04 6.18 1.78Peregrine (trio) 42.2 19.1 39.7 0.10 0.23 12.34 0.29Peregrine (Strand-seq) 45.8 26.6 46.9 0.18 0.72 3.99 0.15HG002 hifiasm (trio) 51.6 41.0 0.79 0.42 0.91 0.27 0.48HiCanu (trio) 48.1 12.4 0.70 0.18 2.38 0.30 1.77Peregrine (trio) 42.7 25.8 0.74 0.19 4.42 4.18 0.35Parental assemblies are merged together for computing QV, NG50 and BACs resolved. Calculating NG50 assumes a diploid human genome size of 6.2 Gb.Phased variants are called with dipcall for each pair of parental assemblies and are compared to HG002 truth variants from GIAB or HG00733 phasedSNPs from HGSVC . Phasing switch error rate: percent adjacent SNP pairs that are wrongly phased. Phasing hamming error rate: percent SNP sites that arewrongly phased. False negative rate (FNR): percent true variants that are missed in the assembly. False discovery rate (FDR): percent assembly-based variantcalls that are not called in the truth data. RTG’s vcfeval is used for estimating variant FNR and FDR for HG002. For HG00733, FNR is estimated atheterozygous SNP sites only; FDR is not available because HGSVC does not provide confident regions. Percent duplicated genes measures the percentage ofsingle-copy genes in GRCh38 that are duplicated in the assembly, averaged between the two parental haplotypes. be able to identify and fix such binning errors.It may be tempted to migrate the hifiasm graph binning algorithm to HiCanu. In practice, however, initially designed toproduce best overlap graphs, HiCanu often breaks bubbles in the string graph and misjoins unitigs from different parentalhaplotypes (Fig. 3). Implementing graph trio binning on top of HiCanu would lead to fragmented assemblies at these brokenbubbles. Graph binning is a unique feature of hifiasm. Haplotype-resolved assembly of heterozygous human genomes
To evaluate how well assemblers resolve both haplotypes, we applied trio binning assembly to HG00733 and HG002. Hifiasmperforms graph trio binning that partitions a diploid assembly graph to generate the final assembly. HiCanu does standard triobinning that partitions HiFi reads upfront and assembles the two parental partitions separately. Peregrine does not nativelysupport trio binning. We feed the HiCanu-partitioned reads to Peregrine for assembly. For comparison, we also acquired aStrand-seq HG00733 assembly that uses the same HiFi reads but are supplemented with additional data types for phasing.On both datasets, trio hifiasm misses fewer variants and emits longer contigs with higher QV and lower variant FDRthan other assembly strategies (Table 3). HiCanu achieves the lowest phasing error rates, but it has the highest level of falseduplications as is measured by “duplicated genes”. The HiCanu contig N50 is also the shortest. This is probably caused bywrongly partitioned reads (Fig. 2) in combination with HiCanu’s strict requirement of exact overlapping. Collapsing inexactoverlaps, Peregrine can remove most false duplications and is more robust to partition errors in certain cases and can achievelonger contigs. However, this comes with the cost of fewer resolved BACs and increased FNR. The Strand-seq assembly alsouses Peregrine. It is affected by false read partitions in the same way. This assembly is not as good as hifiasm on every metric. t is not possible to get a good all-around assembly if we perform separate assemblies on pre-partitioned reads. Discussion
Hifiasm is a fast open-source de novo assembler specifically developed for HiFi reads. It mostly uses exact overlaps to constructthe assembly graph and can separate different alleles or different copies of a segmental duplication involving a single segregatingsite. This greatly enhances its power for resolving near identical, but not exactly identical repeats and segmental duplications.In our evaluation, hifiasm consistently outperforms Falcon and Peregrine which do not take the advantage of exact overlaps.In comparison to HiCanu which is developed in parallel to our work, hifiasm is able to generate a more complete assemblygraph preserving all haplotypes more contiguously. This enables us to implement a graph trio binning algorithm that canproduce a haplotype-resolved assembly tripling the contig N50 of a trio HiCanu assembly. Hifiasm can generate the besthaplotype-resolved human assemblies so far.Our graph binning algorithm can also work with reads labeled by Hi-C or Strand-seq binning that do not require parental data.However, because existing Hi-C or Strand-seq binning algorithms start with a collapsed assembly, they may not work well withhighly heterozygous regions not represented well in the initial assembly. In our view, a better solution to pedigree-free phasedassembly is to map Hi-C or Strand-seq data to the hifiasm assembly graph, group and order unitigs into chromosome-longscaffolds with the graph topology, and then phase heterozygous events along the scaffolds. We envision that haplotype-resolvedassembly will become a common practice for both human and diploid non-human species, though haplotype-resolved assemblymay remain challenging for polyploid plants in the near future.
Acknowledgements
This study was supported by US National Institutes of Health (grant R01HG010040, U01HG010971 and U41HG010972 toH.L.).
Author contributions
H.C. and H.L. designed the algorithm, implemented hifiasm and drafted the manuscript. H.C. benchmarked hifiasm and otherassemblers. G.T.C. ran hifiasm for
S. sempervirens , ran Peregrine for
S. sempervirens and
R. muscosa and ran Falcon-Unzip forall datasets. H.Z. experimented the
R. muscosa assembly with hifiasm and HiCanu. X.F. helped evaluation.
Competing interests
G.T.C. is an employee of Pacific Biosciences. H.L. is a consultant of Integrated DNA Technologies, Inc and on the ScientificAdvisory Boards of Sentieon, Inc, BGI and OrigiMed.
Methods
Haplotype-aware error correction.
The first step of hifiasm is the haplotype-aware error correction. In this step, hifiasmloads all reads into memory, and performs all-vs-all pairwise alignment between them. By utilizing the alignment results,hifiasm is able to do haplotype phasing using heterozygous SNPs. Given a reference read R and its overlaps to related reads, weneed to identify the real SNPs and ignore the sequencing errors. To this end, hifiasm collects all mismatches from the pairwisealignment between R and its overlapping reads. If one mismatch is supported by three overlapping reads, hifiasm takes it as aSNP, otherwise it would be ignored as a sequencing error (Fig. 1). For a read Q overlapped with R , it comes from the samehaplotype of R only if there is no difference on SNP sites between Q and R . To avoid overcorrection and keep the informativeheterozygosity variants carrying haplotype information, hifiasm only uses the reads coming from the same haplotype for errorcorrection. This strategy is also helpful to discard reads coming from the highly repetitive regions since the base differencesbetween these regions are also treated as SNPs. The sequencing errors on each read are then corrected by the consensus methodfrom pairwise alignment . Although in theory the pairwise-alignment-based consensus method is not as good as the traditionalPartial Order Alignment (POA) method when correcting noisy reads, our algorithm should be able to generate comparableresults with substantially less running time for HiFi reads due to their low sequencing error rate.The major bottleneck in this step is the all-vs-all pairwise alignment. In order to accelerate it, existing assemblers usuallyfirst extract a spare sample from each read, and then perform alignment on top of samples instead of the whole sequences .However, the sample-based alignment loses the details of overlaps, while both the phasing and the error correction in hifiasmrequire the highly accurate base-level alignment. Hifiasm adopts the bit-vector algorithm to significantly reduce the alignmenttime. It is able to calculate multiple cells in alignment matrix at once using simple bit operations, while the pairwise base-levelalignment algorithms in current assemblers need to calculate them one-by-one. Moreover, hifiasm further improves the lignment performance by splitting reads into relatively small non-overlapping windows and calculating the alignment ofwindows. Since each window is small enough, we can take advantage of CPU SSE instructions to simultaneously performbit-vector algorithm on multiple windows . In practice, one potential challenge of the window-based strategy is that thealignment results on window extremities might be not reliable. To deal with this challenge, hifiasm re-aligns the subregionsaround the boundary between two adjacent windows. Constructing phased assembly graphs.
After haplotype-aware error correction, most sequencing errors have been removedwhile the informative heterozygous variants are still kept. With nearly error-free reads, hifiasm is able to perform phasingaccurately to determine if one overlap is among the reads coming from different haplotypes (i.e. inconsistent overlap). The nextstep is to build the assembly string graph . In this graph, nodes represent oriented reads and each edge between two nodesrepresents the overlap between the corresponding two reads. Note that only consistent overlaps are used to build the graph.Since hifiasm builds the graph on top of nearly error-free reads and highly accurate haplotype phasing, the produced assemblygraph of hifiasm is simpler and cleaner than those of current assemblers for haploid genomes. However, for diploid genomesor polyploid genomes, its graph becomes more complicated as reads from different haplotypes are clearly separated out byphasing. Fig. 1 gives an example. Since there is a heterozygous allele on reads in orange and blue, hifiasm separates them intotwo groups in which all reads in the same color belong to one group. Only the reads from same group are overlapped with eachother. For reads in green, they are overlapped with the reads in both groups because the overlaps among them are not longenough to cover at least one heterozygous allele. As a result, hifiasm generated a bubble in the assembly graph. Most existingassemblers aim to produce one contiguous contig from the graph (i.e. single path in the graph) as much as possible. They tendto collapse bubbles when building the assembly graph. As a result, they will lose all but one allele in each bubble. In contrast,hifiasm is designed to retain all bubbles on the assembly graph. Owing to the fact that there are still a few errors at the correctedreads, hifiasm adopts a topological-aware graph cleaning strategy to cut too short overlaps and avoid destroying substructuresembedding local phasing information like bubbles. Hifiasm additionally records the polyploid overlaps, which are very helpfulin the following assembly construction steps.
Constructing a primary assembly.
The construction of the primary assembly aims to produce contigs including one set ofhaplotypes but may switch subregions between haplotypes. In other words, each subregion in the primary assembly only comesfrom one haplotype, while the corresponding subregions of other haplotypes are removed as duplications. In this step, mostexisting assemblers follow the “best overlap graph” strategy or its variants . Their key idea is to retain longer overlaps if thereare multiple overlaps to a given read. In contrast, hifiasm produces a primary assembly mainly relied on the graph topologicalstructures and the phasing relationship among different haplotypes. Ideally, the phased assembly graph of hifiasm should bea chain of bubbles for diploid genomes (see Fig. 3). It is easy and reliable to extract primary assembly from such chain ofbubbles by bubble popping . However, there are still tips (i.e. deadend contigs broken in single end) on the assembly graphcaused by broken bubbles due to lack of coverage, phasing errors or unresolvable repeats. To fix this problem, hifiasm proposesa three-stage procedure. First, each bubble in the graph is reduced into a single path using bubble popping. This step removesmost duplicated subregions on different haplotypes without hampering the contiguity of primary assembly. Second, given a tipunitig T that is broken in one end but connected to a unitig C in another end, hifiasm checks if there are other contigs, whichare also connected to C , coming from the different haplotypes of T . If such contigs are identified, hifiasm removes tip T sothat unitig C will become longer. The reason is that for T , its corresponding region from different haplotype has already beenintegrated into the new longer unitig C . Since hifiasm records overlaps between haplotypes (i.e. inconsistent overlaps), it cancheck if two contigs come from different haplotypes. Last, hifiasm uses the “best overlap graph” strategy to deal with a fewremaining unresolvable hard substructures on the assembly graph. In most cases, the graph topological information and thephasing information is more reliable than only keeping the longer overlaps. As a result, hifiasm is able to generate a betterprimary assembly than current assemblers which mainly rely on “best overlap graph” strategy. Constructing a haplotype-resolved assembly.
The phased assembly graph in hifiasm embeds the local phasing informationthat is resolvable with HiFi reads. In this graph, the corresponding node of a homozygous read is at a single path connectingtwo bubbles, while the corresponding node of a heterozygous read is at a bubble (see Fig. 2). Given parental short reads,hifiasm labels child HiFi reads with the existing k-mer based algorithm . When generating a fully phased assembly for onehaplotype, hifiasm drops reads of different haplotypes from the graph, while using the local phasing information in graph tocorrect the mispartition of global phasing. Hifiasm does not drop reads at a single path connecting two bubbles, since theseare homozygous reads that must be contained in both haplotypes. For a bubble in which all reads are heterozygous, hifiasmapplies bubble popping to select a single best path consisting of most reads with the expected haplotype label. If a few reads areassigned false labels by global phasing, they are likely to be corrected by the best path that traverses through them. In addition,instead of dropping any read with non-expected haplotype label, hifiasm drops a contig if the haplotype labels of most reads in t are non-expected. Purging heterozygous duplications.
In the primary assembly construction step, accurately keeping one set of haplotypes ismore challenging for haplotype-resolved assemblers. Although the bubble popping method and the tip removing method ofhifiasm already purge large numbers of duplications from multiple haplotypes, some duplications still remain on the primaryassembly, especially for subregions with a high heterozygosity rate. Existing assemblers postprocess the primary assemblyusing downstream tools like Purge_dups , which identify duplications by inexact all-vs-all contig alignment. If two contigsare overlapping with each other, the overlapped regions between them are duplications. However, inexact contig alignmentmight be not reliable on segmental duplications or repeats, leading to more duplications left or overpurged repetitive regions.To address this duplication challenge, hifiasm re-assembles the contigs by building a string graph regarding contigs as nodes,called a purge graph. Given contig A and contig B , we define A inconsistently overlaps B if there are enough reads of A that areinconsistent overlapped with the reads of B . Note that hifiasm records all inconsistent overlaps among reads in the initial phasedassembly graph construction step by haplotype phasing. In the purge graph of hifiasm, each node is a contig, while an edgebetween two nodes is an inconsistent overlap between their corresponding contigs. Once the graph is built, hifiasm generatesthe non-redundant primary assembly by simple graph cleaning. As a result, the built-in purge duplication step of hifiasm issmoother and more reliable than existing downstream tools. This is because hifiasm identifies duplications from multiplehaplotypes using accurate haplotype phasing of each read, while existing tools mainly rely on inexact contig alignment. Evaluating collapsed misassemblies for inbred samples.
We mapped HiFi reads with minimap2 to each assembly andthen called apparent heterozygous SNPs with htsbox, a fork of samtools. We selected biallelic SNPs such that each allele issupported by d reads where d is set to 75% of the average coverage of the sample. We then hierarchically cluster these apparentSNPs as follows: we merge two adjacent SNP clusters if (1) the minimum distance between them is within 10kb and (2) thedensity of SNPs in the merged cluster is at least 1 per 1kb. A cluster longer than 5kb and consisting of ≥
10 SNPs is identifiedas a collapsed misassembly. Varying the thresholds changes the number of estimated misassemblies but does not alter therelative ranking between assemblers.
Evaluating gene completeness with asmgene.
BUSCO is a popular tool for evaluating gene completeness. It is very helpfulfor new species, but is underpowered for species with high-quality reference genomes. For example, BUSCO reports that thecompleteness of GRCh38 is only 94.8%, even lower than the 95.1% percent completeness of the CHM13 hifiasm assembly.We have also observed inconsistent BUSCO results at the gene prediction stage. In one case, a maize gene Zm00001d004099is mapped perfectly (without mismatches or gaps) to both the Peregrine and hifiasm assemblies. However, based on the122450at3193 gene in the BUSCO catalog, BUSCO predicts two different genes in the two assemblies and thinks hifiasm hasfragmented 122450at3193. Inconsistencies like this make it difficult to compare assemblies of similar quality.In order to quantify completeness more accurately, we used paftools in minimap2 to calculate the asmgene scores. UnlikeBUSCO, asmgene scores are generated using reference genome. It first aligns the cDNAs from EnsEMBL (v99 for human andmouse and plant v47 for maize) to both reference genome and assembly by minimap2, and then select single-copy genes in thereference genome. After that, it compares the alignment results of these genes in the reference genome to those in the assembly.In our experiments, it provided accurate completeness evaluation for M. musculus , Z. mays and human since these species havehigh-quality reference genomes.We also used asmgene to measure the resolution of genes in segmental duplicates. Similarly, the first step is to align thecDNAs to both reference genome and assembly by minimap2. It then selects potential genes in segmental duplications that arealigned to multiple regions of the reference genome, and checks how many these genes were kept multiple times on assembly.
Data availability
All HiFi data were obtained from NCBI Sequence Read Archive (SRA): SRR11606869 for
Z. mays , SRR11606870 for
M. musculus , SRR11606867 for F. × ananassa , SRR11606868 and SRR12048570 for R. muscosa , SRP251156 for
S. sem-pervirens , SRR11292120 through SRR11292123 for CHM13, ERX3831682 for HG00733, and four runs (SRR10382244,SRR10382245, SRR10382248 and SRR10382249) for HG002. For trio binning and computing QV, short reads werealso downloaded: SRR7782677 for HG00733, ERR3241754 for HG00731 (father), ERR3241755 for HG00732 (mother)and SRX1082031 for CHM13. GIAB’s “Homogeneity Run01” short-read runs were used for the HG002 trio. TheseHG002 reads were downsampled to 30-fold coverage. All hifiasm assemblies produced in this work are available atftp://ftp.dfci.harvard.edu/pub/hli/hifiasm/submission/. ode availability
Hifiasm is available at https://github.com/chhylp123/hifiasm.
References Chin, C.-S. et al.
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
Nat. methods , 563 (2013). Berlin, K. et al.
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
Nat Biotechnol , 623–30 (2015). Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
Bioinformatics , 2103–2110(2016). Koren, S. et al.
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
Genome research , 722–736 (2017). Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs.
Nat.biotechnology , 540–546 (2019). Chin, C.-S. & Khalak, A. Human Genome Assembly in 100 Minutes. bioRxiv Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2.
Nat. Methods , 155–158 (2020). Shafin, K. et al.
Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.
Nat. Biotechnol. Chen, Y. et al.
Fast and accurate assembly of nanopore reads via progressive error correction and adaptive read selection. bioRxiv
Chin, C.-S. et al.
Phased diploid genome assembly with single-molecule real-time sequencing.
Nat. methods , 1050(2016). Koren, S. et al.
De novo assembly of haplotype-resolved genomes with trio binning.
Nat. biotechnology , 1174–1182(2018). Wenger, A. M. et al.
Accurate circular consensus long-read sequencing improves variant detection and assembly of ahuman genome.
Nat. biotechnology , 1155–1162 (2019). Garg, S. et al.
Efficient chromosome-scale haplotype-resolved assembly of human genomes. bioRxiv
Porubsky, D. et al.
A fully phased accurate assembly of an individual human genome. bioRxiv
Li, H. Exploring single-sample snp and indel calling with whole-genome de novo assembly.
Bioinformatics , 1838–44(2012). Nurk, S. et al.
HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity longreads. bioRxiv (2020).
Myers, E. W. The fragment assembly string graph.
Bioinformatics , ii79–ii85 (2005). Guan, D. et al.
Identifying and removing haplotypic duplication in primary genome assemblies. bioRxiv
Edger, P. P. et al.
Origin and evolution of the octoploid strawberry genome.
Nat. genetics , 541–547 (2019). Li, H. Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics , 3094–3100 (2018). Hon, T. et al.
Highly accurate long-read HiFi sequencing data for five complex genomes. bioRxiv
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assemblyand annotation completeness with single-copy orthologs.
Bioinformatics , 3210–3212 (2015). Hizume, M., Kondo, T., Shibata, F. & Ishizuka, R. Flow cytometric determination of genome size in the taxodiaceae,cupressaceae sensu stricto and sciadopityaceae.
CYTOLOGIA , 307–311 (2001). Miga, K. H. et al.
Telomere-to-telomere assembly of a complete human X chromosome.
BioRxiv
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs. arXiv preprint arXiv:2003.06079 (2020).
Li, H. et al.
A synthetic-diploid benchmark for accurate variant-calling evaluation.
Nat Methods , 595–597 (2018). Zook, J. M. et al.
An open resource for accurately benchmarking small variant and reference calls.
Nat. biotechnology ,561–566 (2019). Chaisson, M. J. P. et al.
Multi-platform discovery of haplotype-resolved structural variation in human genomes.
NatCommun , 1784 (2019). Cleary, J. G. et al.
Comparing variant call files for performance benchmarking of next-generation sequencing variantcalling pipelines. bioRxiv
Lee, C. Generating consensus sequences from partial order multiple sequence alignment graphs.
Bioinformatics ,999–1008 (2003). Myers, G. A fast bit-vector algorithm for approximate string matching based on dynamic programming.
J. ACM (JACM) , 395–415 (1999). Cheng, H., Jiang, H., Yang, J., Xu, Y. & Shang, Y. BitMapper: an efficient all-mapper based on bit-vector computing.
BMC bioinformatics , 192 (2015). Miller, J. R. et al.
Aggressive assembly of pyrosequencing reads with mates.
Bioinformatics , 2818–2824 (2008). upplementary Material for “Haplotype-resolved de novo assembly with phased assembly graphs” S1 Software commands
S1.1 Hifiasm
To produce primary assemblies of homozygous samples (
M. musculus , Z. mays and CHM13), hifiasm (ver-sion 0.7) was run with the following command which does not purge haplotig duplications: hifiasm -o
For heterozygous samples, hifiasm was run with the following command: hifiasm -o
We added ‘ -D10 ’ for the octoploid F. × ananassa because the default k-mer cutoff seems too low: hifiasm -o
S1.2 Falcon-Unzip
Falcon-kit (version 1.8.1) was run with the following HiFi-specific options: length cutoff pr = 8000ovlp daligner option = -k24 -h1024 -e.98 -l1500 -s100ovlp HPCdaligner option = -v -B128 -M24ovlp DBsplit option = -s400overlap filtering setting = --max-diff 200 --max-cov 200 --min-cov 2 --n-core 24 --min-idt98 --ignore-indels
Falcon-unzip-kit (version 1.3.7) was run with default options.
S1.3 HiCanu
For primary assembly, HiCanu (version 2.0) was run with the following command line: canu -p asm -d
The contigs labeled by ‘ suggestedBubbles=yes ’ were removed from the primary assembly. For trio-binning assembly, we ran HiCanu in two steps as recommended. We partitioned the HiFi reads by parentalshort reads with the following command: canu -haplotype -p asm -d
Note that ‘ -pacbio-raw ’ was used to partition HiFi reads followed the document of HiCanu. We thenperform HiCanu assemblies on partitioned reads. 1 a r X i v : . [ q - b i o . GN ] A ug For primary assembly, Peregrine (version 0.1.6.1) was run with the following command, where 48 is thenumber of threads in use: docker run -it -v
S1.5 Purge dups
Purge dups (version 1.2.3) was used to postprocess the output primary assemblies of HiCanu for all het-erozygous samples. The commands are as follows: minimap2 -I6G -xmap-pb
Since running Purge Dups in default cannot produce primary assembly of HiCanu with right size for HG002,we manually adjusted the cutoffs thresholds of Purge Dups as follows “ ”. S1.6 Running asmgene
We aligned the cDNAs to the reference genome and contigs by minimap2 r974 and evaluated the genecompleteness with paftools.js from the minimap2 package: minimap2 -cxsplice:hq -t
We set the sequence identity threshold to be 97% with ‘ -i.97 ’ to tolerate low per-base accuracy of ONTassemblies. For trio binning assemblies, we added option ‘ -a ’ to evaluate genes mapped to the autosomesonly. When evaluating multi-copy genes retained in an assembly, we replaced ‘ -i.97 ’ to ‘ -i.99 ’ to in-crease the resolution. S1.7 Computing NGA50
We used minigraph (version 0.10-dirty-r361) and paftools (version 2.17-r974-dirty) to calculate the NGA50of each asssembly: minigraph -xasm -K1.9g --show-unmap=yes -t
In comparison to minimap2, minigraph tends to generate longer alignments and is more robust to highlyvariable regions.
S1.8 BUSCO
BUSCO (version 3.0.2) was used with the following command: python3 run BUSCO.py -i
R. muscosa and set to embryophyta for F. × ananassa and S. sempervirens . 2
The resolution of BAC for different assemblies was evaluated using the pipeline at: https://github.com/skoren/bacValidation , except that we added option ‘ -I6g ’ to minimap2. The BAC libraries forCHM13 and HG00733 can be found at and , re-spectively.
S1.10 Running yak evaluation
We used yak (version r55) to measure the per-base consensus accuracy (QV), the switch error rate and thehamming error rate. For QV evaluation, we first built the index for the short reads coming from the samesample: yak count -b37 -t
To evaluate the switch error rate and the hamming error rate, we first built the indexes from the paternalshort reads as in section S1.1 and then estimate k-mer based error rates as follows: yak trioeval -t
S1.11 Dipcall
For the male sample HG002, we ran dipcall (version 0.1) as follows: dipcall.kit/run-dipcall -x dipcall.kit/hs37d5.PAR.bed
For the female sample HG00733, we removed option ‘ -x ’. We used the GRCh37 variant of ‘ hs37d5.fa ’here because GIAB works best with hs37d5. S1.12 Evaluating collapsed misassemblies for inbred samples
We used scripts at: https://github.com/lh3/CHM-eval/blob/master/misc/clustreg.js , and https://github.com/lh3/CHM-eval/blob/master/misc/select-collapse-het.js . The commands areas follows: minimap2 -axasm20 -t