[PDF] The design and construction of reference pangenome graphs

Abstract

The recent advances in sequencing technologies enables the assembly of individual genomes to the reference quality. How to integrate multiple genomes from the same species and to make the integrated representation accessible to biologists remain an open challenge. Here we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implemented our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.

Full PDF

LLi et al.

METHOD

The design and construction of referencepangenome graphs

Heng Li , Xiaowen Feng and Chong Chu Abstract

The recent advances in sequencing technologies enables the assembly of individual genomes to the referencequality. How to integrate multiple genomes from the same species and to make the integrated representationaccessible to biologists remain an open challenge. Here we propose a graph-based data model and associatedformats to represent multiple genomes while preserving the coordinate of the linear reference genome. Weimplemented our ideas in the minigraph toolkit and demonstrate that we can eﬃciently construct a pangenomegraph and compactly encode tens of thousands of structural variants missing from the current referencegenome.

Keywords: bioinformatics; genomics; pangenome

Background

The human reference genome is a fundamental re-source for human genetics and biomedical research.The primary sequences of the reference genome GRCh38 [1]are a mosaic of haplotypes with each haplotype seg-ment derived from a single human individual. Theycannot represent the genetic diversity in human pop-ulations and as a result, each individual may carrythousands of large germline variants absent from thereference genome [2]. Some of these variants are likelyassociated with phenotype [3] but are often missedor misinterpreted when we map sequence data toGRCh38, in particular with short reads [4]. Thisunder-representation of genetic diversity may becomea limiting factor in our understanding of genetic vari-ations.Meanwhile, the advances in long-read sequencingtechnologies make it possible to assemble a human in-dividual to a quality comparable to GRCh38 [1, 5].There are already a dozen of high-quality human as-semblies available in GenBank [6]. Properly integrat-ing these genomes into a reference pangenome , whichrefers to a collection of genomes [7], would potentiallyaddress the issues with a single linear reference.A straightforward way to represent a pangenome isto store unaligned genomes in a full-text index that * Correspondence: [email protected] Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA02215, USA Department of Biomedical Informatics, Harvard Medical School, Boston,MA 02215, USAFull list of author information is available at the end of the article compresses redundancies in sequences identical be-tween individuals [8, 9, 10]. We may retrieve individualgenomes from the index, inspect the k-mer spectrumand test the presence of k-mers using standard tech-niques. In principle, it is also possible to apply canon-ical read alignment algorithms to map sequences tothe collection, but in practice, the redundant hits tomultiple genomes will confuse downstream mapping-based analyses [11]. It is not clear how to resolve thesemultiple mappings.The other class of methods encodes multiple genomesinto a sequence graph, usually by collapsing identicalor similar sequences between genomes onto a singlerepresentative sequence. The results in a pangenomegraph . A pangenome graph is a powerful tool to iden-tify core genome, the part of a genome or gene set thatis shared across the majority of the strains or relatedspecies in a clade [12]. A common way to constructa basic pangenome graph is to generate a compactedde Bruijn graph (cDBG) [13, 14, 15, 16, 17, 18, 19]from a set of genomes. Basic cDBG does not keepsample information. [20] proposed colored cDBG witheach color represents a sample or a population. ColoredcDBG can be constructed eﬃciently [21, 22]. However,a colored cDBG discards the chromosomal coordinateand thus disallows the mapping of genomic features.It often includes connections absent from the inputgenomes and thus encodes sequences more than theinput. A colored cDBG cannot serve as a reference pangenome graph, either. deBGA [23] addresses theissue by labeling each unitig with its possibly multiple a r X i v : . [ q - b i o . GN ] M a r i et al. Page 2 of 12 locations in the input genome(s). Puﬀerﬁsh [24] fur-ther reduces its space requirement. Nonetheless, givenhundreds of human genomes, there will be many morevertices in the graph and most vertices are associatedwith hundreds of labels. Whether deBGA and puﬀer-ﬁsh can scale to such datasets remains an open ques-tion. GBWT [25] provides another practical solution tostorage and indexing, but no existing tools can prac-tically construct a cDBG for many human genomes inthe GBWT representation.In addition to cDBG, we can derive a referencepangenome graph from a single linear multi-sequencealignment (MSA) [26, 27]. It has been used for HLAtyping but is not applicable to whole chromosomeswhen they cannot be included in a single linear MSA.The third and possibly the most popular approachto reference graph generation is to call variants fromother sources and then incorporate these variants, of-ten in the VCF format [28], into the reference genomeas alternative paths [29, 30, 31, 32, 33, 34]. However,because VCF does not deﬁne coordinates on inser-tions, this approach cannot properly encode variationson long insertions and is therefore limited to simplevariations. There are no satisfactory solutions to theconstruction of reference pangenome graphs.In this article, we introduce the reference GraphicalFragment Assembly (rGFA) format to model referencepangenome graphs. We propose and demonstrate anincremental procedure to construct graphs under thismodel. The resulting graphs encode structural varia-tions (SVs) of length 100bp or longer without hap-lotype information. Our implementation, minigraph(https://github.com/lh3/minigraph), can construct apangenome graph from twenty human assemblies inthree hours.

Results

We will ﬁrst describe a data model for referencepangenome graphs, which establishes the foundationof this article. We will then present a new sequence-to-graph mapper, minigraph, and show how this map-per incrementally constructs a pangenome graph. Wewill demonstrate the utility of pangenome graphs witha human graph generated from twenty human haplo-types and a primate graph generated from four species.

Modeling reference pangenome graphs

Sequence graphs

There are several equivalent ways to deﬁne a sequencegraph. In this article, a sequence graph G ( V, E ) is abidirected graph. Each vertex v ∈ V is associatedwith a DNA sequence; each edge e ∈ E has two direc-tions, one for each endpoint, which leads to four types S s1 CTGAA SN:Z:chr1 SO:i:0 SR:i:0S s2 ACG SN:Z:chr1 SO:i:5 SR:i:0S s3 TGGC SN:Z:chr1 SO:i:8 SR:i:0S s4 TGTGA SN:Z:chr1 SO:i:12 SR:i:0S s5 TTTC SN:Z:foo SO:i:8 SR:i:1S s6 CTGA SN:Z:foo SO:i:12 SR:i:1S s7 GTTAC SN:Z:bar SO:i:5 SR:i:2L s1 + s2 + 0M SR:i:0L s2 + s3 + 0M SR:i:0L s3 + s4 + 0M SR:i:0L s2 + s5 + 0M SR:i:1L s5 + s6 + 0M SR:i:1L s6 + s4 + 0M SR:i:1L s1 + s7 - 0M SR:i:2L s7 - s6 + 0M SR:i:2 s7s1 s2 s3 s4s6s5 a b read1 6 0 6 + >s2>s3 7 1 7 6 6 60 cg:Z:6Mread2 7 0 7 + >s2>s5 11 1 8 7 7 60 cg:Z:7M c read1 6 0 6 + chr1 17 7 13 6 6 60 cg:Z:6Mread2 7 0 7 + >chr1:5-8>foo:8-16 11 1 8 7 7 60 cg:Z:7M d Figure 1 Example rGFA and GAF formats. (a)

ExamplerGFA format. rGFA-speciﬁc tags include SN, name of thestable sequence from which the vertex is derived; SO, oﬀset onthe stable sequence; SR, rank: 0 if the vertex or edge is on thelinear reference; > (b) Correspondingsequence graph. Each thick arrow represents an oriented DNAsequence. (c)

Example GAF format, using the segmentcoordinate, for reads “

GTGGCT ” and “

CGTTTCC ” mapped to thegraph. (d)

Equivalent GAF format using the stable coordinate. of edges: forward-forward, reverse-forward, forward-reverse and reverse-reverse. The directions on an edgedictate how a sequence is spelled from a walk/pathin the graph. Common assembly graphs, such as theoverlap graph, string graph and de Bruijn graph canall be formulated as sequence graphs.The Graphical Fragment Assembly (GFA) format [35]describes sequence graphs. The core of GFA is deﬁnedby the following grammar: <- ( | )+ <- ‘S’ <- ‘L’ [+-] [+-]

A line starting with letter “ S ” corresponds to a vertexand a line starting with “ L ” corresponds to a bidirectededge. In a de Bruijn graph, we often attach sequencesto edges instead of vertices [36, 37]. To avoid the con-fusion, in this article, we also call a vertex as a segment and call an edge as a link , following the GFA termi-nology. Fig. 1a shows an example GFA that encodesFig. 1b.A sequence graph in the GFA format natively deﬁnesa segment coordinate system where each base in thegraph is uniquely indexed by a 2-tuple (segId , segOﬀset).For example, in Fig 1a, the base at position (s2 ,

2) is“ G ”. A major problem with this coordinate is thatit is decoupled from linear annotations and is sensi-tive to graph transformations. For example, if we splita segment into two connected segments, the set ofsequences spelled from the graph remains the same, i et al. Page 3 of 12

Table 1

The Graphical mApping Format (GAF)Col Type Description1 string Query sequence name2 int Query sequence length3 int Query start coordinate (0-based; closed)4 int Query end coordinate (0-based; open)5 char Strand relative to col. 66 string Graph path matching regular expression /([><][^\s><]+(:\d+-\d+)?)+|([^\s><]+)/ but the segment coordinates will be changed. Due tothe instability of segment coordinate, a basic sequencegraph is inadequate for a reference graph.

Reference pangenome graphs

We propose the reference GFA (rGFA) format to en-code reference pangenome graphs. rGFA is an ex-tension to GFA with three additional tags that in-dicate the origin of a segment from linear genomes(Fig. 1a). This simple addition gives us a uniquestable coordinate system as an extension to the lin-ear reference coordinate (e.g. GRCh38). We can pin-point a position such as “ chr1:9 ” in the graph andmap existing annotations onto the graph. We canalso report a path or walk in the stable coordinate.For example, path “ s1 → s2 → s3 ” unambiguously cor-responds to “ chr1:0-5 → chr1:5-8 → chr1:8-12 ” or simply“ chr1:0-12 ” if we merge adjacent coordinate; similarly,“ s1 → s2 → s5 → s6 ” corresponds to “ chr1:0-8 → foo:8-16 ”.We will formally describe the path format when intro-ducing the GAF format in the next section.In rGFA, each segment is associated with one origin.This apparently trivial requirement in fact imposes astrong restriction on the types of graphs rGFA can en-code: it forbids the collapse of diﬀerent regions fromone sequence, which would often happen in a cDBG.We consider this restriction an advantage of rGFA be-cause it requires the graph to have a “linear” ﬂavorintuitively and simpliﬁes the data structure to storethe graph.For simplicity, rGFA disallows overlaps betweenedges and forbids multiple edges (more than one edgesbetween the same pair of vertices). These two restric-tions help to avoid ambiguity and reduce the complex-ity in implementation. They are not strictly necessaryin theory. The Graphical mApping Format (GAF)

As there are no text formats for sequence-to-graphalignment, we propose a new Graphical mApping For-mat (GAF) by extending the Pairwise mApping For-

SeedsSeeding &linear chaining ReferencegraphQuery sequenceLocal graph alignmentbetween linear chainsGraph chaining Sort by query position Align asm 2to graph 1Constructgraph 1/2Align asm 3to graph 1/2Constructgraph 1/2/3Graph 1(asm 1):Coarsegraph 1/2:Coarsegraph 1/2/3: from asm 2from asm 1 a b

Figure 2 Minigraph algorithms. (a)

Diagram of theminigraph mapping algorithm. Minigraph seeds alignmentswith minimizers, ﬁnds good enough linear chains, connectsthem in the graph and seeks the most weighted path as agraph chain. (b)

Diagram of incremental graph construction.A graph is iteratively constructed by mapping each assemblyto an existing graph and augmenting the graph with longpoorly mapped sequences in the assembly. mat (PAF) [35]. GAF is TAB-delimited with each col-umn deﬁned in Table 1. Column 6 encodes a path onthe graph. It follows the formal grammar below: <- | + <- (‘>’ | ‘<’) ( | ) <- ‘:’ ‘-’

In this grammar, is a segment identiﬁer onan S-line in rGFA; is a stable sequencename at the SN tag on the corresponding S-line. Col-umn 6 can be either a path in the segment coordinate(Fig. 1c) or an equivalent path in the stable coordinate(Fig. 1d). We can merge adjacent stable coordinates ifthe two segments are originated from the same sta-ble sequence and the end oﬀset of the ﬁrst segmentis equal to the start oﬀset of the second segment. Forexample, “ >chr1:0-5>chr1:5-8 ” can be simpliﬁed to“ >chr1:0-8 ”. Furthermore, if a path in column 6 isderived from one reference sequence, we recommendto replace it with the entire reference path on the for-ward orientation (e.g. see “read1” in Fig. 1d). Withthis convention, a GAF line is reduced to PAF for asequence mapped to a reference sequence. Similar toPAF, GAF also allows optional tags in the SAM-likeformat. Base alignment is kept at the cg tag.Minigraph produces GAF in both the segment andthe stable coordinate. GraphAligner [38] producesGAF in the segment coordinate only, which can beconverted to the stable coordinate. i et al. Page 4 of 12

Sequence-to-graph mapping

Our incremental graph construction algorithm relieson genome-to-graph alignment (Fig. 2b). As existingsequence-to-graph aligners [31, 38] do not work withchromosome-long query sequences, we adapted min-imap2 [39] for our purpose and implemented minigraph(Fig. 2a). Brieﬂy, minigraph uses a minimap2-like al-gorithm to ﬁnd local hits to segments in the graph,ignoring the graph topology. It then chains these lo-cal hits if they are connected on the graph, possiblythrough cycles. This gives the approximate mappinglocations. Minigraph does not perform base-level align-ment. This is because the graph we construct encodesSVs and rarely contains paths similar at the base level.The best mapping is often clear without base align-ment.To evaluate the accuracy of minigraph mapping,we simulated PacBio reads from GRCh38 with PB-SIM [40] and mapped them to the graph we con-structed in the next section. Table 2 compares theperformance of minigraph and GraphAligner [38]v1.0.10 on 68,857 simulated reads mapped over 8 CPUthreads. Minigraph is faster than GraphAligner anduses less memory, partly because minigraph does notperform base alignment.As is shown in Table 2, minigraph is more accu-rate than GraphAligner. This is counter-intuitive giventhat GraphAligner does base alignment. Close inspec-tion reveals that most mismapped reads by minigraphare mapped to the correct genomic loci but wronggraph paths. On the contrary, most mismapped readsby GraphAligner are mapped to wrong genomic loci.This suggests minigraph is better at ﬁnding approx-imate mapping locations but GraphAligner is betterat disambiguating similar graph paths. Combining thestrength of both could lead to a better graph map-per. We do plan to implement base-level alignment inminigraph in future.We have also tried vg v1.21.0. It indexed the samegraph in 14.7 wall-clock hours and mapped the simu-lated reads in 1.8 hours over 8 threads, tens of timesslower than minigraph and GraphAligner. However, noreads are mapped in the output. We have not been ableto make vg work with our data.

Generating pangenome graphs

Fig. 2b shows how minigraph constructs a pangenomegraph (see Methods for details). This procedure is sim-

Table 2

Performance of sequence-to-graph mappingminigraph GraphAlignerIndexing time (wall-clock sec) 100 589Mapping time (wall-clock sec) 79 140Peak RAM (GB) 19.5 27.2Percent unmapped reads 0.5% 0%Percent wrong mappings 1.7% 4.6%

Table 3

Assemblies used for graph constructionName Species Population Accession/SourceCHM1 Human N/A GCA 001297185.1CHM13 Human N/A GCA 000983455.1NA12878 Human European [43], phasedNA24385 Human Jewish [43], phasedPGP1 Human N/A [43], phasedNA19240 Human African GCA 001524155.4HG00514 Human East Asian GCA 002180035.3HG01352 Human American GCA 002209525.2NA19434 Human African GCA 002872155.1HG02818 Human African GCA 003574075.1HG03486 Human African GCA 003086635.1HG03807 Human South Asian GCA 003601015.1HG00733 Human American GCA 002208065.1HG02059 Human East Asian GCA 003070785.1HG00268 Human European GCA 008065235.1HG04217 Human South Asian GCA 007821485.1AK1 Human East Asian GCA 001750385.1Clint Chimpanzee GCA 002880755.3Susie Gorilla GCA 900006655.3Kamilah Gorilla GCA 008122165.1Susie Orangutan GCA 002880775.3 ilar to multiple sequence alignment via partial ordergraph [41] except that minigraph works with cyclicgraphs and ignores small variants. Minigraph only con-siders SVs of 100bp–100kb in length and ignores SVsin alignments shorter than 100kb. For each input as-sembly, it ﬁlters out regions covered by two or moreprimary alignments longer than 20kb in the assembly.This ﬁlter avoids paralogous regions in a sample andguarantees that graphs generated by minigraph can bemodeled by rGFA.As a sanity check, we compared minigraph to dip-call (https://github.com/lh3/dipcall) on calling SVs100bp or longer from a synthetic diploid sample com-posed of CHM1 and CHM13 [4]. Given two SV callsets A and B , we say a call in A is missed in callset B ifthere are no calls in B within 1000bp from the callin A . With this criterion, 2.7% of 14,792 SVs calledby dipcall are missed by minigraph; 6.0% of 14,932minigraph SVs are missed by dipcall. We manually in-spected tens of diﬀerences in IGV [42] and identiﬁedtwo causes. First, an INDEL longer than 100bp calledby one caller may be split into two shorter INDELs bythe other caller. There are often more than one smallerSVs around a missed SV call. Second, dipcall skips re-gions involving high density of SNPs or involving bothlong insertions and long deletions, but minigraph con-nects these events and calls SVs in such regions. Ittends to call more SVs. Overall, we believe minigraphand dipcall found similar sets of SVs. A human pangenome graph

Starting with GRCh38, we constructed a humanpangenome graph from 20 human haplotypes orhaplotype-collapsed assemblies (Table 3). It took min-igraph 2.7 wall-clock hours over 24 CPU threads to i et al. Page 5 of 12 A l u L 1 S V A E R V M i x e d - i n t e r . S a t e lli t e V N T R S T R O t h e r- L C R M i x e d -r e p e a t P a r t i a l -r e p e a t N o n -r e p - u n i q N o n -r e p - d u p v a r i a t i on s i n t he hu m an g r aph chr1 VNTRchr2 Interspersed repeatchr3 Partial/non-repeatchr4chr5chr6chr7chr8chr9chr10chr11chr12chr13chr14chr15chr16chr17chr18chr19chr20chr21chr22chrX a b A l u L 1 S V A E R V M i x e d - i n t e r . S a t e lli t e V N T R S T R O t h e r- L C R M i x e d -r e p e a t P a r t i a l -r e p e a t N o n -r e p - u n i q N o n -r e p - d u p v a r i a t i on s i n t he g r ea t ape g r aph A l u L 1 S V A E R V M i x e d - i n t e r . S a t e lli t e V N T R S T R O t h e r- L C R M i x e d -r e p e a t P a r t i a l -r e p e a t N o n -r e p - u n i q N o n -r e p - d u p b i a ll e li c v a r i a t i on s GRCh38 with the long alleleGRCh38 with the short allele A l u L 1 S V A E R V M i x e d - i n t e r . S a t e lli t e V N T R S T R O t h e r- L C R M i x e d -r e p e a t P a r t i a l -r e p e a t N o n -r e p - u n i q N o n -r e p - d u p hu m an - s pe c i f i c b i a ll e li c v a r i a t i on s GRCh38 with the long alleleGRCh38 with the short allele c d A l u L 1 S V A E R V M i x e d - i n t e r . S a t e lli t e V N T R S T R O t h e r- L C R M i x e d -r e p e a t P a r t i a l N o n -r e p - u n i q N o n -r e p - d u p Leng t h o f t he l onge s t a ll e l e fe Figure 3 Characteristics of the human and the great ape graphs. (a)

Human variations stratiﬁed by repeat class and by thenumber of alleles of each variation. The repeat annotation was obtained from the longest allele of each variation. VNTR:variable-number tandem repeat, a tandem repeat with the unit motif length ≥ ≤ ≥ (b) Great ape variations stratiﬁed by repeat class and by the number of alleles. (c)

Human biallelic variations stratiﬁed by repeat classand by insertion to/deletion from GRCh38. Both alleles are required to be covered in all assemblies. (d)

Human-speciﬁc biallelicvariations stratiﬁed by repeat class and by insertion to/deletion from GRCh38. Red bars correspond to insertions to the humanlineage. (e)

Distribution of diﬀerent types of human variations along chromosomes. (f)

Boxplot of the longest allele length in eachrepeat class. Outliers are omitted for the clarity of the ﬁgure.i et al.

Page 6 of 12

Figure 4 IGV screenshot of a region enriched with longinsertions.

Numbers on wide purple bars indicate insertionlengths. CLR: PacBio noisy continuous long reads. HiFi:PacBio high-ﬁdelity reads. generate this graph. The peak memory is 98.1GB.The resulting graph consists of 148,618 segments and214,995 links. It contains 37,332 variations, where a variation denotes a minimal subgraph that has a sin-gle source and a single sink with both segments comingfrom GRCh38. A path through the bubble between thesource and and the sink represents an allele .Variations in the human graph are enriched withAlus and VNTRs (Fig. 3a). While interspersed repeatsare about evenly distributed along chromosomes ex-cept in the pseudoautosomal regions (Fig. 3e), VN-TRs are enriched towards telomeres [6]. It is worthnoting the density of minisatellites is also higher insubtelomeres. If we normalize the density of VNTRsin the pangenome graph by the density of minisatel-lites in GRCh38, the enrichment of VNTRs towardstelomeres is still visible but becomes less prominent.At the same time, repeat-less variations are also en-riched towards the ends of chromosomes (green areasin Fig. 3e), suggesting subtelomeres tend to harborSVs anyway. We also identiﬁed 85 processed pseudo-genes among these variations.Another noticeable feature of VNTRs is that overhalf of VNTR variations are multiallelic (Fig. 3a). Fig. 4 shows a multi-allelic region composed of VN-TRs. We can see many insertions of diﬀerent lengths.The two diﬀerent NA12878 assemblies also disagreewith each other, which we often see around otherVNTR loci in NA12878 as well. We have not inspectedraw reads in this particular example, but we tend tobelieve the disagreement is caused by local misassem-blies rather than somatic mutations. In addition, dueto the multiallelic nature of such VNTRs, the twohaplotypes in a human individual are often diﬀerent.Assemblies mixing the two haplotypes (aka collapsedassemblies) may have more troubles in these regions.Multiallelic VNTRs are hard to assemble correctly.Multiallelic VNTRs are also hard to align and to call.In Fig. 4, the insertion positions are often diﬀerent,which could be caused by a few mutations or sequenc-ing errors. A naive alignment-based SV caller wouldcall a dozen of low-frequency insertions in this region,which does not reﬂect these correlated events. With-out base-level alignment, minigraph may have moretroubles with obtaining the optimal alignment in thesecomplex VNTR regions. Improved data quality, assem-bly algorithms and graph mapping algorithms are re-quired to investigate VNTR regions in detail.

A great ape pangenome graph

We also constructed a great ape pangenome graphfrom GRCh38, one chimpanzee, two gorillas and oneorangutan (Table 3). This graph contains 206,452variations, over four times more than the humangraph. About half of variations are originated fromorangutan, the species most distant from human.In the great ape graph, the L1-to-Alu ratio is closeto 1:1, much higher than the ratio in the humangraph (Fig. 3b vs Fig. 3a). This is perhaps correlatedwith the elevated L1 activity in great apes [44]. Ofretrotransposon-related variations speciﬁc to the hu-man lineage, the overwhelming majority are insertions(Fig. 3d), which is expected as transpositions lead toinsertions only. Most human-speciﬁc Alu deletions areincomplete and involve ancient Alu subfamilies. Theyare likely genomic deletions that happen to hit Alus.In contrast, the majority of “partial-repeats” are dele-tions from the human lineage. Two thirds of autosomalinsertions in this category are segmental duplicationsin GRCh38. In all, minigraph is an eﬃcient tool tostudy closely related species.

Blacklist regions from human pangenome graphs

The human pangenome graph eﬀectively encodes SVs ≥ i et al. Page 7 of 12

We constructed a human pangenome graph exclud-ing CHM1 and CHM13, the two samples used in theSynDip benchmark [4], and generated regions aroundvariations (see Methods), which we call as blacklist re-gions , following the rationale in [45]. Blacklist regionsis totaled 29.2Mb in length, intersecting 0.7% of con-ﬁdent regions in SynDip [4]; 0.7% of truth SNPs arecontained in blacklist regions – true SNPs are not en-riched in blacklist regions.We mapped short reads used in [4] with minimap2and called variants with GATK v4.1.2 [46]. This callsetcontains 32,879 false positive SNPs, 21% of which fallin blacklist regions – false SNP calls are highly enrichedin this <

1% region of human genome. This conﬁrms anoticeable fraction of false SNP calls using short readsare resulted from misalignment involving SVs.

Discussion

Based on the GFA assembly format [35], we proposedthe rGFA format, which deﬁnes a data model for refer-ence pangenome graphs at the same time. rGFA takesa linear reference genome as the backbone and main-tains the conceptual “linearity” of input genomes.rGFA is not the only pangenome graph model.Vg [31] encodes a stable sequence with a path throughthe sequence graph [47]. A segment in the graph mayoccur on multiple paths, or occur multiple times onone path if there are cycles in the graph. This way, vgallows diﬀerent regions in one chromosome collapsedto one segment. We call such a graph as a collapsedgraph. rGFA cannot encode a collapsed graph. The vgmodel is thus more general.In our view, however, the reference pangenome graphshould not be a collapsed graph. In a collapsed graph,the deﬁnition of orthology is not clear because multi-ple sequences from the same sample may go throughthe same segment. Without the concept of orthology,we cannot deﬁne variations, either. In addition, due tothe one-to-many relationship between segments andthe reference genome, it is intricate to derive the sta-ble coordinate of a path in a collapsed graph. Forexample, suppose segment s1 corresponds to two re-gions chr1:100-200 and chr1:500-600 . To convert a path s2 → s1 → s3 to the stable coordinate, we have to inspectadjacent segments to tell which s1 corresponds to; thisbecomes more challenging when s2 and s3 representmultiple regions in the reference genome. In contrast,rGFA inherently forbids a collapsed graph and avoidsthe potential issues above. This makes rGFA simplerthan vg’s path model and easier to work with.To demonstrate practical applications of rGFA,we developed minigraph to incrementally generatepangenome graphs. It can generate a graph from 20genomes in three hours and can scale to hundreds of genomes in future. A limitation of minigraph is that itdoes not perform base alignment and may be confusedby similar paths in the graph. We will implement basealignment to improve the mapping accuracy in suchcases. Another limitation of minigraph is that it is un-able to align sequences against a graph encoding allsmall variants. Such a graph will be composed of mil-lions of short segments. Not indexing minimizers acrosssegments, minigraph will fail to seed the initial linearchains. This limitation can only be resolved by com-pletely changing the minigraph mapping algorithm.Nonetheless, small variants are easier to analyze withthe standard methods. Incorporating these variantsunnecessarily enlarges the graph, complicates imple-mentations, increases the rate of false mappings [48]and reduces the performance of common tasks. Thereis also no known algorithm that can construct such acomplex graph for hundreds of human genomes.Minigraph does not keep track of the sample infor-mation as of now. To address this issue, we are consid-ering to implement colored rGFA, similar to coloredde Bruijn graphs [20]. In a colored rGFA, a color rep-resents one sample. Each segment or link is associatedwith one or multiple colors, indicating the sources ofthe segment or the link. Colors can be stored in anrGFA tag or in a separate segment/link-by-sample bi-nary matrix [22]. The matrix representation may bemore compact given a large number of samples.We have shown minigraph can be a fast and powerfulresearch tool to summarize SVs at the population scaleand to study the evolution of closely related species. Amore practical question is how a reference pangenomegraph may inﬂuence routine data analysis. Here is ourlimited view.We think a critical role a reference graph plays isthat it extends the coordinate system of a linear ref-erence genome. This allows us to annotate variationsin highly diverse regions such as the human HLA andKIR regions. The existing pipelines largely ignore thesevariations because most of them cannot be encoded inthe primary assembly of GRCh38.The extended graph coordinate system further helpsto consistently represent complex SVs. Given multi-ple samples, the current practice is to call SVs fromindividual samples and then merge them. Two sub-tly diﬀerent SVs, especially long insertions, may becalled at two distinct locations and treated as sepa-rate events. With the minigraph procedure, the twoSVs are likely to be aligned together as long as theyare similar to each other and are suﬃciently diﬀerentfrom the reference allele. To some extent, minigraphis performing multiple sequence alignment with par-tial order alignment [41]. This procedure is more ro-bust to diﬀerent representations of the same SV than i et al. Page 8 of 12 naive merging. When we refer to a SNP, we often useits chromosomal coordinate such as “chr1:12345”. Werarely do so for SVs because their positions are sensi-tive to alignment and SV callers. The more consistentSV representation implied by a pangenome graph willhelp to alleviate the issue and subsequently facilitatethe genotyping of SVs [49, 34, 50].While we believe a reference pangenome graph willmake complex variations more accessible by geneti-cists and biologists, we suspect a great majority ofbiomedical researchers will still rely on a linear ref-erence genome due to the conceptual simplicity of lin-ear genomes and the mature tool chains developed indecades. Many analyses such as SNP calling in well be-haved regions do not beneﬁt much from a pangenomerepresentation, either. Nonetheless, a pangenome ref-erence still helps applications based on linear refer-ences. With a graph reference, we may blacklist re-gions enriched with SVs that lead to small variantcalling errors. We may potentially generate “decoy”sequences that are missing from the primary assemblyto attract falsely mapped reads away. We may performread alignment against a graph, project the alignmentto the linear coordinate and ﬁnish the rest of analysesin the linear space. We anticipate a pangenome refer-ence to supplement the linear reference, not to replaceit.

Conclusions

Complex human sequence variations are like genomicdark matter: they are pervasive in our genomes butare often opaque to the assay with the existing tools.We envision a pangenome graph reference will becomean eﬀective means to the study of these complex vari-ations. We proposed a data model (rGFA), designedformats (rGFA and GAF) and developed companiontools (minigraph and gfatools) to demonstrate the fea-sibility of our vision. Our work is still preliminary butit is likely to set a starting point to the developmentof the next-generation graph-based tools, which mayultimately help us to understand our genomes better.

Methods

The minigraph mapping algorithm

Seeding and linear chaining

Similar to minimap2, minigraph uses minimizers onsegments as seeds. It also applies a similar chainingalgorithm but with diﬀerent scoring and with a newheuristic to speed up chaining over long distances. Forthe completeness of this article, we will describe partof the minimap2 chaining algorithm here.

Minimap2-like chaining

Formally, an anchor is a 3-tuple ( x, y, w ), representing a closed interval [ x − w +1 , x ] on a segment in the reference graph matching aninterval [ y − w + 1 , y ] on the query. Given a list ofanchors sorted by x , let f ( i ) be the maximal chainingscore up to the i -th anchor in the list. f ( i ) can becomputed by: f ( i ) = max (cid:8) max i>j ≥ { f ( j ) + α ( j, i ) − β ( j, i ) } , w i (cid:9) (1)where α ( j, i ) = min (cid:8) min { y i − y j , x i − x j } , w i (cid:9) is thenumber of matching bases between anchor i and j . β ( j, i ) is the gap penalty. Let g ji = | ( y i − y j ) − ( x i − x j ) | be the gap length and d ji = min { y i − y j , x i − x j } be thesmaller distance between the two anchors. Minigraphuses the following gap cost: β ( j, i ) =  ∞ ( g ji > G ) c · g ji + c · d ji + log g ji (0 < g ji ≤ G )0 ( g ji = 0)where G = 100000 in the graph construction mode, c = e − dw and c = 0 . · e − dw . By default, d = 0 . w =19 is the minimizer length. In comparison, minimap2applies G = 5000, c = 0 .

19 and c = 0. Minigraphallows much larger gaps between minimizers and moreheavily penalizes gaps.Solving Eq. 1 leads to an O ( n ) algorithm where n is the number of anchors. This algorithm is slow forlarge n . Minimap2 introduces heuristics to speed upthe computation by approximating this equation. Itworks well for minimap2 that only allows small gapsand has base-level alignment as a ﬁx to chaining errors.However, as minigraph intends to chain much longergaps, the minimap2 algorithm occasionally misses theoptimal alignment in long segmental duplications andproduces false variations. Minigraph introduces a newheuristic to speed up chaining. Dynamic 1-dimension Range-Min-Query

Before wemove onto the minigraph solution, we will ﬁrst intro-duce Range-Min-Query (RMQ). Given a set of 2-tuples { ( y i , s i ) } , RMQ( a, b ) returns the minimum s j among { s j : a ≤ y j ≤ b } . We implemented 1-dimension RMQwith a modiﬁed AVL tree, a type of balanced binarysearch tree (Fig. 5). When performing RMQ( a, b ), weﬁrst ﬁnd the smallest and the largest nodes within in-terval [ a, b ] using the standard algorithm. In this exam-ple, the two nodes are (21,32) and (45,21), respectively.We then traverse the path between the two nodes toﬁnd the minimum. With a balanced tree structure, wedo not need to descend into subtrees. The time com-plexity is O ( m log m ), where m is the number of nodes i et al. Page 9 of 12

Figure 5 Implementing 1-dimension Range-Min-Query(RMQ).

Given a set of 2-tuples, a binary search tree is builtfor the ﬁrst values in the tuples. Each node p in the tree isassociated with a pointer. The pointer points to the node thatis in the subtree descended from p and has the minimal secondvalue. In this example, RMQ(20 ,

50) = 14 . in the tree. We can insert nodes to or delete nodesfrom the tree while maintaining the property of thetree. This achieves dynamic RMQ. Chaining with a linear gap cost function

A linear gapcost takes the form of β (cid:48) ( j, i ) = c [( y i − y j )+( x i − x j )].Given a list of anchors ( x i , y i , w i ) sorted by position x i , let f (cid:48) ( i ) = max i > j ≥ x i − G ≤ x j ≤ x i − w i y i − G ≤ y j ≤ y i − w i (cid:8) f (cid:48) ( j ) + w j − β (cid:48) ( j, i ) (cid:9) (2)We can ﬁnd the optimal f (cid:48) ( i ) in O ( n log n ) time withRMQ [51, 52]. To see that, deﬁne h (cid:48) ( j ) = f (cid:48) ( j ) + w j + c ( y j + x j )The following condition f (cid:48) ( j ) + w j − β (cid:48) ( j, i ) > f (cid:48) ( k ) + w k − β (cid:48) ( k, i )is equivalent to h (cid:48) ( j ) > h (cid:48) ( k ), independent of i . Ifwe maintain RMQ i as the binary tree that keeps { ( y j , − h (cid:48) ( j )) : j < i, x i − G ≤ x j ≤ x i − w i } , wehave f (cid:48) ( i ) = − RMQ i ( y i − G, y i − w i ) − c ( x i + y i )This solves Eq. 2 in O ( n log n ) time. Minigraph linear chaining

While chaining with a lin-ear gap cost function can be solved eﬃciently, we pre-fer more realistic cost function used in minimap2. Inpractical implementation, when we come to anchor i ,we ﬁnd the optimal predecessor j ∗ under the desiredgap cost β ( j, i ) for anchors { j : j < i, x i − G (cid:48) ≤ x j

Minigraph generates a set of linear chains { L i } withthe procedure above that completely ignores the graphtopology. It then applies another round of chainingtaking the account of the topology.We say linear chain L i precedes L j , written as L i ≺ L j , if (1) the ending coordinate of L i on the querysequence is smaller than the ending coordinate of L j ,and (2) there is a walk from L i to L j in the graph.If there are multiple walks from L i to L j , minigraphenumerates the shortest 16 walks and chooses the walkwith its length being the closest to the query distancebetween L i and L j .Given a list of linear chains sorted by their endingcoordinates on the query sequence, let g ( i ) be the op-timal graph chaining score up to linear chain L i . Wecan compute g ( i ) with another dynamic programming: g ( i ) = max (cid:8) max L j ≺ L i { g ( j ) + ω ( L j ) − β ( j, i ) } , ω ( L i ) (cid:9) where β ( j, i ) is the weight between L i and L j . As min-igraph does not perform base-level alignment, β ( j, i ) isthe same as the gap penalty function used for linearchaining. ω ( L i ) is the optimal score of L i computedduring linear chaining.The procedure above has two limitations. First,when computing the weight between L i and L j , mini-graph largely ignores base sequences and only consid-ers the distance between them on both the query andthe graph. When there are multiple walks of similarlengths between L i and L j , minigraph miss the graphchain that leads to the best base alignment. Althoughwe added a heuristic by considering 17-mer matchesbetween the query and the graph paths, we found thisheursitc is not reliable in complex regions. Second,minigraph only enumerates the shortest 16 walks. Incomplex subgraphs, the optimal walk from L i to L j may not be among them. In the graphs produced byminigraph, these two limitations only have minor ef-fects, but anyway in future, we will implement base i et al. Page 10 of 12 alignment. We may use the current minigraph algo-rithm for easy cases and apply the more expensive basealignment when the current algorithm potentially fails.The graph chaining algorithm results in one or mul-tiple graph chains. A graph chain is a list of anchors( s i , x i , y i , w i ), where [ x i − w i + 1 , x i ] on segment s i in the graph matches [ y i − w i + 1 , y i ] on the querysequence. A graph chain satisﬁes the following condi-tions: if i < j , y i < y j ; if i < j and s i = s j , we have x i < x j ; if s i (cid:54) = s i +1 , the two segments are adjacenton the graph. It is an extension to linear chains. The minigraph graph generation algorithm

Using the minimap2 algorithm [39], minigraph identi-ﬁes a set of primary chains that do not greatly overlapwith each other on the query sequence. A region on thequery is considered to be orthogonal to the reference ifthe region is contained in a primary chain longer than100kb and it is not intersecting other primary chainslonger than 20kb.Minigraph scans primary chains in orthogonal re-gions and identiﬁes subregions where the query sub-sequences signiﬁcantly diﬀers from the correspondingreference subsequences. To achieve that, minigraphcomputes a score h i for each adjacent pair of anchors( s i , x i , y i , w i ) and ( s i +1 , x i +1 , y i +1 , w i +1 ). Let d xi bethe distance between the two anchors on the graphand d yi = y i +1 − y i be the distance on the query se-quence. h i is computed as h i = (cid:26) −

10 if d xi = d yi ≤ w i +1 η · max { d xi , d yi } otherwise (3)where η is the density of anchors averaged across allprimary graph chains. Deﬁne H ( i, j ) = (cid:80) jk = i h k . Ahighly divergent region between the query and thegraph will be associated with a large H ( i, j ). Mini-graph uses the Ruzzo-Tompa algorithm [53] to identifyall maximal scoring intervals on list ( h i ), which corre-spond to divergent regions. In each identiﬁed divergentregion, minigraph performs base alignment [54, 39] be-tween the query and the graph sequences and retainsa region if it involves an INDEL ≥ ≥ Annotating variations

We applied RepeatMasker [55] v1.332 to classify in-terspersed repeats in the longest allele sequence ofeach variation. RepeatMasker is unable to anno-tate VNTRs with long motifs. It also often inter-prets VNTRs as impure STRs. Therefore, we did not use the RepeatMasker VNTR or STR annota-tions directly. Instead, we combined RepeatMaskerand SDUST [56] results to collect low-complexity re-gions (LCRs). We identiﬁed pure tandem repeats com-posed of a motif occurring twice or more (implementedin https://github.com/lh3/etrf). An LCR is classiﬁedas VNTR if 70% of the LCR is VNTR; similarly, anLCR is classiﬁed as STR if 70% is STR; the rest areclassiﬁed as “Other-LCR” in Fig. 3. The annotationscript is available in the minigraph GitHub repository.

Creating blacklist regions

For each variation in the graph, we extend its ge-nomic interval on GRCh38 by 50bp from each end.We name this set of intervals as I . We align sequencesinserted to GRCh38 against GRCh38 with “minimap2-cxasm20 -r2k” and ﬁlter out alignments with mappingquality below 5. Let I ( a, b ) be the set of GRCh38 in-tervals that are contained in alignments with identitybetween a and b . The blacklist regions are computedby I ∪ I (0 , . \ I (0 . , ∪ ” denotes theinterval union operation and “ \ ” denotes interval sub-traction. Competing interests

The authors declare that they have no competing interests.

Author’s contributions

HL conceived the project, developed minigraph and drafted the manuscript.XF did the pseudogene analysis. CC helped with RepeatMasker annotation.All authors helped to revise the manuscript.

Acknowledgements

We are grateful to Benedict Paten and Erik Garrison for discussions onpangenome graphs. We thank minigraph users who have suggested featuresand helped to ﬁx various issues.

Funding

This work is supported by National Institutes of Health (NIH) grantU01HG010961 and R01HG010040.

Availability of data and materials

Minigraph is openly available at https://github.com/lh3/minigraph. Thisrepository also includes the script to convert from the segment coordinateto the stable coordinate, to annotate variations and to generate blacklistregions from the graph. The companion gfatools is available athttps://github.com/lh3/gfatools. The human and the great ape graphs arehosted at ftp://ftp.dfci.harvard.edu/pub/hli/minigraph/. The NA12878,NA24385 and PGP1 phased assemblies were downloaded fromftp://ftp.dfci.harvard.edu/pub/hli/whdenovo/. Other assemblies are allavailable from GenBank.

Author details Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA02215, USA. Department of Biomedical Informatics, Harvard MedicalSchool, Boston, MA 02215, USA.

References

1. Schneider, V.A., Graves-Lindsay, T., Howe, K., Bouk, N., Chen, H.-C.,Kitts, P.A., Murphy, T.D., Pruitt, K.D., Thibaud-Nissen, F., Albracht,D., Fulton, R.S., Kremitzki, M., Magrini, V., Markovic, C., McGrath,S., Steinberg, K.M., Auger, K., Chow, W., Collins, J., Harden, G.,Hubbard, T., Pelan, S., Simpson, J.T., Threadgold, G., Torrance, J.,Wood, J.M., Clarke, L., Koren, S., Boitano, M., Peluso, P., Li, H.,Chin, C.-S., Phillippy, A.M., Durbin, R., Wilson, R.K., Flicek, P., i et al. Page 11 of 12

Eichler, E.E., Church, D.M.: Evaluation of GRCh38 and de novohaploid genome assemblies demonstrates the enduring quality of thereference assembly. Genome Res (5), 849–864 (2017).doi:10.1101/gr.213611.1162. Huddleston, J., Chaisson, M.J.P., Steinberg, K.M., Warren, W.,Hoekzema, K., Gordon, D., Graves-Lindsay, T.A., Munson, K.M.,Kronenberg, Z.N., Vives, L., Peluso, P., Boitano, M., Chin, C.-S.,Korlach, J., Wilson, R.K., Eichler, E.E.: Discovery and genotyping ofstructural variation from long-read haploid genome sequence data.Genome Res (5), 677–685 (2017). doi:10.1101/gr.214007.1163. Eichler, E.E., Flint, J., Gibson, G., Kong, A., Leal, S.M., Moore, J.H.,Nadeau, J.H.: Missing heritability and strategies for ﬁnding theunderlying causes of complex disease. Nature Reviews Genetics (6),446–450 (2010). doi:10.1038/nrg28094. Li, H., Bloom, J.M., Farjoun, Y., Fleharty, M., Gauthier, L., Neale, B.,MacArthur, D.: A synthetic-diploid benchmark for accuratevariant-calling evaluation. Nat Methods (8), 595–597 (2018).doi:10.1038/s41592-018-0054-75. Wenger, A.M., Peluso, P., Rowell, W.J., Chang, P.-C., Hall, R.J.,Concepcion, G.T., Ebler, J., Fungtammasan, A., Kolesnikov, A.,Olson, N.D., et al.: Accurate circular consensus long-read sequencingimproves variant detection and assembly of a human genome. NatureBiotechnology (10), 1155–1162 (2019).doi:10.1038/s41587-019-0217-96. Audano, P.A., Sulovari, A., Graves-Lindsay, T.A., Cantsilieris, S.,Sorensen, M., Welch, A.E., Dougherty, M.L., Nelson, B.J., Shah, A.,Dutcher, S.K., Warren, W.C., Magrini, V., McGrath, S.D., Li, Y.I.,Wilson, R.K., Eichler, E.E.: Characterizing the major structural variantalleles of the human genome. Cell (3), 663–67519 (2019).doi:10.1016/j.cell.2018.12.0197. Computational Pan-Genomics Consortium: Computationalpan-genomics: status, promises and challenges. Brief Bioinform (1),118–135 (2016). doi:10.1093/bib/bbw0898. M¨akinen, V., Navarro, G., Sir´en, J., V¨alim¨aki, N.: Storage and retrievalof highly repetitive sequence collections. J Comput Biol (3),281–308 (2010). doi:10.1089/cmb.2009.01699. Liu, B., Zhu, D., Wang, Y.: deBWT: parallel construction ofBurrows–Wheeler Transform for large collection of genomes with debruijn-branch encoding. Bioinformatics (12), 174–182 (2016).doi:10.1093/bioinformatics/btw26610. Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun,T.: Preﬁx-free parsing for building big BWTs. Algorithms for MolecularBiology (1) (2019). doi:10.1186/s13015-019-0148-511. Na, J.C., Kim, H., Park, H., Lecroq, T., L´eonard, M., Mouchard, L.,Park, K.: FM-index of alignment: A compressed index for similarstrings. Theoretical Computer Science , 159–170 (2016).doi:10.1016/j.tcs.2015.08.00812. Vernikos, G., Medini, D., Riley, D.R., Tettelin, H.: Ten years ofpan-genome analyses. Curr Opin Microbiol , 148–54 (2015).doi:10.1016/j.mib.2014.11.01613. Marcus, S., Lee, H., Schatz, M.C.: SplitMEM: a graphical algorithmfor pan-genome analysis with suﬃx skips. Bioinformatics (24),3476–83 (2014). doi:10.1093/bioinformatics/btu75614. Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysiswith compressed suﬃx trees and the Burrows–Wheeler transform.Bioinformatics (4), 497–504 (2015).doi:10.1093/bioinformatics/btv60315. Beller, T., Ohlebusch, E.: A representation of a compressed de Bruijngraph for pan-genome analysis that enables search. Algorithms MolBiol , 20 (2016). doi:10.1186/s13015-016-0083-716. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.:On the representation of de bruijn graphs. J Comput Biol (5),336–52 (2015). doi:10.1089/cmb.2014.016017. Minkin, I., Pham, S., Medvedev, P.: TwoPaCo: an eﬃcient algorithmto build the compacted de Bruijn graph from many complete genomes.Bioinformatics (24), 4024–4032 (2017).doi:10.1093/bioinformatics/btw60918. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphsfrom sequencing data quickly and in low memory. Bioinformatics (12), 201–208 (2016). doi:10.1093/bioinformatics/btw27919. Almodaresi, F., Pandey, P., Patro, R.: Rainbowﬁsh: A Succinct Colored de Bruijn Graph Representation. In: Schwartz, R., Reinert, K. (eds.)17th International Workshop on Algorithms in Bioinformatics (WABI2017). Leibniz International Proceedings in Informatics (LIPIcs), vol.88, pp. 18–11815. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik,Dagstuhl, Germany (2017). doi:10.4230/LIPIcs.WABI.2017.1820. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P., McVean, G.: De novoassembly and genotyping of variants using colored de bruijn graphs.Nat Genet (2), 226–32 (2012). doi:10.1038/ng.102821. Muggli, M.D., Alipanahi, B., Boucher, C.: Building large updatablecolored de bruijn graphs via merging. Bioinformatics (14), 51–60(2019). doi:10.1093/bioinformatics/btz35022. Holley, G., Melsted, P.: Bifrost – highly parallel construction andindexing of colored and compacted de bruijn graphs. bioRxiv (2019).doi:10.1101/69533823. Liu, B., Guo, H., Brudno, M., Wang, Y.: deBGA: read alignment withde Bruijn graph-based seed and extension. Bioinformatics (21),3224–3232 (2016). doi:10.1093/bioinformatics/btw37124. Almodaresi, F., Sarkar, H., Srivastava, A., Patro, R.: A space andtime-eﬃcient index for the compacted colored de bruijn graph.Bioinformatics (13), 169–177 (2018).doi:10.1093/bioinformatics/bty29225. Sir´en, J., Garrison, E., Novak, A.M., Paten, B., Durbin, R.:Haplotype-aware graph indexes. Bioinformatics (2019).doi:10.1093/bioinformatics/btz57526. Dilthey, A., Cox, C., Iqbal, Z., Nelson, M.R., McVean, G.: Improvedgenome inference in the MHC using a population reference graph.Nature Genetics (6), 682–688 (2015). doi:10.1038/ng.325727. Dilthey, A.T., Mentzer, A.J., Carapito, R., Cutland, C., Cereb, N.,Madhi, S.A., Rhie, A., Koren, S., Bahram, S., McVean, G., et al.:HLA*LA–HLA typing from linearly projected graph alignments.Bioinformatics (21), 4394–4396 (2019).doi:10.1093/bioinformatics/btz23528. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E.,DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry,S.T., McVean, G., Durbin, R., 1000 Genomes Project Analysis Group:The variant call format and vcftools. Bioinformatics (15), 2156–8(2011). doi:10.1093/bioinformatics/btr33029. Eggertsson, H.P., Jonsson, H., Kristmundsdottir, S., Hjartarson, E.,Kehr, B., Masson, G., Zink, F., Hjorleifsson, K.E., Jonasdottir, A.,Jonasdottir, A., Jonsdottir, I., Gudbjartsson, D.F., Melsted, P.,Stefansson, K., Halldorsson, B.V.: Graphtyper enables population-scalegenotyping using pangenome graphs. Nat Genet (11), 1654–1660(2017). doi:10.1038/ng.396430. Rakocevic, G., Semenyuk, V., Lee, W.-P., Spencer, J., Browning, J.,Johnson, I.J., Arsenijevic, V., Nadj, J., Ghose, K., Suciu, M.C., et al.:Fast and accurate genomic analyses using genome graphs. NatureGenetics (2), 354–362 (2019). doi:10.1038/s41588-018-0316-431. Garrison, E., Sir´en, J., Novak, A.M., Hickey, G., Eizenga, J.M.,Dawson, E.T., Jones, W., Garg, S., Markello, C., Lin, M.F., Paten, B.,Durbin, R.: Variation graph toolkit improves read mapping byrepresenting genetic variation in the reference. Nat Biotechnol (9),875–879 (2018). doi:10.1038/nbt.422732. Sibbesen, J.A., Maretty, L., Danish Pan-Genome Consortium, Krogh,A.: Accurate genotyping across variant classes and lengths usingvariant graphs. Nat Genet (7), 1054–1059 (2018).doi:10.1038/s41588-018-0145-533. Biederstedt, E., Oliver, J.C., Hansen, N.F., Jajoo, A., Dunn, N., Olson,A., Busby, B., Dilthey, A.T.: NovoGraph: Human genome graphconstruction from multiple long-read de novo assemblies. F1000Res ,1391 (2018). doi:10.12688/f1000research.15895.234. Eggertsson, H.P., Kristmundsdottir, S., Beyter, D., Jonsson, H.,Skuladottir, A., Hardarson, M.T., Gudbjartsson, D.F., Stefansson, K.,Halldorsson, B.V., Melsted, P.: Graphtyper2 enables population-scalegenotyping of structural variation using pangenome graphs. NatureCommunications (1) (2019). doi:10.1038/s41467-019-13341-935. Li, H.: Minimap and miniasm: fast mapping and de novo assembly fornoisy long sequences. Bioinformatics (14), 2103–10 (2016).doi:10.1093/bioinformatics/btw15236. Pevzner, P.A., Tang, H., Waterman, M.S.: An eulerian path approachto dna fragment assembly. Proc Natl Acad Sci U S A (17), 9748–53(2001). doi:10.1073/pnas.171285098 i et al. Page 12 of 12

37. Gnerre, S., Maccallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N.,Walker, B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., Berlin, A.M.,Aird, D., Costello, M., Daza, R., Williams, L., Nicol, R., Gnirke, A.,Nusbaum, C., Lander, E.S., Jaﬀe, D.B.: High-quality draft assembliesof mammalian genomes from massively parallel sequence data. ProcNatl Acad Sci U S A (4), 1513–8 (2011).doi:10.1073/pnas.101735110838. Rautiainen, M., Marschall, T.: GraphAligner: Rapid and versatilesequence-to-graph alignment. bioRxiv (2019). doi:10.1101/81081239. Li, H.: Minimap2: pairwise alignment for nucleotide sequences.Bioinformatics (18), 3094–3100 (2018).doi:10.1093/bioinformatics/bty19140. Ono, Y., Asai, K., Hamada, M.: PBSIM: PacBio readssimulator–toward accurate genome assembly. Bioinformatics (1),119–21 (2013). doi:10.1093/bioinformatics/bts64941. Lee, C., Grasso, C., Sharlow, M.F.: Multiple sequence alignment usingpartial order graphs. Bioinformatics (3), 452–464 (2002).doi:10.1093/bioinformatics/18.3.45242. Robinson, J.T., Thorvaldsd´ottir, H., Winckler, W., Guttman, M.,Lander, E.S., Getz, G., Mesirov, J.P.: Integrative genomics viewer. NatBiotechnol (1), 24–6 (2011). doi:10.1038/nbt.175443. Garg, S., Fungtammasan, A., Carroll, A., Chou, M., Schmitt, A.,Zhou, X., Mac, S., Peluso, P., Hatas, E., Ghurye, J., Maguire, J.,Mahmoud, M., Cheng, H., Heller, D., Zook, J.M., Moemke, T.,Marschall, T., Sedlazeck, F.J., Aach, J., Chin, C.-S., Church, G.M., Li,H.: Eﬃcient chromosome-scale haplotype-resolved assembly of humangenomes. bioRxiv (2019). doi:10.1101/81034144. Mathews, L.M., Chi, S.Y., Greenberg, N., Ovchinnikov, I., Swergold,G.D.: Large diﬀerences between LINE-1 ampliﬁcation rates in thehuman and chimpanzee lineages. Am J Hum Genet (3), 739–48(2003). doi:10.1086/36827545. Amemiya, H.M., Kundaje, A., Boyle, A.P.: The ENCODE blacklist:Identiﬁcation of problematic regions of the genome. Sci Rep (1),9354 (2019). doi:10.1038/s41598-019-45839-z46. Depristo, M.A., Banks, E., Poplin, R., Garimella, K.V., Maguire, J.R.,Hartl, C., Philippakis, A.A., Del Angel, G., Rivas, M.A., Hanna, M.,McKenna, A., Fennell, T.J., Kernytsky, A.M., Sivachenko, A.Y.,Cibulskis, K., Gabriel, S.B., Altshuler, D., Daly, M.J.: A framework forvariation discovery and genotyping using next-generation dnasequencing data. Nat Genet (5), 491–8 (2011). doi:10.1038/ng.80647. Llamas, B., Narzisi, G., Schneider, V., Audano, P., Biederstedt, E.,Blauvelt, L., Bradbury, P., Chang, X., Chin, C., Fungtammasan, A.,Clarke, W., Cleary, A., Ebler, J., Eizenga, J., Sibbesen, J., Markello,C., Garrison, E., Garg, S., Hickey, G., Lazo, G., Lin, M., Mahmoud,M., Marschall, T., Minkin, I., Monlong, J., Musunuri, R., Sagayaradj,S., Novak, A., Rautiainen, M., Regier, A., Sedlazeck, F., Siren, J.,Souilmi, Y., Wagner, J., Wrightsman, T., Yokoyama, T., Zeng, Q.,Zook, J., Paten, B., Busby, B.: A strategy for building and using ahuman reference pangenome [version 1; peer review: 1 approved, 1approved with reservations]. F1000Research (1751) (2019).doi:10.12688/f1000research.19630.148. Pritt, J., Chen, N.-C., Langmead, B.: Forge: prioritizing variants forgraph genomes. Genome Biology (1) (2018).doi:10.1186/s13059-018-1595-x49. Hickey, G., Heller, D., Monlong, J., Sibbesen, J.A., Sir´en, J., Eizenga,J., Dawson, E.T., Garrison, E., Novak, A.M., Paten, B.: Genotypingstructural variants in pangenome graphs using the vg toolkit. GenomeBiology (1) (2020). doi:10.1186/s13059-020-1941-750. Chen, S., Krusche, P., Dolzhenko, E., Sherman, R.M., Petrovski, R.,Schlesinger, F., Kirsche, M., Bentley, D.R., Schatz, M.C., Sedlazeck,F.J., et al.: Paragraph: a graph-based structural variant genotyper forshort-read sequence data. Genome Biology (1) (2019).doi:10.1186/s13059-019-1909-751. Abouelhoda, M.I., Ohlebusch, E.: A local chaining algorithm and itsapplications in comparative genomics. In: Benson, G., Page, R.D.M.(eds.) Algorithms in Bioinformatics, Third International Workshop,WABI 2003, Budapest, Hungary, September 15-20, 2003, Proceedings,vol. 2812, pp. 1–16. Springer, ??? (2003).doi:10.1007/978-3-540-39763-2 152. Otto, C., Hoﬀmann, S., Gorodkin, J., Stadler, P.F.: Fast localfragment chaining using sum-of-pair gap costs. Algorithms Mol Biol , 4 (2011). doi:10.1186/1748-7188-6-453. Ruzzo, W.L., Tompa, M.: A linear time algorithm for ﬁnding allmaximal scoring subsequences. In: Lengauer, T., Schneider, R., Bork,P., Brutlag, D.L., Glasgow, J.I., Mewes, H., Zimmer, R. (eds.)Proceedings of the Seventh International Conference on IntelligentSystems for Molecular Biology, August 6-10, 1999, Heidelberg,Germany, pp. 234–241. AAAI, ??? (1999)54. Suzuki, H., Kasahara, M.: Introducing diﬀerence recurrence relationsfor faster semi-global alignment of long sequences. BMC Bioinformatics (Suppl 1), 45 (2018). doi:10.1186/s12859-018-2014-855. Tarailo-Graovac, M., Chen, N.: Using repeatmasker to identifyrepetitive elements in genomic sequences. Curr Protoc Bioinformatics Chapter 4 , 4–10 (2009)56. Morgulis, A., Gertz, E.M., Sch¨aﬀer, A.A., Agarwala, R.: A fast andsymmetric dust implementation to mask low-complexity dna sequences.J Comput Biol13