AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

Abstract

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7x, 6.6x, and 2.8x for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set.


AirLift: A Fast and Comprehensive Technique for Translating Alignments between Reference Genomes

Jeremie S. Kim, Can Firtina, Damla Senol Cali, Mohammed Alser, Nastaran Hajinazar, Can Alkan‡, and Onur Mutlu‡

Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh 15213, PA, USA
Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
Department of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
‡ Corresponding authors: [email protected] and [email protected]

Abstract.

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes. A more accurate reference genome enables increased accuracy in read mappings, which provides more accurate variant information and thus health data on the donor. Therefore, read data sets from sequenced samples should ideally be mapped to the latest available reference genome. Unfortunately, the increasingly large amounts of available genomic data make it prohibitively expensive to fully map each read data set to its respective reference genome every time the reference is updated. Several tools that attempt to reduce the cost of updating a read data set from one reference to another (i.e., remapping) have been published. These tools identify regions of similarity across the two references and update the mapping locations of a read based on the locations of similar regions in the new reference genome. The main drawback of existing approaches is that if a read maps to a region in the old reference without similar regions in the new reference, it cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for moving alignments from one genome to another. AirLift can reduce 1) the number of reads that need to be mapped from the entire read set by up to 99.9% and 2) the overall execution time to remap the reads between the two most recent reference versions by 6.94×, 44.0×, and 16.4× for large (human), medium (C. elegans), and small (yeast) references, respectively.

Code Availability.

The AirLift source code is available at https://github.com/CMU-SAFARI/AirLift.

Keywords:

Genome Mapping, Genome Assembly, Remapping, Crossmapping, LiftOver

arXiv: [q-bio.GN] Dec

Reference genomes are inaccurate and do not perfectly represent the average healthy individual of the population for a variety of reasons. First, reference genomes are derived primarily from individuals that do not necessarily represent the population and are missing a substantial amount of sequences [17, 23]. Second, they are constructed using imperfect sequencing technologies that result in error-prone reads [16]. Third, the resulting reads (i.e., read set) are assembled into a reference genome using imperfect assembly tools [6, 25]. As genome sequencing tools and assembly algorithms improve, and as more sequenced samples become available, researchers are able to incrementally assemble more accurate reference genomes. As an example, the Genome Reference Consortium (GRC) reviews minor updates to the human reference genome for release every three months and releases major updates every few years. These updates are critical to the accuracy of the reference genome, with the latest reference genome having the most complete and most accurate annotations and sequences. Therefore, the original locations that each read was likely sequenced from should be found (i.e., read mapping) using the latest reference genome of its species to maintain an accurate downstream genome analysis [12] and to obtain the most accurate health data on the sample.

Currently, the best way to adapt an existing genomic study (i.e., read sets from many samples) to a new reference genome is to re-run the entire analysis pipeline using the new reference genome. For example, the original analysis of the 1000 Genomes Project was completed using human reference genome build 37 (GRCh37) [3]. After the next version of the reference (GRCh38) became available, each read set was mapped again to the new reference [35].
Unfortunately, this approach is computationally very expensive and does not scale to large genomic studies that include a large number of individuals for three key reasons. First, mapping even a single read set is expensive [8, 19] (e.g., 15 minutes for aligning 3,000,000 short reads or 0.1× coverage of the human genome) and heavily relies on an expensive alignment algorithm. Second, the number of sequenced samples (i.e., read sets) doubles approximately every 8 months [7, 30], and the rate of growth will only increase as long and ultra long sequencing technologies enable building assemblies with better contiguity (e.g., Nanopore [22]) and become more cost effective. Third, researchers are finding that reference genomes should be more comprehensive in representing diverse populations and ethnic groups [4, 5, 13, 21, 23, 24, 31, 32]. This may lead to having multiple reference genomes (or alternate subsequences) representing the same species to which each read set must be mapped in order to correctly identify the genome variants.

To reduce the overhead in fully mapping a read set to a new reference genome, several prior tools [9, 10, 18, 26, 27, 29, 33, 34] can be used to quickly update or remap the locations of the existing mappings of the reads (i.e., coordinates) in the original (old) reference genome to a different (new) reference genome at the cost of coverage (i.e., the percentage of the new reference genome that reads map to). In the remainder of the paper we collectively refer to such methods as remapping tools. Existing state-of-the-art remapping tools cannot provide high coverage when converting the mapping coordinates from one reference genome to another as they do not account for reads that map to regions in the old reference that are repetitive or have significant changes in the new reference genome.
We observe that because of this limitation, state-of-the-art remapping tools can miss up to 7% of gene annotations when remapping reads from human reference genome build 16 to the latest human reference genome build (GRCh38). This limitation requires researchers to re-run the full genome analysis pipeline for each read set on an updated reference genome for a comprehensive study.

Our goal is to provide a technique that substantially 1) reduces the execution time to remap a read set from an (old) reference genome to a (new) reference genome, and 2) provides high coverage by also accounting for reads that map to regions that are repetitive or significantly different in the new reference genome. To this end, we propose AirLift, a methodology that leverages the similarity between two reference genomes to reduce the execution time to map a read set from one reference genome to another while maintaining a high coverage similar to fully mapping a read set to the new reference.

The key idea is to exploit the similarity between a pair of reference genomes in order to identify the cheapest method for moving reads. AirLift selects the cheapest method depending on the region that the read aligns to in the old reference and how that region relates to the new reference. This is done by generating lookup tables for each pair of reference genomes (old and new) that map locations of similar regions (with high error acceptance rates) between the references. The creation of these lookup tables is a one-time effort and, once created, they can be reused for any number of reads. AirLift then uses these lookup tables to categorize all reads and remaps them accordingly.

We evaluate AirLift by comparing against BWA-MEM [14] fully mapping a read set to the new reference across various versions of the human (i.e., GRCh37 and GRCh38), Caenorhabditis elegans (i.e., ce1, ce2, ce4, ce6, ce10, and ce11), and yeast (i.e., sacCer1, sacCer2, and sacCer3) reference genomes. Based on our evaluation we provide two major results. First, we show that AirLift reduces the overall number of reads that need to be remapped from the original read set by up to 99.9%. Second, AirLift reduces the overall runtime required to remap the reads from the old reference genome to the new reference genome by 6.94×, 44.0×, and 16.4× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively, when we use the reduced read set suggested by AirLift. We conclude that AirLift significantly reduces the overhead of updating the already mapped reads to a new reference genome while still accounting for the significant changes in the new reference genome.
To the best of our knowledge, this is the first work that provides a comprehensive remapping of reads from one reference genome to another. While there are many works that are considered remapping tools, none of them provide a comprehensive mapping to a new reference genome. We explain the subtle differences between each of the remapping tools below.

UCSC LiftOver.

One of the most commonly used remapping tools is UCSC LiftOver [29]. UCSC LiftOver uses a chain file [1] between two different assemblies of a genome to convert the coordinates from one assembly to the assembly of the other genome. UCSC LiftOver suffers from three major shortcomings. First, UCSC LiftOver functionality is limited to the genomes whose assemblies are provided by the UCSC Genome Browser [28], making it impossible to remap genomes whose assemblies are not yet included in the tool. Second, the tool only converts the coordinates of regions within the old reference genome that are highly similar to regions within the updated reference genome and ignores regions with significant variance (as we show in Section 3), which prevents a comprehensive remapping of the coordinates. Third, UCSC LiftOver only supports BED-format (i.e., browser extensible data) input files, which limits its usage even further.

CrossMap.

One alternative to UCSC LiftOver is CrossMap [33, 34]. CrossMap follows a similar approach to UCSC LiftOver and uses chain files to convert mappings from an older reference genome to a newer reference genome. Compared to UCSC LiftOver, CrossMap supports a larger set of input file formats, such as BAM, SAM, CRAM, BED, Wiggle, BigWig, GFF (i.e., general feature format), GTF (i.e., gene transfer format), and VCF (i.e., variant call format) [33, 34]. Unfortunately, CrossMap suffers from similar limitations as UCSC LiftOver.

NCBI Genome Remapping Service.

Another alternative is NCBI Genome Remapping Service [18], which also remaps the annotations from one genome assembly to another. NCBI Remap has support for a larger set of input/output file formats, such as BED, GFF, GTF, and VCF. NCBI Remap can also perform cross species remapping for a limited number of organisms. However, as with UCSC LiftOver, NCBI Remap is limited by the provided assemblies.

Segment liftover.

Segment liftover [9, 10] is another tool that is designed to map coordinates of one genome assembly to another genome's assembly while maintaining the integrity of the genome segments that are not continuous anymore in the target assembly.

Galaxy.

Galaxy [11, 26] is a web-based platform, which has LiftOver as part of its toolset. This tool is based on UCSC LiftOver [29] and the chain files provided by UCSC Genome Browser [28]. Thus, Galaxy also suffers from similar limitations as UCSC LiftOver.

PyLiftover.

PyLiftover [27] is a Python implementation of a limited version of UCSC LiftOver. PyLiftover does not convert ranges (i.e., only converts point coordinates) between different assemblies, and it does not support BED-format input files.

Bazam.

Bazam [20] is another tool that remaps short paired reads by optimizing memory usage while providing high parallelism. However, Bazam only targets the steps where reads are read from a BAM or CRAM file (i.e., read extraction) and sent to an aligner (e.g., BWA [15]). Eventually, all the reads are remapped to the new reference genome, which is inefficient.

Repeating a genomic study using another version of the reference genome is computationally very expensive. A faster and more convenient way to achieve this is to "remap" the mapping locations from the older reference genome to its updated version [9, 10, 18, 26, 27, 29, 33, 34]. We evaluate the effectiveness of one of the state-of-the-art tools, UCSC LiftOver [29], in updating the mapping information from one version of the human reference genome to another version. We present the evaluation result in Figure 1, where we show the amount of information lost when remapping from one human reference genome version (x-axis) to the latest human reference genome version (hg38). The y-axis shows the percentage of annotations (labeled and marked with unique colors) missed when remapping using UCSC LiftOver. We make two key observations based on Figure 1. First, we observe that a significant portion (> […] all crucial. However, prior works mainly focus on the speed at the cost of both accuracy and coverage. These crossmap tools are often very inaccurate and can only lift mappings or annotations for regions with minor changes [35]. Therefore, if researchers want a comprehensive study using a new reference genome, they must map the entire read data set to the new reference genome rather than rely on the results of such crossmap tools [35]. Due to the high similarity between the old and new reference genomes, we believe we can use information from the old mapping to very quickly map a read data set to an updated reference genome. Our goal is to produce a method for quickly remapping the reads of a sample from one reference genome to an updated version of the reference genome or another similar reference genome.

Fig. 1.

Percentage of different annotations missed when remapping reads from an old reference (x-axis) to the latest reference (hg38), using UCSC LiftOver [29]. (x-axis: Human Genome Version Mapped to hg38, ordered by release date: hg16, hg17, hg18, hg19; y-axis: % Annotations Missed in Crossmap; series: Genes, Transcripts, Exons, CDS, Start Codons, Stop Codons.)

Table 1.

Lost information when crossmapping across reference genome assemblies using UCSC LiftOver.

Old Reference   New Reference   gene   exon   stop codon   CDS    start codon   transcript
hg19            hg19            –      –      –            –      –             –
hg19            hg38            4.47   0.74   0.50         0.59   0.53          4.24

Between each pair of reference genomes, we indicate the exact values of specific annotation types (e.g., gene, exon, stop codon, CDS, start codon, transcript) that are "lost" when using UCSC LiftOver [29] on a read data set from an old reference (hg16, hg17, hg18, hg19) to a new reference (hg19, hg38). Briefly, 3.07% of the gene model coordinates in the hg16 assembly are not found in hg19, and the loss rate of genes reaches 4.47% between the two most recent assembly versions (hg19 and hg38).

In this section, we describe AirLift, our technique for quickly mapping a read set from one reference genome to another. The key idea behind AirLift is to generate fixed lookup tables (LUTs) for each pair of reference genomes (old and new) that map locations of similar regions between the references. For a read that maps to a location in the old reference genome, we can query the LUT to quickly identify potential locations for mapping in the new reference genome. Depending on where the read mapped to in the old reference, we update the mapping location using different methods. We next define these regions, show how to generate the fixed lookup tables, and then explain how to use these LUTs to quickly and comprehensively remap a read set.

Fig. 2.

Reference Genome Regions. (Rows: Old Reference Genome, New Reference Genome; legend: Constant Region, Updated Region, Retired Region, New Region.)

We compare two reference genomes with large sequences (i.e., regions). We identify four types of regions (shown in Figure 2) that fully describe the relationship between two reference genomes:

1. A constant region is a region of the genome which is exactly the same in both old and new reference genomes (blue).
2. An updated region is a region in the old reference genome that maps to at least one region in the new reference genome within reasonable error rates (orange with some differences marked with black bars).
3. A retired region is a region in the old reference genome that does not map to any region in the new reference genome (red).
4. A new region is a region in the new reference genome that does not map to any region in the old reference genome (green).

We next describe how we identify and use these regions to quickly and comprehensively remap a read set.
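As an illustration of the four region types, the following minimal Python sketch labels intervals of an old reference and looks up the region that a mapping position falls into. The interval layout, the coordinates, and the names `Region`, `OLD_REFERENCE_REGIONS`, and `region_at` are hypothetical and chosen only for this example; they are not taken from the AirLift implementation.

```python
from bisect import bisect_right
from enum import Enum


class Region(Enum):
    CONSTANT = "constant"
    UPDATED = "updated"
    RETIRED = "retired"
    NEW = "new"


# Hypothetical annotation of an old reference: sorted, non-overlapping,
# half-open intervals (start, end, region type). NEW never appears here
# because new regions exist only in the new reference.
OLD_REFERENCE_REGIONS = [
    (0, 1000, Region.CONSTANT),
    (1000, 1200, Region.UPDATED),
    (1200, 1250, Region.RETIRED),
    (1250, 5000, Region.CONSTANT),
]


def region_at(position, regions=OLD_REFERENCE_REGIONS):
    """Return the region type covering `position`, or None if uncovered."""
    starts = [start for start, _, _ in regions]
    index = bisect_right(starts, position) - 1
    if index < 0:
        return None
    start, end, kind = regions[index]
    return kind if start <= position < end else None
```

Under these assumed intervals, `region_at(1100)` returns `Region.UPDATED`, which is the information a remapping step needs in order to decide how to treat a read.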

We propose to generate lookup tables (LUTs) to aid in the efficient mapping of reads from one reference genome to another reference genome. Figure 3 shows the methodology for creating the LUTs. Starting with the old and new reference genomes, we must first either acquire an available chain file [1] (e.g., [2]) or (1) generate our own. We create our chain file by running global alignment without errors between the two reference genomes. This chain file shows where exact sequences from the old reference genome can be found in the new reference genome. We refer to regions that match perfectly across the old and new reference genomes as constant regions (blue). Next, we (2) extract seeds (i.e., smaller subsequences) from regions in the new reference that do not align exactly (non-blue regions). Note that these seeds a) are the same length (N) as the reads that we want to remap, and b) are completely overlapping sequences (providing N× coverage on the region). Next, we (3) align the extracted seeds to the old reference genome to identify regions of approximate similarity across the reference genomes. Note that this alignment can be done with any read mapper. The regions (in the old reference genome) that the extracted seeds align to and the regions (in the new reference genome) that the aligned seeds were extracted from are considered updated regions (orange). Since it is an approximate mapping, we indicate differences between the updated regions with black bars. While we describe in more detail how we use these regions later, we can quickly tell that if a read mapped to an updated region in the old reference genome, there is a high chance that the read will map to the respective updated region in the new reference genome. In order to guarantee a comprehensive mapping between updated regions, we map the extracted seeds with an error rate of 2e, where e is the acceptable error rate for an alignment to be considered a match.
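The seed extraction in step (2) can be sketched as a sliding window. This is a minimal sketch under two assumptions that are ours rather than the paper's: consecutive seeds are offset by one base, and seeds may start up to N − 1 bases before the region so that every base inside the region is covered by N seeds. The function name `extract_seeds` and its parameters are hypothetical.

```python
def extract_seeds(reference, region_start, region_end, seed_length):
    """Yield (start, seed) for every window overlapping [region_start, region_end).

    Consecutive seeds are offset by one base. Windows are clipped to the
    reference bounds, so bases near a sequence end may be covered by fewer
    than `seed_length` seeds.
    """
    first = max(0, region_start - (seed_length - 1))
    last = min(len(reference) - seed_length, region_end - 1)
    for start in range(first, last + 1):
        yield start, reference[start:start + seed_length]
```

For example, `list(extract_seeds("ACGTACGTAC", 4, 6, 3))` yields the seeds starting at positions 2 through 5, so both bases of the region are each covered by three seeds.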
Fig. 3.

In order for AirLift to map any number of reads from an old reference genome to a new reference genome, AirLift must preprocess lookup tables between the two references in the 8 steps enumerated: (1) exact global alignment between the two references; (2) extract seeds from regions that do not align exactly; (3) align extracted seeds to the old reference; (4) check alignments to initially define regions (seeds from the new reference do not map to a retired region, and seeds from a new region do not map to the old reference); (5) extract seeds from retired regions; (6) map seeds from retired to constant regions (matching regions become updated regions); (7) get the constant regions LUT between references; (8) get the updated regions LUT between references.

Fig. 4.

In order to comprehensively account for possible mappings of a read that previously mapped to an old reference genome, we create a lookup table describing the similarity between two reference genomes, using 2× the alignment error acceptance rate. If a read aligns to a location in the old reference genome with a 5% error rate, it is possible for the same read to map to a location in the new reference genome (with a 5% error rate) whose sequence is 10% different from the sequence in the old reference genome.

Original read:                     ACGTACGTCAAGATACAGAG
Updated region in old reference:   ACGTACGTGAAGATACAGAG
Updated region in new reference:   ACGTACGTCAAGATAGAGAG

Figure 4 shows the worst-case example where a read (of length 20) aligns to a subsequence in the updated region of the old reference genome with an e = 5% error rate (mismatch on the 9th base pair), and also aligns to a subsequence in the updated region of the new reference genome with an e = 5% error rate (mismatch on the 16th base pair). In the case where we only have the alignment information between the read and a subsequence from the updated region of the old reference genome, we can quickly identify potential mapping locations in the new reference genome (with an error rate of e) given mappings between the updated regions of the old and new references with an error rate of 2e. We also note that there are regions from the old reference where extracted seeds do not map anywhere in the new reference and regions in the new reference where no extracted seeds map to. We must (4) check the alignments of all the extracted seeds to determine which bucket each region falls into. We refer to a region whose extracted seeds do not map to the old reference genome at all as a new region, since the region or anything similar to the region does not exist in the old reference genome. We refer to a region in the old reference that has no seeds mapping to it as a retired region, since the region or anything similar does not exist in the new reference genome.
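The worst case in Figure 4 can be checked numerically. The sketch below uses the three 20-base sequences from the figure and counts mismatches only; treating the error rate as a plain mismatch fraction (no insertions or deletions) is a simplifying assumption, and the helper name `mismatch_rate` is hypothetical.

```python
def mismatch_rate(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Sequences from Figure 4 (each 20 bases long).
read = "ACGTACGTCAAGATACAGAG"
old_region = "ACGTACGTGAAGATACAGAG"  # differs from the read at base 9
new_region = "ACGTACGTCAAGATAGAGAG"  # differs from the read at base 16

read_vs_old = mismatch_rate(read, old_region)       # 1 mismatch in 20 bases
read_vs_new = mismatch_rate(read, new_region)       # 1 mismatch in 20 bases
old_vs_new = mismatch_rate(old_region, new_region)  # 2 mismatches in 20 bases
```

Because the two mismatches fall on different bases, the old and new regions differ by the sum of the two read-level error rates. This is why a region-to-region tolerance of 2e is enough to keep every candidate location that a read could still reach with error rate e.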
Next, we (5) check to see whether regions within our recently-identified retired regions can be mapped to constant regions, since we had only previously checked them against the non-constant regions. We extract overlapping seeds from the retired region and (6) map them to the constant regions in the old reference genome. For any seeds that result in a match, we add the respective constant region in the new reference to the updated region and relabel the retired region in the old reference as an updated region. Following these procedures, we generate two lookup tables that aid in remapping a read set from the old reference to the new reference. (7) shows the constant regions lookup table (LUT), which essentially follows the format of a standard chain file describing how large regions map directly between reference genomes. (8) shows the updated regions lookup table (LUT), which contains the mappings of each seed location from updated locations in the old reference genome to the mapping locations in the new reference genome. We can use both of these LUTs (i.e., constant regions LUT and updated regions LUT) to remap all reads from the old reference to the new reference quickly and comprehensively. Since each location in the old reference genome can map to multiple locations in the new reference genome, these lookup tables are organized as maps in which each key is the location of the read in the old reference genome and each value is a list of locations that the key can map to in the new reference genome. We query these lookup tables with the location of a read that was mapped to the old reference genome to quickly obtain a list of potential locations in the new reference genome to map the read to.
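The map organization described above (one old location as key, a list of new locations as value) can be sketched with a plain dictionary. The function names, the `(chromosome, offset)` key layout, and the sample coordinates are hypothetical and serve only to illustrate the one-to-many lookup; the on-disk format of the actual LUTs is not specified here.

```python
from collections import defaultdict


def build_updated_regions_lut(seed_alignments):
    """Group (old_position, new_position) pairs into old -> [new, ...]."""
    lut = defaultdict(list)
    for old_position, new_position in seed_alignments:
        lut[old_position].append(new_position)
    return dict(lut)


def candidate_locations(lut, old_position):
    """Return candidate new-reference positions for a read's old position."""
    return lut.get(old_position, [])


# Hypothetical seed alignments: (chromosome, offset) in old -> new reference.
alignments = [
    (("chr1", 1000), ("chr1", 1042)),
    (("chr1", 1000), ("chr2", 77)),
    (("chr1", 1001), ("chr1", 1043)),
]
lut = build_updated_regions_lut(alignments)
```

Querying `candidate_locations(lut, ("chr1", 1000))` returns both candidate positions, and a position that is absent from the table returns an empty list.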
Fig. 5.

Cases for AirLifting reads between two reference genomes. (Diagram labels: Read Data Set; Old Reference Genome with Constant Regions, Updated Regions, and Retired Regions; Matches; No Matches; Constant Regions LUT; Updated Regions LUT; New Regions; Indicate no mappings; New Reference Genome.)

We identify four independent cases for AirLifting reads from the old reference genome to the new reference genome that we must handle to fully map a read set (highlighted in Figure 5): (1) a read that maps to a constant region in the old reference genome, (2) a read that maps to an updated region in the old reference genome, (3) a read that maps to a retired region in the old reference genome, and (4) a read that never mapped anywhere in the old reference genome. For a read falling in case (1), we simply translate the mapping locations according to the offset in the specific constant region from the old reference to the new reference. Since this is the extent of the capabilities of existing state-of-the-art remapping tools, we can perform this step with any of these tools (e.g., LiftOver, CrossMap). For a read falling in case (2), we simply query the updated regions LUT and align the read to the returned locations. We then return locations that align with an error rate smaller than e. For a read falling in case (3), we know that it will not map anywhere in the new reference genome, so we can mark it as an unmapped read. For a read falling in case (4), since we have no prior knowledge about the read other than the fact that it never mapped anywhere in the old reference genome, we must fully map the read to each new and updated region in the new reference genome (using any read mapper), since we know that the constant regions did not result in matches in previous attempts. We next discuss our methodology for evaluating AirLift.
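The four cases can be summarized as a dispatch over the region that a read originally mapped to. The sketch below is illustrative only: positions are plain integers, the constant regions LUT is reduced to a per-position offset, and `align` and `full_map` are hypothetical callbacks standing in for a real aligner and read mapper.

```python
def airlift_read(read, old_position, constant_lut, updated_lut,
                 retired_positions, align, full_map):
    """Return a list of new-reference positions for one read.

    constant_lut: old position -> offset to add (case 1)
    updated_lut: old position -> candidate new positions (case 2)
    retired_positions: set of old positions in retired regions (case 3)
    align(read, position) -> True if the read aligns within error rate e
    full_map(read) -> positions found by a read mapper (case 4)
    """
    if old_position is None:               # case 4: never mapped before
        return full_map(read)
    if old_position in constant_lut:       # case 1: shift by the offset
        return [old_position + constant_lut[old_position]]
    if old_position in updated_lut:        # case 2: verify each candidate
        return [p for p in updated_lut[old_position] if align(read, p)]
    if old_position in retired_positions:  # case 3: mark as unmapped
        return []
    raise KeyError(f"position {old_position} is not in any region")
```

Only cases (2) and (4) invoke an aligner in this sketch, which reflects why the share of reads outside constant regions determines most of the remapping cost.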

Evaluated Read Mappers.

We evaluate AirLift using BWA-MEM [14] and Bazam [20]. Note that Bazam is not a standalone read mapper; instead, it facilitates fast extraction of reads from an input BAM file and utilizes BWA-MEM for the actual mapping in a streaming fashion.

Evaluation System.

We run our entire toolchain on a server with 24 cores (2 threads per core, Intel Xeon Gold 5118 CPU @ 2.30GHz) and 192GB of memory. We assign 32 threads to all the tools we use and collect their runtime and memory usage using the time command in Linux with the -vp options. We report runtime and peak memory usage of our evaluations based on these configurations.

Evaluated Reference Genomes.

We study the effects of AirLift on versions of reference genomes of varying size across 3 species (i.e., human, C. elegans, yeast) as shown in Table 2. We study a mix of species to show the effects of AirLift on reference genomes of varying sizes.

Table 2.

Details of the reference genomes that we use in our experiments.

Species      Version   Bases           non-N Bases     Release Date
Human        hg19      3,137,144,693   2,897,293,955   2009-02-27
Human        hg38      3,209,286,105   3,049,316,098   2013-12-24
C. elegans   ce1       100,264,180     100,264,085     2003-05-02
C. elegans   ce2       100,291,769     100,291,761     2004-03-01
C. elegans   ce4       100,281,244     100,281,244     2007-01-01
C. elegans   ce6       100,281,426     100,281,244     2008-05-01
C. elegans   ce10      100,286,070     100,286,070     2012-04-13
C. elegans   ce11      100,286,401     100,286,401     2013-02-07
Yeast        SacCer1   12,156,302      12,156,302      2001-10-01
Yeast        SacCer2   12,162,995      12,162,995      2008-06-01
Yeast        SacCer3   12,157,105      12,157,105      2014-12-17

Evaluated Read Data Sets.

We use DNA-seq data sets from four different samples (as shown in Table 3).

Table 3.

The read data sets that we use in our experiments can be accessed via NCBI using their accession numbers.

Data Set                    Accession    Details
Human NA12878 - Illumina    ERR194147    795,505,905 paired-end reads (101bps each, 50× coverage)
Human NA12878 - Illumina    ERR262997    643,097,275 paired-end reads (101bps each, 40× coverage)
C. elegans N2 - Illumina    SRR3536210   78,696,056 paired-end reads (101bps each, 150× coverage)
Yeast S288C - Illumina      ERR1938683   3,318,467 paired-end reads (150bps each, 82× coverage)

We first show our findings on how two reference genome versions relate to each other. Table 4 shows the region sizes (i.e., constant, updated, retired, new) that each pair of reference genomes is comprised of (as a percentage of all the regions combined). The values in parentheses show the percentage of reads out of the entire read set (mapped to the old reference genome) that fall in each region of the old reference genome (i.e., constant, updated, retired). We note that the closer the version numbers between the pair of references are to each other, 1) the larger the constant region is, and 2) the smaller the updated region is. This is intuitive as each reference genome version releases incremental changes to update missing and inaccurate sequences, so the similarity between subsequent releases would likely be higher than between releases further apart. We also observe, as expected, that the percentage of reads that map to a region in the reference is correlated with the region size (i.e., larger regions have a larger percentage of the read set mapped to that region). As the method for remapping a read depends on the type of region it is mapped to in the old reference, we can estimate the execution time of using AirLift on an entire read set with the proportions of reads that fall in different regions (in Table 4).
Since the most expensive method for remapping in AirLift (i.e., alignment) is employed only for reads that mapped to updated and retired regions of the old reference, we can expect, based on the small proportion of reads (e.g., between 0.36% and 13.68%) in the updated and retired regions of Table 4, a significant reduction in the mapping time.

We next plot the actual reduction (y-axis) we observe in the number of reads for the pairs of reference genomes (x-axis) that we examine in Figure 6. We make two observations. First, we observe that the reduction in the read set is significant, between 86% (for hg19→hg38) and 99.967% (for ce4→ce6). This is the main contributor to our performance improvement, as the number of alignments performed is significantly reduced. Second, we observe from the C. elegans samples that remapping a read set between subsequent genome versions (e.g., ce1→ce2, ce2→ce4, ce4→ce6) results in a significantly reduced read set. We conclude from this figure that we can exploit the high similarity between reference genome versions to significantly reduce the number of reads that we need to map for a comprehensive read set mapping.
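The 0.36% to 13.68% range quoted above follows from adding the read shares of the updated and retired regions in Table 4. A minimal sketch of that arithmetic for the two extreme pairs is shown below; the dictionary layout and the function name `non_constant_share` are hypothetical, and the values are copied from the parenthesized entries of Table 4.

```python
# Percentage of reads (Table 4, values in parentheses) that originally
# mapped to the updated and retired regions of the old reference.
READ_SHARE = {
    "hg19->hg38": {"updated": 13.64, "retired": 0.045},
    "ce6->ce10": {"updated": 0.36, "retired": 0.004},
}


def non_constant_share(pair):
    """Percentage of mapped reads that fall outside the constant region."""
    share = READ_SHARE[pair]
    return share["updated"] + share["retired"]
```

For hg19 to hg38 this gives roughly 13.68% of the mapped reads, and for ce6 to ce10 roughly 0.36%. Reads that never mapped to the old reference (case 4) are not part of these shares.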

Table 4. Reference Genome Regions.

             Remapping a read set
Species      From      To        Constant (%)    Updated (%)    Retired (%)    New (%)
Human        hg19      hg38      84.71 (86.31)   2.83 (13.64)   7.12 (0.045)   3.44
C. elegans   ce1       ce10      98.85 (99.39)   0.63 (0.61)    0.01 (0.004)   0.51
             ce2       ce10      98.95 (99.45)   0.58 (0.55)    0.01 (0.004)   0.46
             ce4       ce10      99.23 (99.60)   0.40 (0.40)    0.01 (0.004)   0.33
             ce6       ce10      99.31 (99.64)   0.36 (0.36)    0.01 (0.004)   0.32
             ce1       ce11      95.78 (97.90)   2.51 (2.09)    0.01 (0.012)   1.70
             ce2       ce11      95.87 (97.94)   2.47 (2.05)    0.01 (0.011)   1.65
             ce4       ce11      96.12 (98.13)   2.31 (1.85)    0.01 (0.012)   1.56
             ce6       ce11      96.16 (98.13)   2.27 (1.86)    0.01 (0.012)   1.54
             ce10      ce11      96.74 (98.48)   1.92 (1.51)    0.01 (0.006)   1.33
Yeast        SacCer1   SacCer2   97.17 (98.64)   1.70 (1.34)    0.11 (0.018)   1.02
             SacCer1   SacCer3   88.88 (93.87)   7.86 (6.11)    0.12 (0.024)   3.14
             SacCer2   SacCer3   90.51 (95.00)   6.53 (4.98)    0.15 (0.024)   2.81

We show, for our selected species' reference genomes, human (large), C. elegans (medium), and yeast (small), how versions of the reference genome (rows) comprise distinct regions (i.e., constant, updated, retired, new) in relation to a more recent version of the species' reference genome. Each cell contains the percentage of the reference genome pair that each region (column) comprises. The value in parentheses is the number of reads, as a percentage of the read set used for the species, that originally mapped to the region in the old reference genome. Note that 0% of the read set is mapped to the new region, since new regions do not exist in the old reference genome.

[Figure 6: bar chart. y-axis: Reads Remaining (%); x-axis: reference genome pairs, in three groups (large, medium, small).]

Fig. 6. AirLift read reduction results.

We show the percentage of reads (out of the original read set) that we need to align to the new reference genome in order to account for the reads that state-of-the-art remapping tools do not translate. The x-axis sweeps various pairings of reference genomes, where the naming convention is the old reference followed by the new reference. The y-axis is the percentage of reads that we must map to the new reference genome, and the specific values are written above each bar.

[Figure 7: bar charts. y-axis: AirLift Speedup; x-axis: reference genome pairs, in three plots (large, medium, small).]

Fig. 7. AirLift runtime results.

We show the execution time speedup (y-axis) of running AirLift to remap a read set to a new reference genome against the baseline of fully mapping the read set to the new reference genome. We plot the results for various pairs of reference genomes (x-axis) in three separate plots for different sizes of reference genomes (i.e., large, medium, small).

We next look at how using AirLift reduces the time to map a set of reads to an updated reference genome by reducing the number of reads that we must map. Figure 7 plots the speedup (y-axis) in execution time for mapping a read set to a new reference genome when using AirLift from an old reference genome, compared to fully mapping the entire read set to the new reference genome. The execution time of AirLift is calculated as follows:

$$T_{\text{AirLift}} = T_{\text{read extraction}} + T_{\text{map retired reads}} + T_{\text{map updated reads}} + T_{\text{lift constant reads}} \qquad (1)$$

where $T_{\text{read extraction}}$ is the time to extract the reads from the read set into subsets for each type of region they map to in the old reference genome, $T_{\text{map retired reads}}$ and $T_{\text{map updated reads}}$ are the times to map reads from the retired and updated regions to the new reference genome, and $T_{\text{lift constant reads}}$ is the time to directly shift coordinates from the old reference to the new reference based on the chain file that we generate. Note that we ignore the time to generate the LUTs, since they are generated once per pair of reference genomes and can be used to remap any number of reads. We then divide the time to map the full read set to the new reference genome by $T_{\text{AirLift}}$ to get the speedup that we plot. The x-axis sweeps the various reference genome pairs that we use AirLift with, where the naming convention is the old reference followed by the new reference. We make three observations. First, we find that the speedup AirLift provides is inversely proportional to the percentage of reads that mapped to the updated and retired regions of the old reference genome. As the percentage of reads in updated and retired regions is directly proportional to the sizes of the updated and retired regions (Table 4), which are generally correlated with the distance between the genome version numbers (within a species' reference genome), we can generally claim that AirLift will perform better on reference genomes whose versions are closer.
Second, when performing AirLift from the second most recent genome version to the most recent genome version across our selected species, AirLift provides 6.94×, 44.0×, and 16.4× speedup for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. Third, for a pair of references whose updated and retired regions are very small, such as ce1→ce2 ( ≈ . ≈ . × and 202× speedup, respectively. We conclude that AirLift can significantly improve the time to remap a read set from an old reference to a new reference compared to the baseline of fully mapping the read set to the new reference.

In this work, we propose AirLift, a technique for comprehensively mapping a read data set that had previously been mapped to an older reference genome to a newer reference genome. AirLift exploits the similarity across reference genome versions to generate maps that describe this similarity, and uses these maps to quickly identify the locations that reads should be directly translated to or mapped to. AirLift is the first tool that enables a comprehensive mapping of reads from an old reference genome to a new reference genome, as prior state-of-the-art tools only translate alignments between regions with high similarity. When compared against the baseline of fully mapping a read data set to the new reference genome, we find that AirLift can reduce the overall number of reads that need to be remapped from the original read set by up to 99% and reduce the overall runtime required to remap the reads from the old reference genome to the new reference genome by 6.94×, 44.0×, and 16.4× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We conclude that AirLift substantially reduces the overhead of remapping a read set to a new reference while still accounting for the significant changes in the new reference.
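As a minimal sketch of how Equation (1) and the reported speedup relate, the snippet below sums the four stage times and divides the full-mapping time by the result. All function names and timing values are hypothetical and are not measurements from this work. The last helper gives an idealized upper bound under an added assumption that is not stated in the text: mapping time scales linearly with the number of reads, and read extraction and coordinate lifting cost nothing.

```python
def airlift_time(t_read_extraction, t_map_retired, t_map_updated, t_lift_constant):
    """Equation (1): total AirLift time as the sum of its four stages."""
    return t_read_extraction + t_map_retired + t_map_updated + t_lift_constant


def speedup(t_full_mapping, t_airlift):
    """Speedup over the baseline of fully mapping the read set."""
    return t_full_mapping / t_airlift


def ideal_speedup_bound(realign_fraction):
    """Idealized bound: assumes mapping time is linear in read count and that
    extraction and lifting are free (an assumption, not a claim from the text)."""
    return 1.0 / realign_fraction


# Hypothetical stage timings in seconds, for illustration only.
t_airlift = airlift_time(
    t_read_extraction=120.0,
    t_map_retired=30.0,
    t_map_updated=600.0,
    t_lift_constant=250.0,
)
print(f"T_AirLift = {t_airlift} s")                    # 1000.0 s
print(f"speedup   = {speedup(7000.0, t_airlift)}x")    # 7.0x

# Bound for a pair where 13.685% of reads fall in updated or retired regions
# (the hg19 -> hg38 read percentages in Table 4).
print(f"ideal bound = {ideal_speedup_bound(0.13685):.2f}x")
```

Under this idealization the bound grows as the fraction of reads in updated and retired regions shrinks, which is the trend described in the first observation. Measured speedups are expected to be lower, because the extraction and lifting terms in Equation (1) are not zero.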
