[PDF] Accelerating Genome Analysis: A Primer on an Ongoing Journey

Abstract

Genome analysis fundamentally starts with a process known as read mapping, where sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than the computational techniques employed to analyze the genome. We describe the ongoing journey in significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.

Full PDF

AAccelerating Genome Analysis:A Primer on an Ongoing Journey

Mohammed Alser (cid:5)

Z ¨ulal Bing¨ol (cid:79)

Damla Senol Cali (cid:111)(cid:110)

Jeremie Kim (cid:5) (cid:111)(cid:110)

Saugata Ghose (cid:63) (cid:111)(cid:110)

Can Alkan (cid:79)

Onur Mutlu (cid:5) (cid:111)(cid:110) (cid:79) (cid:5)

ETH Z¨urich (cid:79)

Bilkent University (cid:111)(cid:110)

Carnegie Mellon University (cid:63)

University of Illinois at UrbanaChampaign

Abstract —Genome analysis fundamentally starts with a process known as read mapping , wheresequenced fragments of an organism’s genome are compared against a reference genome. Readmapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than thecomputational techniques employed to analyze the genome. We describe the ongoing journeyin signiﬁcantly improving the performance of read mapping. We explain state-of-the-art algorith-mic methods and hardware-based acceleration approaches. Algorithmic approaches exploit thestructure of the genome as well as the structure of the underlying hardware. Hardware-basedacceleration approaches exploit specialized microarchitectures or various execution paradigms(e.g., processing inside or near memory). We conclude with the challenges of adopting thesehardware-accelerated read mappers. G ENOME ANALYSIS is the foundation ofmany scientiﬁc and medical discoveries, andserves as a key enabler of personalized medicine.This analysis is currently limited by the inabilityof modern genome sequencing technologies toread an organism’s complete genome. Instead, se-quencing machines extract smaller random frag-ments of an organism’s DNA sequence, knownas reads . While the human genome contains overthree billion bases (i.e., A, C, G, T in DNA), thelength of a read is orders of magnitude smaller,ranging from a few hundred bases (for short reads) to a few million bases (for long reads).Computers are used to perform genome assembly ,which reassembles read fragments back into anentire genome sequence. Genome assembly is currently the bottleneck to quickly and accuratelydetermining an individual’s entire genome, due tothe complex algorithms and large datasets usedfor assembly.A widely-used approach for genome assemblyis to perform sequence alignment , which com-pares read fragments against a known referencegenome (i.e., a complete representative DNA se-quence for a particular species). A process knownas read mapping matches each read generatedfrom sequencing to one or more possible lo-cations within the reference genome, based onthe similarity between the read and the referencesequence segment at that location. Unfortunately,the bases in a read may not be identical to thebases in the reference genome at the location that

IEEE MICRO

Published by the IEEE Computer Society c (cid:13) This is an extended and updated version of a paper published inIEEE Micro, vol. 40, no. 5, pp. 65-75, 1 Sept.-Oct. 2020https:// doi.or g/ 10.1109/ MM.2020.3013728 a r X i v : . [ c s . A R ] S e p ohammed Alser et al. the read actually comes from. These differencesmay be due to (1) sequencing errors (up to 0.1%in short reads [1] and up to 20% in long reads [2],[3]) during extraction, and (2) genetic mutationsthat are speciﬁc to the individual organism’s DNAand may not exist in the reference genome. Dueto these potential differences, the similarity be-tween a read and a reference sequence segmentmust be identiﬁed using an approximate stringmatching (ASM) algorithm. The possible geneticdifferences between the reference genome andthe sequenced genome are then identiﬁed using genomic variant calling algorithms [4].The ASM performed during read mappingtypically uses a computationally-expensive dy-namic programming (DP) algorithm. This time-consuming algorithm has long been a major bot-tleneck in the entire genome analysis pipeline,accounting for over 70% of the execution timeof read mapping [5]. The vast majority of readmappers, such as the widely-used minimap2 [6],are implemented as software running on CPUs.We refer readers to a comprehensive survey [7]for a discussion of state-of-the-art CPU-basedread mappers. Accelerating ASM can help bridgethe wide performance gap between sequencingmachines and CPU-based read mapping algo-rithms, but faces four key challenges:1) Due to the large datasets that a read mapperoperates on, it generates a large amount of data movement between the CPU and mainmemory. The CPU accesses off-chip mainmemory through a pin-limited bus known asthe memory channel , and a high amount ofdata movement across the memory channelis extremely costly in terms of both execu-tion time and energy [8], [9].2) Modern sequencing machines generate readfragments at an exponentially higher ratethan prior sequencing technologies, withtheir growth far outpacing the growth incomputational power in recent years [10].For example, the Illumina NovaSeq 6000system can sequence about 48 human wholegenomes at 30 × genome coverage (the av-erage number of times a genomic base issequenced) in about two days. However,analyzing (performing mapping and variantcalling) the sequencing data of a single hu- man genome requires over 32 CPU hours ona 48-core Intel Xeon processor, 23 of whichare spent on read mapping [11].3) The ﬁrst two challenges worsen when a metagenomic sample is proﬁled, where thesample donor is unknown. This requiresmatching the extracted reads to thousands of reference genomes [12].4) There is also an urgent need for rapidlyincorporating clinical DNA sequencing andanalysis into clinical practice for rapidsurveillance of disease outbreaks (e.g.,COVID-19 [13]) and early diagnosis of ge-netic disorders in critically ill infants [14].Increasing the number of CPUs used forgenome analysis decreases the overall analysistime, but signiﬁcantly increases energy consump-tion and hardware costs. Cloud computing plat-forms are a potential alternative to distribute theworkload at a reasonable cost, but are disalloweddue to data protection guidelines in many coun-tries [15].As a result, there is a dire need for new com-putational techniques that can quickly process andanalyze a tremendous number of extracted readsin order to drive cutting-edge advances in the ge-netic applications space [16]. Many works boostthe performance of existing and new read map-pers using new algorithms, hardware/software co-design, and hardware accelerators. Our goal inthis work is to survey a prominent set of thesethree types of acceleration efforts for guidingthe design of new highly-efﬁcient read mappers. To this end , we (1) discuss various state-of-the-art mechanisms and techniques that improve theexecution time of read mapping using differentmodern high-performance computing architec-tures, and (2) highlight the challenges, in the lastsection, that system architects and programmersmust address to enable the widespread adoptionof hardware-accelerated read mappers.

Read Mapping

The main goal of read mapping is to locatepossible subsequences of the reference genomesequence that are similar to the read sequencewhile allowing at most E edits, where E is the edit distance threshold . Commonly allowed editsinclude deletion, insertion, and substitution ofcharacters in one or both sequences. IEEE MICRO dits can be as short as a single base pair(bp) alteration [17] or a much longer alteration(e.g., an insertion of about 600,000-base longregion [18]). Mapping billions of reads to the ref-erence genome is computationally expensive [16],[5], [19]. Therefore, most read mapping algo-rithms apply two key heuristic steps, indexing and ﬁltering , to reduce the number of referencegenome segments that need to be compared witheach read. The three steps of read mapping areshown in Figure 1a. First, a read mapper indexesthe reference genome by using substrings (called seeds ) from each read to quickly identify allpotential mapping locations of each read in thereference genome. Second, the mapper uses ﬁlter-ing heuristics to examine the similarity for everysequence pair (a read sequence and one potentialmatching segment in the reference genome iden-tiﬁed during indexing). These ﬁltering heuristicsaim to eliminate most of the dissimilar sequencepairs. Third, the mapper performs sequence align- ment (using ASM) to check whether or not theremaining sequence pairs that are identiﬁed byﬁltering to be similar are actually similar. Thealignment step examines all possible preﬁxes oftwo sequences and tracks the preﬁxes that pro-vide the highest possible alignment score (knownas optimal alignment). The alignment score isa quantitative representation of the quality ofan alignment for a given user-deﬁned scoringfunction (computed based on the number of editsand/or matches).Alignment algorithms typically use DP-basedapproaches to avoid re-examining the same pre-ﬁxes many times. These DP-based algorithmsprovide the most accurate alignment results com-pared to other non-DP algorithms (such as thealgorithm used in HISAT2 [20]), but they havequadratic time and space complexity (i.e., O ( m )for a sequence length of m ). Sequence alignmentcalculates information about the alignment suchas the alignment score, edit distance , and the Read Mapping

Genomic Variants k-mer content k-mer locationsk-mers Index k-mersRead

Indexing Pre-alignment Filtering Sequence Alignment

Locating common k-mers ...

Reference Genome

Reference subsequences extracted at each common k-mer location SAM file (alignment score, edit distance, type and location of each edit) q-gram filtering Pigeonhole principleReducing the number of seedsReducing data movement during indexing Accurate alignment acceleratorsHeuristic-based alignment accelerators

Dynamic Programming (DP) Matrix

1, 4, 63, 5, 1250, 522, 10017, 1923, 90 1, 4, 63, 5, 1217, 19

Read R e f e r en c e s ub s equen c e Genomic Sample Reads

Read Mapping F il t e r Accelerating Indexing (a)(b)

Sequencing Machine

Accelerating Pre-alignment Filtering Accelerating Alignment

Genome Analysis Pipeline ......... ... F il t e r F il t e r N ... Base countingSparse DP

Output

Figure 1. (a) The three steps of read mapping in genome analysis: (1) indexing, (2) pre-alignment ﬁltering, and(3) sequence alignment. (b) Overview of the existing approaches to accelerating each step of read mapping.

March 2020 ohammed Alser et al. type of each edit. Edit distance is deﬁned as theminimum number of changes needed to converta sequence into the other sequence [21]. Suchinformation is typically output by read mappinginto a sequence alignment/map (SAM) ﬁle. Giventhe time spent on read mapping, all three stepshave been targeted for acceleration. Figure 1bsummarizes the different acceleration approaches,and we discuss a set of such works in the follow-ing sections. Accelerating Indexing

The indexing operation generates a table thatis indexed by the contents of a seed, and identiﬁesall locations where the seed exists in the referencegenome. Indexing needs to be done only once fora reference genome, and eliminates the need toperform ASM across the entire genome. Duringread mapping, a seed from a read is looked upin the table, and only the corresponding locationsare used for ASM (as only they can match theentire read). The major challenge with indexingis choosing the appropriate length and numberof to-be-indexed seeds, as they can signiﬁcantlyimpact the memory footprint and overall perfor-mance of read mapping [6]. Querying short seedspotentially leads to a large number of mappinglocations that need to be checked for a stringmatch. The use of long reads requires extractingfrom each read a large number of seeds, as thesequencing error rate is much higher in longreads. This affects (1) the number of times wequery the index structure and (2) the number ofretrieved mapping locations. Thus, there are twokey approaches used for accelerating the indexingstep (Figure 1b).

Reducing the Number of Seeds

Read mapping algorithms (e.g., minimap2 [6])typically reduce the number of seeds that arestored in the index structure by ﬁnding the min-imum representative set of seeds (called mini-mizers ) from a group of adjacent seeds withina genomic region. The representative set can becalculated by imposing an ordering (e.g. lexi-cographically or by hash value) on a group ofadjacent seeds and storing only the seed with thesmallest order. Read mappers also apply heuris-tics to avoid examining the mapping locations ofa seed that occur more times than a user-deﬁned threshold value [6]. Various data structures havebeen proposed and implemented to both reducethe storage cost of the indexing data structure andimprove the algorithmic runtime of identifyingthe mapping locations within the indexing datastructure. One example of such data structures isFM-index (implemented by Langarita et al. [22]),which provides a compressed representation ofthe full-text index, while allowing for queryingthe index without the need for decompression.This approach has two main advantages. 1) Wecan query seeds of arbitrary lengths, which helpsto reduce the number of queried seeds. 2) Ittypically has less (by 1.5 – 2 × ) memory foot-print compared to that of the indexing step ofminimap2 [6]. However, one major bottleneck ofFM-indexes is that locating the exact matchesby querying the FM-index is signiﬁcantly slowerthan that of classical indexes [23], [22]. BWA-MEM2 [23] proposes an uncompressed version ofthe FM-index that is at least 10 × larger than thecompressed FM-index to speed up the queryingstep by 2 × . Reducing Data Movement During Indexing

RADAR [24] observes that the indexing stepis memory intensive, because the large numberof random memory accesses dominates com-putation. The authors propose a processing-in-memory (PIM) architecture that stores the entireindex inside the memory and enables querying thesame index concurrently using a large number ofASIC compute units. The amount of data move-ment is reduced from tens of gigabytes to a fewbytes for a single query task, allowing RADAR tobalance memory accesses with computation, andthus provide speedups and energy savings.

Accelerating Pre-Alignment Filtering

After ﬁnding one or more potential mappinglocations of the read in the reference genome,the read mapper checks the similarity betweeneach read and each segment extracted at thesemapping locations in the reference genome. Thesesegments can be similar or dissimilar to the read,though they share common seeds. To avoid exam-ining dissimilar sequences using computationally-expensive sequence alignment algorithms, readmappers typically use ﬁltering heuristics that arecalled pre-alignment ﬁlters . IEEE MICRO he key idea of pre-alignment ﬁltering is toquickly estimate the number of edits between twogiven sequences and use this estimation to decidewhether or not the computationally-expensiveDP-based alignment calculation is needed — ifnot, a signiﬁcant amount of time is saved byavoiding DP-based alignment. If two genomicsequences differ by more than the edit distancethreshold, then the two sequences are identiﬁedas dissimilar sequences and hence DP calculationis not needed. In practice, only genomic sequencepairs with an edit distance less than or equal toa user-deﬁned threshold (i.e., E ) provide usefuldata for most genomic studies [25], [5], [7]. Pre-alignment ﬁlters use one of four major approachesto quickly ﬁlter out the dissimilar sequence pairs:(1) the pigeonhole principle, (2) base counting,(3) q -gram ﬁltering, or (4) sparse DP. Long readmappers typically use q -gram ﬁltering or sparseDP, as their performance scales linearly with readlength and independently of the edit distance. Pigeonhole Principle

The pigeonhole principle states that if E items are put into E +1 boxes, then one or moreboxes would be empty. This principle can beapplied to detect dissimilar sequences and discardthem from the candidate sequence pairs used forASM. If two sequences differ by E edits, thenthey should share at least a single subsequence(free of edits) among E +1 non-overlapping sub-sequences [5], where E is the edit distance thresh-old. For a read of length m , if there are no morethan E edits between the read and the referencesegment, then the read and reference segment areconsidered similar if they share at most E +1 non-overlapping subsequences, with a total length ofat least m – E .The problem of identifying these E +1 non-overlapping subsequences is highly parallelizable,as these subsequences are independent of eachother. Shouji [5] exploits the pigeonhole principleto reduce the search space and provide a scalablearchitecture that can be implemented for anyvalues of m and E , by examining common subse-quences independently and rapidly with high par-allelism. Shouji accelerates sequence alignmentby 4.2-18.8 × without affecting the alignmentaccuracy. We refer the reader to the sidebar for abrief discussion of several other related works. Sidebar: Related Works on Pre-Alignment FilteringUsing the Pigeonhole Principle

Pigeonhole-ﬁltering-based pre-alignment ﬁltering can ac-celerate read mappers even without specialized hardware.For example, the Adjacency Filter [1] accelerates sequencealignment by up to 19 × . The accuracy and speed of pre-alignment ﬁltering with the pigeonhole principle have beenrapidly improved over the last seven years. Shifted HammingDistance (SHD) [2] uses SIMD-capable CPUs to providehigh ﬁltering speed, but supports a sequence length up toonly 128 base pairs due to the SIMD register widths. Gate-Keeper [3] utilizes the large amounts of parallelism offeredby FPGA architectures to accelerate SHD and overcomesuch sequence length limitations. MAGNET [4] provides acomprehensive analysis of all sources of ﬁltering inaccuracyof GateKeeper and SHD. Shouji [5] leverages this analysisto improve the accuracy of pre-alignment ﬁltering by up totwo orders of magnitude compared to both GateKeeper andSHD, using a new algorithm and a new FPGA architecture.SneakySnake [6] achieves up to four orders of magnitudehigher ﬁltering accuracy compared to GateKeeper and SHDby mapping the pre-alignment ﬁltering problem to the singlenet routing (SNR) problem in VLSI chip layout. SNR ﬁndsthe shortest routing path that interconnects two terminalson the boundaries of a VLSI chip layout in the presenceof obstacles. SneakySnake is the only pre-alignment ﬁlterthat works on CPUs, GPUs, and FPGAs. GenCache [7]proposes to perform highly-parallel pre-alignment ﬁlteringinside the CPU cache to reduce data movement and improveenergy efﬁciency, with about 20% cache area overhead.GenCache shows that using different existing pre-alignmentﬁlters together (similar approach to [26]), each of whichoperates only for a given edit distance threshold (e.g., usingSHD only when E is between 1 and 5), provides a 2.5 × speedup over GenCache with a single pre-alignment ﬁlter. (cid:4) REFERENCES

1. Hongyi Xin et al. Accelerating Read Mapping with FastHASH.

BMCGenomics , 2013.2. Hongyi Xin et al. Shifted Hamming Distance: A Fast and AccurateSIMD-Friendly Filter to Accelerate Alignment Veriﬁcation in ReadMapping.

Bioinformatics , 2015.3. Mohammed Alser et al. GateKeeper: A New Hardware Architecturefor Accelerating Pre-Alignment in DNA Short Read Mapping.

Bioin-formatics , 2017.4. Mohammed Alser et al. MAGNET: Understanding and Improvingthe Accuracy of Genome Pre-Alignment Filtering.

Transactions onInternet Research , 2017.5. Mohammed Alser et al. Shouji: A Fast and Efﬁcient Pre-AlignmentFilter for Sequence Alignment.

Bioinformatics , 2019.6. Mohammed Alser et al. SneakySnake: A Fast and Accurate Uni-versal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs.arXiv:1910.09020 [q-bio.GN], 2019.7. Anirban Nag et al. GenCache: Leveraging In-Cache Operators forEfﬁcient Sequence Alignment. In

MICRO , 2019.

March 2020 ohammed Alser et al. Base Counting

The base counting ﬁlter compares the num-bers of bases (A, C, G, T) in the read with the cor-responding base counts in the reference segment.If one sequence has, for example, three more Tsthan another sequence, then their alignment hasat least three edits. If the difference in countis greater than E , then the two sequences aredissimilar and the reference segment is discarded.The base counting ﬁlter is used in mrsFAST-Ultra [27] and GASSST [26]. Such a simpleﬁltering approach rejects a signiﬁcant fractionof dissimilar sequences (e.g., 49.8%–80.4% ofsequences, as shown in GASSST [26]) and thusavoids a large fraction of expensive veriﬁcationcomputations required by sequence alignment al-gorithms. q -gram Filtering Approach The q -gram ﬁltering approach considers all ofthe sequence’s possible overlapping substrings oflength q (known as q -grams). Given a sequence oflength m , there are m – q + 1 overlapping q -gramsthat are obtained by sliding a window of length q over the sequence. A single difference in oneof the sequences can affect at most q overlapping q -grams. Thus, E differences can affect no morethan q · E q -grams, where E is the edit distancethreshold. The minimum number of shared q -grams between two similar sequences is therefore( m – q + 1) – ( q · E ). This ﬁltering approach requiresvery simple operations (e.g., sums and compar-isons), which makes it attractive for hardwareacceleration, such as in GRIM-Filter [25]. GRIM-Filter exploits the high memory bandwidth andcomputation capability in the logic layer of 3D-stacked memory to accelerate q -gram ﬁltering inthe DRAM chip itself, using a new representationof reference genome that is friendly to in-memoryprocessing. q -gram ﬁltering is generally robustin handling only a small number of edits, asthe presence of edits in any q -gram is signiﬁ-cantly underestimated (e.g., counted as a singleedit) [28]. Sparse Dynamic Programming

Sparse DP algorithms exploit the exactmatches (seeds) shared between a read and a ref-erence segment to reduce execution time. Thesealgorithms exclude the corresponding locations of these seeds from estimating the number ofedits between the two sequences, as they werealready detected as exact matches during index-ing. Sparse DP ﬁltering techniques apply DP-based alignment algorithms only between everytwo non-overlapping seeds to quickly estimatethe total number of edits. This approach is alsoknown as chaining , and is used in minimap2 [6]and rHAT [29]. The recent work in [30] presentsGPU and FPGA accelerators that achieve 7 × and 28 × acceleration, respectively, compared tothe sequential implementation (executed with 14CPU threads) of the chaining algorithm used inminimap2. Accelerating Sequence Alignment

After ﬁltering out most of the mapping lo-cations that lead to dissimilar sequence pairs,read mapping calculates the sequence alignmentinformation for every read and reference segmentextracted at each mapping location. Sequencealignment calculation is typically accelerated us-ing one of two approaches: (1) accelerating theDP-based algorithms using hardware acceleratorswithout altering algorithmic behavior, and (2) de-veloping heuristics that sacriﬁce the optimality ofthe alignment score solution in order to reducealignment time.Despite more than three decades of attemptsto accelerate sequence alignment, the fastestknown edit distance algorithm [31] has a nearlyquadratic running time, O ( m / log m ) for a se-quence length of m , which is proven to be a tightbound [32]. With maintaining the algorithmicbehavior as in the ﬁrst approach, it is challengingto rapidly calculate sequence alignment of longreads with high parallelism. As long reads havehigh sequencing error rates (up to 20% of theread length), the edit distance threshold for longreads is typically higher than that for short reads,which results in calculating more entries in theDP matrix compared to that of short reads. Theuse of heuristics (i.e., the second approach) helpsto reduce the number of calculated entries in theDP matrix and hence allows both the executiontime and memory footprint to grow only linearlywith read length (as opposed to quadraticallywith classical DP). Next, we describe the twoapproaches in detail. IEEE MICRO ccurate Alignment Accelerators

From a hardware perspective, sequence align-ment acceleration has ﬁve directions: (1) us-ing SIMD-capable CPUs, (2) using multicoreCPUs and GPUs, (3) using FPGAs, (4) usingASICs, and (5) using processing-in-memory ar-chitectures. Traditional DP-based algorithms aretypically accelerated by computing only the nec-essary regions (i.e., diagonal vectors) of the DPmatrix rather than the entire matrix, as proposedin Ukkonen’s banded algorithm [33]. This re-duces the search space of the DP-based algorithmand reduces computation time. The number ofdiagonal bands required for computing the DPmatrix is 2 E +1, where E is the edit distancethreshold. For example, the number of entries inthe banded DP matrix for a 2 Mb long read canbe 1.2 trillion. Parasail [34] and KSW2 (usedin minimap2 [6]) exploit both Ukkonen’s bandedalgorithm and SIMD-capable CPUs to computebanded alignment for a sequence pair with aconﬁgurable scoring function. SIMD instructionsoffer signiﬁcant parallelism to the matrix compu-tation by executing the same vector operation onmultiple operands at once. KSW2 is nearly as fastas Parasail when KSW2 does not use heuristics(explained in the next subsection).The multicore architecture of CPUs andGPUs provides the ability to compute align-ments of many independent sequence pairs con-currently. GASAL2 [35] exploits the multi-core architecture of both CPUs and GPUs forhighly-parallel computation of sequence align-ment with a user-deﬁned scoring function. Unlikeother GPU-accelerated tools, GASAL2 transfersthe bases to the GPU, without encoding theminto binary format, and hides the data trans-fer time by overlapping GPU and CPU execu-tion. GASAL2 is up to 20 × faster than Parasail(when executed with 56 CPU threads). BWA-MEM2 [23] accelerates the banded sequencealignment of its predecessor (BWA-MEM [36])by up to 11.6 × , by leveraging multicoreand SIMD parallelism. However, to achievesuch levels of acceleration, BWA-MEM2 buildsan index structure that is 6 × larger than that ofminimap2.Other designs, such as FPGASW [37], exploitthe very large number of hardware execution unitsin FPGAs to form a linear systolic array [38]. Each execution unit in the systolic array is re-sponsible for computing the value of a singleentry of the DP matrix. The systolic array com-putes a single vector of the matrix at a time. Thedata dependency between the entries restricts thesystolic array to computing the vectors sequen-tially (e.g., top-to-bottom, left-to-right, or in ananti-diagonal manner). FPGASW has a similarexecution time as its GPU implementation, butis 4 × more power efﬁcient.Specialized hardware accelerators (i.e., ASICdesigns) provide application-speciﬁc, power- andarea-efﬁcient solutions to accelerate sequencealignment. For example, GenAx [39] is composedof SillaX, a sequence alignment accelerator, anda second accelerator for ﬁnding seeds. SillaXsupports both a conﬁgurable scoring function andtraceback operations. SillaX is more efﬁcient forshort reads than for long reads, as it consistsof an automata processor whose performancescales quadratically with the edit distance. GenAxis 31.7 × faster than the predecessor of BWA-MEM2 (i.e., BWA-MEM [36]) for short reads.Recent processing-in-memory architecturessuch as RAPID [40] exploit the ability to performcomputation inside or near the memory chipto enable efﬁcient sequence alignment. RAPIDmodiﬁes the DP-based alignment algorithm tomake it friendly to in-memory parallel compu-tation by calculating two DP matrices: one forcalculating substitutions and exact matches andanother for calculating insertions and deletions.RAPID claims that this approach efﬁciently en-ables higher levels of parallelism compared totraditional DP algorithms. The main two ben-eﬁts of RAPID and such PIM-based architec-tures are higher performance and higher energyefﬁciency [8], [9], as they alleviate the needto transfer data between the main memory andthe CPU cores through slow and energy hungrybuses, while providing high degree of parallelismwith the help of PIM. RAPID is on average 11.8 × faster and 212.7 × more power efﬁcient than 384-GPU cluster of GPU implementation of sequencealignment, known as CUDAlign [41]. Heuristic-Based Alignment Accelerators

The second direction is to limit the func-tionality of the alignment algorithm or sacriﬁce the optimality of the alignment solution in order

March 2020 ohammed Alser et al. to reduce execution time. The use of restrictivefunctionality and heuristics limits the possibleapplications of the algorithms that utilize thisdirection. Examples of limiting functionality in-clude limiting the scoring function, or only takinginto account accelerating the computation of theDP matrix without performing the backtrack-ing step [42]. There are several existing algo-rithms and corresponding hardware acceleratorsthat limit scoring function ﬂexibility. Levenshteindistance [21] and Myers’s bit-vector algorithm[43] are examples of algorithms whose scoringfunctions are ﬁxed, such that they penalize alltypes of edits equally when calculating the to-tal alignment score. Restrictive scoring functionsreduce the total execution time of the alignmentalgorithm and reduce the bit-width requirement ofthe register that accommodates the value of eachentry in the DP matrix. ASAP [44] acceleratesLevenshtein distance calculation by up to 63.3 × using FPGAs compared to its CPU implementa-tion. The use of a ﬁxed scoring function as inEdlib [45], which is the state-of-the-art imple-mentation of Myerss bit-vector algorithm, helpsto outperform Parasail (which uses a ﬂexiblescoring function) by 12–1000 × . One downside ofﬁxed function scoring is that it may lead to theselection of a suboptimal sequence alignment.There are other algorithms and hardware ar-chitectures that provide low alignment time bytrading off accuracy. Darwin [16] builds a cus-tomized hardware architecture to speed up thealignment process, by dividing the DP matrix intooverlapping submatrices and processing each sub-matrix independently using systolic arrays. Dar-win provides three orders of magnitude speedupcompared to Edlib [45]. Dividing the DP ma-trix (known as the Four-Russians Method [46])enables signiﬁcant parallelism during DP matrixcomputation, but it leads to suboptimal alignmentcalculation [26]. Darwin claims that choosing alarge submatrix size ( ≥ × ≥

128 entries) between adja-cent submatrices may provide optimal alignmentcalculation for some datasets.There are other proposals that limit the num-ber of calculated entries of the DP matrix basedon one of two approaches: (1) using sparseDP or (2) using a greedy approach to main-tain a high alignment score. Both approaches suffer from providing suboptimal alignment cal-culation [47], [48]. The ﬁrst approach uses thesame sparse DP algorithm used for pre-alignmentﬁltering but as an alignment step, as done inthe exonerate tool [47]. The second approachis employed in X -drop [48], which (1) avoidscalculating entries (and their neighbors) whosealignment scores are more than X below thehighest score seen so far (where X is a user-speciﬁed parameter), and (2) stops early whena high alignment score is not possible. The X -drop algorithm is guaranteed to ﬁnd the optimalalignment between relatively-similar sequencesfor only some scoring functions [48]. A similaralgorithm (known as Z -drop) makes KSW2 atleast 2.6 × faster than Parasail. A recent GPUimplementation [49] of the X -Drop algorithm is3.1–120.4 × faster than KSW2.There are also a large number of edit dis-tance approximation algorithms that provide areduction in time complexity (e.g., O ( m )instead of O ( m )), but they suffer from providingoverestimated edit distance [50], [51], [52], [53]. Discussion and Future Opportunities

Despite more than two decades of attempts,bridging the performance gap between sequenc-ing machines and read mapping is still challeng-ing. We summarize four main challenges below.First, we need to accelerate the entire readmapping process rather than its individual steps.Accelerating only a single step of read map-ping limits the overall achieved speedup accord-ing to Amdahl’s Law. Illumina and NVIDIAhave recently started following a more holisticapproach, and they claim to accelerate genomeanalysis by more than 48 × × coverage in about 45 minutes. IEEE MICRO econd, we need to reduce the high amountof data movement that takes place during genomeanalysis. Moving data (1) between compute unitsand main memory, (2) between multiple hard-ware accelerators, and (3) between the sequenc-ing machine and the computer performing theanalysis incurs high costs in terms of executiontime and energy. These costs are a signiﬁcantbarrier to enabling efﬁcient analysis that cankeep up with sequencing technologies, and somerecent works try to tackle this problem [25],[8], [9]. GenASM [19] is a framework that usesbitvector-based ASM to accelerate multiple stepsof the genome analysis pipeline, and is designedto be implemented inside 3D-stacked memory.Through a combination of hardware–software co-design to unlock parallelism, and processing-in-memory to reduce data movement, GenASMcan perform (1) pre-alignment ﬁltering for shortreads, (2) sequence alignment for both short andlong reads, and (3) whole genome alignment,among other use cases. For short/long read align-ment, GenASM achieves 111 × /116 × speedupover state-of-the-art software read mappers whilereducing power consumption by 33 × /37 × . DRA-GEN reduces data movement between the se-quencing machine and the computer performinganalysis by adding specialized hardware supportinside the sequencing machine for data compres-sion. However, this still requires movement ofcompressed data. Performing read mapping insidethe sequencing machine itself can signiﬁcantlyimprove efﬁciency by eliminating sequencer-to-computer movement, and embedding a single spe-cialized chip for read mapping within a portablesequencing device can potentially enable newapplications of genome sequencing (e.g., rapidsurveillance of diseases such as Ebola [54] andCOVID-19 [13], near-patient testing, bringingprecision medicine to remote locations). Unfor-tunately, efforts in this direction remain verylimited.Third, we need to develop ﬂexible hardwarearchitectures that do not conservatively limit therange of supported parameter values at designtime. Commonly-used read mappers (e.g., min-imap2) have different input parameters, each ofwhich has a wide range of input values. Forexample, the edit distance threshold is typically user deﬁned and can be very high (15-20% of theread length) for recent long reads. A conﬁgurablescoring function is another example, as it deter-mines the number of bits needed to store eachentry of the DP matrix (e.g., DRAGEN imposesa restriction on the maximum frequency of seedoccurrence). Due to rapid changes in sequencingtechnologies (e.g., high sequencing error rate andlonger read lengths) [55], [56], these design re-strictions can quickly make specialized hardwareobsolete. Thus, read mappers need to adapt theiralgorithms and their hardware architectures to bemodular and scalable so that they can be imple-mented for any sequence length and edit distancethreshold based on the sequencing technology.Fourth, we need to adapt existing genomicdata formats for hardware accelerators or developmore efﬁcient ﬁle formats. Most sequencingdata is stored in the FASTQ/FASTA format,where each base takes a single byte (8 bits)of memory. This encoding is inefﬁcient, asonly 2 bits (3 bits when the ambiguous base,N, is included) are needed to encode eachDNA base. The sequencing machine convertssequenced bases into FASTQ/FASTA format,and hardware accelerators convert the ﬁlecontents into unique (for each accelerator)compact binary representations for efﬁcientprocessing. This process that requires multipleformat conversions wastes time. For example,only 43% of the sequence alignment timein BWA-MEM2 [23] is spent on calculatingthe DP matrix, while 33% of the sequencealignment time is spent on pre-processing theinput sequences for loading into SIMD registers,as provided in [23]. To address this inefﬁciency,we need to widely adopt efﬁcient hardware-friendly formats, such as UCSC’s 2bit format(https://genome.ucsc.edu/goldenPath/help/twoBit),to maximize the beneﬁts of hardware acceleratorsand reduce resource utilization. We are not awareof any recent read mapper that uses such formats.The acceleration efforts we highlight in thiswork represent state-of-the-art efforts to re-duce current bottlenecks in the genome analysispipeline. We hope that these efforts and the chal-lenges we discuss provide a foundation for futurework in accelerating read mappers and developingother genome sequence analysis tools. March 2020 ohammed Alser et al. Acknowledgments

The work of Onur Mutlu’s SAFARI ResearchGroup was supported by funding from Intel, theSemiconductor Research Corporation, VMware,and the National Institutes of Health (NIH).

REFERENCES

1. Travis C Glenn. Field Guide to Next-Generation DNASequencers.

Molecular Ecology Resources , 2011.2. Simon Ardui, Adam Ameur, Joris R Vermeesch, andMatthew S Hestand. Single Molecule Real-Time (SMRT)Sequencing Comes of Age: Applications and Utilities forMedical Diagnostics.

Nucleic Acids Research , 2018.3. Miten Jain, Sergey Koren, Karen H Miga, Josh Quick,Arthur C Rand, Thomas A Sasani, John R Tyson,Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes,et al. Nanopore sequencing and assembly of a humangenome with ultra-long reads.

Nature biotechnology ,2018.4. Steve S Ho, Alexander E Urban, and Ryan E Mills. Struc-tural variation in the sequencing era.

Nature ReviewsGenetics , pages 1–19, 2019.5. Mohammed Alser, Hasan Hassan, Akash Kumar, OnurMutlu, and Can Alkan. Shouji: A Fast and Efﬁcient Pre-Alignment Filter for Sequence Alignment.

Bioinformatics ,2019.6. Heng Li. minimap2: Pairwise Alignment for NucleotideSequences.

Bioinformatics , 2018.7. Mohammed Alser, Jeremy Rotman, Kodi Taraszka,Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang,Victor Xue, Sergey Knyazev, Benjamin D. Singer,Brunilda Balliu, David Koslicki, Pavel Skums, Alex Ze-likovsky, Can Alkan, Onur Mutlu, and Serghei Mangul.Technology Dictates Algorithms: Recent Developmentsin Read Alignment. arXiv:2003.00110 [q-bio.GN], 2020.8. Onur Mutlu, Saugata Ghose, Juan G´omez-Luna, andRachata Ausavarungnirun. Processing data where itmakes sense: Enabling in-memory computation.

Micro-processors and Microsystems , 67:28–41, 2019.9. Saugata Ghose, Amirali Boroumand, Jeremie S Kim,Juan G´omez-Luna, and Onur Mutlu. Processing-in-memory: A workload-driven perspective.

IBM Journal ofResearch and Development , 63(6):3–1, 2019.10. Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri,Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Rav-ishankar Iyer, Michael C. Schatz, Saurabh Sinha, andGene E. Robinson. Big Data: Astronomical or Genom-ical?

PLoS Biology , 2015.11. Amit Goyal, Hyuk Jung Kwon, Kichan Lee, ReenaGarg, Seon Young Yun, Yoon Hee Kim, Sunghoon Lee, and Min Seob Lee. Ultra-Fast Next GenerationHuman Genome Sequencing Data Processing UsingDRAGEN R (cid:13) Bio-IT Processor for Precision Medicine.

Open Journal of Genetics , 2017.12. Nathan LaPierre, Mohammed Alser, Eleazar Eskin,David Koslicki, and Serghei Mangul. Metalign: Efﬁcientalignment-based metagenomic proﬁling via containmentmin hash.

BioRxiv , 2020.13. Joshua S Bloom, Eric M Jones, Molly Gasperini,Nathan B Lubock, Laila Sathe, Chetan Munugala,A Sina Booeshaghi, Oliver F Brandenberg, LonghuaGuo, James Boocock, et al. Swab-seq: A high-throughput platform for massively scaled up sars-cov-2testing. medRxiv , 2020.14. Jan M Friedman, Yvonne Bombard, Martina C Cornel,Conrad V Fernandez, Anne K Junker, Sharon E Plon,Zornitza Stark, and Bartha Maria Knoppers. Genome-wide sequencing in acutely ill infants: genomic medicinescritical application?

Genetics in Medicine , 21(2):498–504, 2019.15. Ben Langmead and Abhinav Nellore. Cloud Computingfor Genomic Data Analysis and Collaboration.

NatureReviews Genetics , 2018.16. Yatish Turakhia, Gill Bejerano, and William J. Dally. Dar-win: A Genomics Co-Processor Provides Up to 15,000XAcceleration on Long Read Assembly. In

ASPLOS , 2018.17. Rasmus Nielsen, Joshua S Paul, Anders Albrechtsen,and Yun S Song. Genotype and SNP calling from next-generation sequencing data.

Nature Reviews Genetics ,12(6):443–451, 2011.18. S´ebastien Jacquemont, Alexandre Reymond, Flore Zuf-ferey, Louise Harewood, Robin G Walters, Zolt´an Ku-talik, Danielle Martinet, Yiping Shen, Armand Valsesia,Noam D Beckmann, et al. Mirror extreme BMI pheno-types associated with gene dosage at the chromosome16p11. 2 locus.

Nature , 478(7367):97–102, 2011.19. Damla Senol Cali, Gurpreet S. Kalsi, Z¨ulal Bing¨ol,Lavanya Subramanian, Can Firtina, Jeremie S. Kim,Rachata Ausavarungnirun, Mohammed Alser, AnantNori, Juan G´omez Luna, Amirali Boroumand, AllisonScibisz, Sreenivas Subramoney, Can Alkan, SaugataGhose, and Onur Mutlu. GenASM: A Low-Power,Memory-Efﬁcient Approximate String Matching Accelera-tion Framework for Genome Sequence Analysis. In

Proc.53rd Int. Symp. Microarchitecture (MICRO) , 2020.20. Daehwan Kim, Joseph M Paggi, Chanhee Park,Christopher Bennett, and Steven L Salzberg. Graph-based genome alignment and genotyping with HISAT2and HISAT-genotype.

Nature biotechnology , 37(8):907–915, 2019. IEEE MICRO

1. Vladimir I Levenshtein. Binary codes capable of cor-recting deletions, insertions, and reversals. In

SovietPhysics-Doklady , volume 10, pages 707–710, 1966.22. Ruben Langarita, Adria Armejach, Javier Setoain,Pablo Enrique Ibanez Marin, Jes´us Alastruey-Bened´e,and Miquel Moreto Planas. Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps.

IEEE/ACM Transactions on Computational Biology andBioinformatics , 2020.23. Md Vasimuddin, Sanchit Misra, Heng Li, and SrinivasAluru. Efﬁcient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. In

IPDPS , 2019.24. Wenqin Huangfu, Shuangchen Li, Xing Hu, and YuanXie. RADAR: A 3D-ReRAM Based DNA AlignmentAccelerator Architecture. In

DAC , 2018.25. Jeremie S. Kim, Damla Senol Cali, Hongyi Xin,Donghyuk Lee, Saugata Ghose, Mohammed Alser,Hasan Hassan, Oguz Ergin, Can Alkan, and OnurMutlu. GRIM-Filter: Fast Seed Location Filtering in DNARead Mapping Using Processing-in-Memory Technolo-gies.

BMC Genomics , 2018.26. Guillaume Rizk and Dominique Lavenier. GASSST:Global Alignment Short Sequence Search Tool.

Bioin-formatics , 2010.27. Faraz Hach, Iman Sarraﬁ, Farhad Hormozdiari, CanAlkan, Evan E Eichler, and S Cenk Sahinalp. mrsFAST-Ultra: a compact, SNP-aware mapper for high perfor-mance sequencing applications.

Nucleic acids research ,42(W1):W494–W500, 2014.28. David Weese, Manuel Holtgrewe, and Knut Reinert.RazerS 3: faster, fully sensitive read mapping.

Bioinfor-matics , 28(20):2592–2599, 2012.29. Bo Liu, Dengfeng Guan, Mingxiang Teng, and YadongWang. rHAT: fast alignment of noisy long reads with re-gional hashing.

Bioinformatics , 32(11):1625–1631, 2016.30. Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei,and Jason Cong. Hardware acceleration of long readpairwise overlapping in genome sequencing: A race be-tween FPGA and GPU. In , pages 127–135. IEEE,2019.31. William J Masek and Michael S Paterson. A fasteralgorithm computing string edit distances.

Journal ofComputer and System sciences , 20(1):18–31, 1980.32. Arturs Backurs and Piotr Indyk. Edit distance cannot becomputed in strongly subquadratic time (unless SETH isfalse).

SIAM Journal on Computing , 47(3):1087–1097,2018.33. Esko Ukkonen. Algorithms for approximate string matching.

Information and control , 64(1-3):100–118,1985.34. Jeff Daily. Parasail: SIMD C Library for Global, Semi-Global, and Local Pairwise Sequence Alignments.

BMCBioinformatics , 2016.35. Nauman Ahmed, Jonathan L´evy, Shanshan Ren,Hamid Mushtaq, Koen Bertels, and Zaid Al-Ars.GASAL2: A GPU Accelerated Sequence Alignment lLi-brary for High-Throughput NGS Data.

BMC Bioinformat-ics , 2019.36. Heng Li. Aligning sequence reads, clone sequencesand assembly contigs with bwa-mem. arXiv preprintarXiv:1303.3997 , 2013.37. Xia Fei, Zou Dan, Lu Lina, Man Xin, and Zhang Chunlei.FPGASW: Accelerating Large-Scale Smith–WatermanSequence Alignment Application with Backtracking onFPGA Linear Systolic Array.

Interdisciplinary Sciences:Computational Life Sciences , 2018.38. HT Kung. Why Systolic Architectures?

IEEE Computer ,(1):37–46, 1982.39. Daichi Fujiki, Arun Subramaniyan, Tianjun Zhang,Yu Zeng, Reetuparna Das, David Blaauw, and SatishNarayanasamy. GenAx: A Genome Sequencing Accel-erator. In

ISCA , 2018.40. Saransh Gupta, Mohsen Imani, Behnam Khaleghi,Venkatesh Kumar, and Tajana Rosing. RAPID: A ReRAMProcessing in-Memory Architecture for DNA SequenceAlignment. In , pages1–6. IEEE, 2019.41. Edans Flavius de Oliveira Sandes, Guillermo Miranda,Xavier Martorell, Eduard Ayguade, George Teodoro, andAlba Cristina Magalhaes Melo. CUDAlign 4.0: Incremen-tal speculative traceback for exact chromosome-widealignment in GPU clusters.

IEEE Transactions on Paralleland Distributed Systems , 27(10):2838–2850, 2016.42. Peng Chen, Chao Wang, Xi Li, and Xuehai Zhou. Ac-celerating the Next Generation Long Read Mapping withthe FPGA-Based System.

IEEE/ACM Transactions onComputational Biology and Bioinformatics , 2014.43. Gene Myers. A fast bit-vector algorithm for approximatestring matching based on dynamic programming.

Journalof the ACM (JACM) , 46(3):395–415, 1999.44. Subho Sankar Banerjee, Mohamed El-Hadedy,Jong Bin Lim, Zbigniew T. Kalbarczyk, Deming Chen,Steven S. Lumetta, and Ravishankar K. Iyer. ASAP:Accelerated Short-Read Alignment on ProgrammableHardware.

IEEE Transactions on Computers , 2019.45. Martin ˇSoˇsi´c and Mile ˇSiki´c. Edlib: A C/C++ Library

March 2020 ohammed Alser et al. for Fast, Exact Sequence Alignment Using Edit Distance. Bioinformatics , 2017.46. Vladimir L’vovich Arlazarov, Yeﬁm A Dinitz, MA Kronrod,and IgorAleksandrovich Faradzhev. On economical con-struction of the transitive closure of an oriented graph. In

Doklady Akademii Nauk , volume 194, pages 487–488.Russian Academy of Sciences, 1970.47. Guy St. C. Slater and Ewan Birney. Automated Gener-ation of Heuristics for Biological Sequence Comparison.

BMC Bioinformatics , 2005.48. Zheng Zhang, Scott Schwartz, Lukas Wagner, andWebb Miller. A Greedy Algorithm for Aligning DNASequences.

Journal of Computational Biology , 2000.49. Alberto Zeni, Giulia Guidi, Marquita Ellis, Nan Ding,Marco D. Santambrogio, Steven Hofmeyr, Aydın Buluc¸,Leonid Oliker, and Katherine Yelick. LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment.arXiv:2002.05200 [q-bio.GN], 2020.50. Tu˘gkan Batu, Funda Ergun, and Cenk Sahinalp. Oblivi-ous string embeddings and edit distance approximations.In

Proceedings of the seventeenth annual ACM-SIAMSymposium on Discrete Algorithms (SODA) , pages 792–801. Society for Industrial and Applied Mathematics,2006.51. Alexandr Andoni and Krzysztof Onak. Approximatingedit distance in near-linear time.

SIAM Journal on Com-puting , 41(6):1635–1648, 2012.52. Diptarka Chakraborty, Debarati Das, Elazar Golden-berg, Michal Koucky, and Michael Saks. Approximatingedit distance within constant factor in truly sub-quadratictime. In , pages 979–990.IEEE, 2018.53. Moses Charikar, Oﬁr Geri, Michael P Kim, and WilliamKuszmaul. On Estimating Edit Distance: Alignment,Dimension Reduction, and Embeddings. In . Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.54. Joshua Quick, Nicholas J Loman, Sophie Duraf-four, Jared T Simpson, Ettore Severi, Lauren Cowley,Joseph Akoi Bore, Raymond Koundouno, Gytis Dudas,Amy Mikhail, et al. Real-time, portable genome sequenc-ing for Ebola surveillance.

Nature , 530(7589):228–232,2016.55. Damla Senol Cali, Jeremie S Kim, Saugata Ghose,Can Alkan, and Onur Mutlu. Nanopore sequencingtechnology and tools for genome assembly: Computa-tional analysis of the current state, bottlenecks and future directions.

Brieﬁngs in Bioinformatics , 20(4):1542–1559,2019.56. Can Firtina, Jeremie S Kim, Mohammed Alser, DamlaSenol Cali, A Ercument Cicek, Can Alkan, and OnurMutlu. Apollo: a sequencing-technology-independent,scalable and accurate assembly polishing algorithm.

Bioinformatics , 36(12):3669–3679, 2020.

Mohammed Alser is with ETH Zrich, Switzerland.Contact him at [email protected]

Zlal Bingl is with Bilkent University, Turkey. Contacther at [email protected]

Damla Senol Cali is with Carnegie Mellon University,USA. Contact her at [email protected]

Jeremie Kim is with ETH Zrich, Switzerland, andCarnegie Mellon University, USA. Contact him [email protected]

Saugata Ghose is with the University of Illinois atUrbana–Champaign and Carnegie Mellon University,USA. Contact him at [email protected]

Can Alkan is with Bilkent University, Turkey. Contacthim at [email protected]

Onur Mutlu is with ETH Zrich, Switzerland, CarnegieMellon University, USA, and Bilkent University,Turkey. Contact him at [email protected]