[PDF] GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Abstract

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM). We propose GenASM, the first ASM acceleration framework for genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint, and we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth. We demonstrate that GenASM is a flexible, high-performance, and low-power framework, which provides significant performance and power benefits for three different use cases in genome sequence analysis: 1) GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while consuming 37x and 2.7x less power. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. 2) GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while consuming 1.7x less power and significantly improving the filtering accuracy. 3) GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while consuming 548-582x and 67x less power.

Full PDF

GGenASM: A High-Performance, Low-PowerApproximate String Matching Acceleration Frameworkfor Genome Sequence Analysis

Damla Senol Cali † (cid:111)(cid:110) Gurpreet S. Kalsi (cid:111)(cid:110)

Zülal Bingöl (cid:79)

Can Firtina (cid:5)

Lavanya Subramanian ‡ Jeremie S. Kim (cid:5)†

Rachata Ausavarungnirun (cid:12)

Mohammed Alser (cid:5)

Juan Gomez-Luna (cid:5)

Amirali Boroumand † Anant Nori (cid:111)(cid:110)

Allison Scibisz † Sreenivas Subramoney (cid:111)(cid:110)

Can Alkan (cid:79)

Saugata Ghose ? † Onur Mutlu (cid:5)† (cid:79) † Carnegie Mellon University (cid:111)(cid:110)

Processor Architecture Research Lab, Intel Labs (cid:79)

Bilkent University (cid:5)

ETH Zürich ‡ Facebook (cid:12)

King Mongkut’s University of Technology North Bangkok ? University of Illinois at Urbana–ChampaignGenome sequence analysis has enabled significant advance-ments in medical and scientific areas such as personalizedmedicine, outbreak tracing, and the understanding of evolution.To perform genome sequencing, devices extract small randomfragments of an organism’s DNA sequence (known as reads ).The first step of genome sequence analysis is a computationalprocess known as read mapping . In read mapping, each frag-ment is matched to its potential location in the reference genomewith the goal of identifying the original location of each readin the genome. Unfortunately, rapid genome sequencing is cur-rently bottlenecked by the computational power and memorybandwidth limitations of existing systems, as many of the stepsin genome sequence analysis must process a large amount ofdata. A major contributor to this bottleneck is approximatestring matching (ASM), which is used at multiple points duringthe mapping process. ASM enables read mapping to accountfor sequencing errors and genetic variations in the reads.We propose GenASM, the first ASM acceleration frameworkfor genome sequence analysis. GenASM performs bitvector-based ASM, which can efficiently accelerate multiple steps ofgenome sequence analysis. We modify the underlying ASMalgorithm (Bitap) to significantly increase its parallelism andreduce its memory footprint. Using this modified algorithm, wedesign the first hardware accelerator for Bitap. Our hardwareaccelerator consists of specialized systolic-array-based computeunits and on-chip SRAMs that are designed to match the rate ofcomputation with memory capacity and bandwidth, resultingin an efficient design whose performance scales linearly as weincrease the number of compute units working in parallel.We demonstrate that GenASM provides significant perfor-mance and power benefits for three different use cases in genomesequence analysis. First, GenASM accelerates read alignmentfor both long reads and short reads. For long reads, GenASM out-performs state-of-the-art software and hardware acceleratorsby 116 × and × , respectively, while reducing power consump-tion by × and 2.7 × . For short reads, GenASM outperformsstate-of-the-art software and hardware accelerators by × and × . Second, GenASM accelerates pre-alignment filteringfor short reads, with × the performance of a state-of-the-art pre-alignment filter, while reducing power consumption by × and significantly improving the filtering accuracy. Third,GenASM accelerates edit distance calculation, with 22–12501 × and 9.3–400 × speedups over the state-of-the-art software li-brary and FPGA-based accelerator, respectively, while reducingpower consumption by 548–582 × and × . We conclude thatGenASM is a flexible, high-performance, and low-power frame-work, and we briefly discuss four other use cases that can benefitfrom GenASM.

1. Introduction

Genome sequencing, which determines the DNA sequenceof an organism, plays a pivotal role in enabling many medi-cal and scientific advancements in personalized medicine[6, 20, 34, 53, 59], evolutionary theory [46, 139, 140], andforensics [17, 25, 179]. Modern genome sequencing ma-chines [77–79, 132–135, 152] can rapidly generate massive amounts of genomics data at low cost [8, 118, 153], but areunable to extract an organism’s complete DNA in one piece.Instead, these machines extract smaller random fragmentsof the original DNA sequence, known as reads . These readsthen pass through a computational process known as readmapping , which takes each read, aligns it to one or morepossible locations within the reference genome, and finds thematches and differences (i.e., distance ) between the read andthe reference genome segment at that location [6, 177]. Readmapping is the first key step in genome sequence analysis.State-of-the-art sequencing machines produce broadly oneof two kinds of reads.

Short reads (consisting of no morethan a few hundred DNA base pairs [30, 158]) are generatedusing short-read sequencing (SRS) technologies [144, 164],which have been on the market for more than a decade. Be-cause each read fragment is so short compared to the entireDNA (e.g., a human’s DNA consists of over 3 billion basepairs [166]), short reads incur a number of reproducibility(e.g., non-deterministic mapping) and computational chal-lenges [7, 10, 12, 52, 118, 159, 176–178].

Long reads (consist-ing of thousands to millions of DNA base pairs) are gener-ated using long-read sequencing (LRS) technologies, of whichOxford Nanopore Technologies’ (ONT) nanopore sequenc-ing [26, 35, 40, 82, 83, 89, 97, 112, 113, 116, 143, 152] and Pa-cific Biosciences’ (PacBio) single-molecule real-time (SMRT)sequencing [18, 47, 114, 123, 145, 146, 165, 171] are the mostwidely used ones. LRS technologies are relatively new, andthey avoid many of the challenges faced by short reads.LRS technologies have three key advantages comparedto SRS technologies. First, LRS devices can generate verylong reads, which (1) reduces the non-deterministic mappingproblem faced by short reads, as long reads are significantlymore likely to be unique and therefore have fewer potentialmapping locations in the reference genome; and (2) spanlarger parts of the repeated or complex regions of a genome,enabling detection of genetic variations that might exist inthese regions [165]. Second, LRS devices perform real-timesequencing, and can enable concurrent sequencing and anal-ysis [111, 142, 146]. Third, ONT’s pocket-sized device (Min-ION [133]) provides portability, making sequencing possibleat remote places using laptops or mobile devices. This en-ables a number of new applications, such as rapid infectiondiagnosis and outbreak tracing (e.g., COVID-19, Ebola, Zika,swine flu [37, 48, 64, 68, 85, 142, 167, 173]). Unfortunately, LRSdevices are much more error-prone in sequencing (with atypical error rate of 10–15% [19, 83, 165, 170]) compared toSRS devices (typically 0.1% [60, 61, 141]), which leads to newcomputational challenges [152].For both short and long reads, multiple steps of read map-ping must account for the sequencing errors, and for the dif-ferences caused by genetic mutations and variations. Theseerrors and differences take the form of base insertions, dele-tions, and/or substitutions [121,125,154,163,169,174]. As a re-sult, read mapping must perform approximate (or fuzzy ) stringmatching (ASM). Several algorithms exist for ASM, but state-of-the-art read mapping tools typically make use of an expen- a r X i v : . [ c s . A R ] S e p ive dynamic programming based algorithm [100, 126, 154]that scales quadratically in both execution time and requiredstorage. This ASM algorithm has been shown to be the ma-jor bottleneck in read mapping [8, 10, 55, 66, 75, 122, 162].Unfortunately, as sequencing technologies advance, thegrowth in the rate that sequencing devices generate readsis far outpacing the corresponding growth in computationalpower [8, 32], placing greater pressure on the ASM bottle-neck. Beyond read mapping, ASM is a key technique for otherbioinformatics problems such as whole genome alignment(WGA) [27,28,41,42,70,95,102,106,115,151,160] and multiplesequence alignment (MSA) [29,45,69,98,107,127,128,136,150],where two or more whole genomes, or regions of multiplegenomes (from the same or different species), are comparedto determine their similarity for predicting evolutionary re-lationships or finding common regions (e.g., genes). Thus,there is a pressing need to develop techniques for genomesequence analysis that provide fast and efficient ASM.In this work, we propose GenASM , an ASM accelerationframework for genome sequence analysis. Our goal is to de-sign a fast, efficient, and flexible framework for both shortand long reads, which can be used to accelerate multiple steps of the genome sequence analysis pipeline. To avoid imple-menting more complex hardware for the dynamic program-ming based algorithm [22, 33, 49, 65, 87, 88, 147, 162], we baseGenASM upon the

Bitap algorithm [21, 174]. Bitap uses onlyfast and simple bitwise operations to perform approximatestring matching, making it amenable to efficient hardwareacceleration. To our knowledge, GenASM is the first workthat enhances and accelerates Bitap.To use Bitap for GenASM, we make two key algorithmicmodifications that allow us to overcome key limitations thatprevent the original Bitap algorithm from being efficient forgenome sequence analysis (we discuss these limitations inSection 2.3). First, to improve Bitap’s applicability to differentsequencing technologies and its performance, we (1) modifythe algorithm to support long reads (in addition to alreadysupported short reads), and (2) eliminate loop-carried datadependencies so that we can parallelize a single string match-ing operation. Second, we develop a novel Bitap-compatiblealgorithm for traceback , a method that utilizes informationcollected during ASM about the different types of errors toidentify the optimal alignment of reads. The original Bitapalgorithm is not capable of performing traceback.In GenASM, we co-design our modified Bitap algorithm andour new Bitap-compatible traceback algorithm with an area-and power-efficient hardware accelerator, which consists oftwo components: (1)

GenASM-DC , which provides hardwaresupport to efficiently execute our modified Bitap algorithm togenerate bitvectors (each of which represents one of the fourpossible cases: match, insertion, deletion, or substitution) andperform distance calculation (DC) (which calculates the min-imum number of errors between the read and the referencesegment); and (2)

GenASM-TB , which provides hardware sup-port to efficiently execute our novel traceback (TB) algorithmto find the optimal alignment of a read, using the bitvectorsgenerated by GenASM-DC. Our hardware accelerator (1) bal-ances the compute resources with available memory capacityand bandwidth per compute unit to avoid wasting resources,(2) achieves high performance and power efficiency by usingspecialized compute units that we design to exploit data lo-cality, and (3) scales linearly in performance with the numberof parallel compute units that we add to the system.

Use Cases.

GenASM is an efficient framework for accel-erating genome sequence analysis that has multiple possibleuse cases. In this paper, we describe and rigorously evaluatethree use cases of GenASM. First, we show that GenASM can effectively accelerate the read alignment step of read map-ping (Section 10.2). Second, we illustrate that GenASM canbe employed as the most efficient (to date) pre-alignmentfilter [9, 10] for short reads (Section 10.3). Third, we demon-strate how GenASM can efficiently find the edit distance (i.e.,Levenshtein distance [100]) between two sequences of ar-bitrary lengths (Section 10.4). In addition, GenASM can beutilized in several other parts of genome sequence analysis aswell as in text analysis, which we briefly discuss in Section 11.

Results Summary.

We evaluate GenASM for three dif-ferent use cases of ASM in genome sequence analysis usinga combination of the synthesized SystemVerilog model ofour hardware accelerators and detailed simulation-based per-formance modeling. (1) For read alignment, we compareGenASM to state-of-the-art software (Minimap2 [102] andBWA-MEM [101]) and hardware approaches (GACT in Dar-win [162] and SillaX in GenAx [55]), and find that GenASM issignificantly more efficient in terms of both speed and powerconsumption. For this use case, we compare GenASM only with the read alignment steps of the baseline tools and accel-erators. For long reads, GenASM achieves 116 × and 648 × speedup over 12-thread runs of the alignment steps of Min-imap2 and BWA-MEM, respectively, while reducing powerconsumption by 37 × and 34 × . Compared to GACT, GenASMprovides 6.6 × the throughput per unit area and 10.5 × thethroughput per unit power for long reads. For short reads,GenASM achieves 158 × and 111 × speedup over 12-threadruns of the alignment steps of Minimap2 and BWA-MEM,respectively, while reducing power consumption by 31 × and33 × . Compared to SillaX, GenASM is 1.9 × faster at a com-parable area and power consumption. (2) For pre-alignmentfiltering of short reads, we compare GenASM with a state-of-the-art FPGA-based filter, Shouji [9]. GenASM provides 3.7 × speedup over Shouji, while reducing power consumption by1.7 × , and also significantly improving the filtering accuracy.(3) For edit distance calculation, we compare GenASM witha state-of-the-art software library, Edlib [155], and FPGA-based accelerator, ASAP [22]. Compared to Edlib, GenASMprovides 22–12501 × speedup, for varying sequence lengthsand similarity values, while reducing power consumption by548–582 × . Compared to ASAP, GenASM provides 9.3–400 × speedup, while reducing power consumption by 67 × .This paper makes the following contributions:• To our knowledge, GenASM is the first work that enhancesand accelerates the Bitap algorithm for approximate stringmatching. We modify Bitap to add efficient support forlong reads and enable parallelism within each ASM opera-tion. We also propose the first Bitap-compatible tracebackalgorithm. We open source our software implementationsof the GenASM algorithms [148].• We present GenASM, a novel approximate string match-ing acceleration framework for genome sequence analysis.GenASM is a power- and area-efficient hardware imple-mentation of our new Bitap-based algorithms.• We show that GenASM can accelerate three use cases ofapproximate string matching (ASM) in genome sequenceanalysis (i.e., read alignment, pre-alignment filtering, editdistance calculation). We find that GenASM is greatly fasterand more power-efficient for all three use cases than state-of-the-art software and hardware baselines.

2. Background

A common approach to the first step in genome sequenceanalysis is to perform read mapping , where each read of anorganism’s sequenced genome is matched against the ref-erence genome for the organism’s species to find the read’sriginal location. As Figure 1 shows, typical read map-ping [6, 96, 101, 102, 105, 177] is a four-step process. First,read mapping starts with indexing , which is an offline pre-processing step performed on a known reference genome.Second, once a sequencing machine generates reads froma DNA sequence, the seeding process queries the indexstructure to determine the candidate (i.e., potential) map-ping locations of each read in the reference genome usingsubstrings (i.e., seeds ) from each read. Third, for each read, pre-alignment filtering uses filtering heuristics to examinethe similarity between a read and the portion of the referencegenome at each of the read’s candidate mapping locations.These filtering heuristics aim to eliminate most of the dissimi-lar pairs of reads and candidate mapping locations to decreasethe number of required alignments in the next step. Fourth,for all of the remaining candidate mapping locations, readalignment runs a dynamic programming based algorithmto determine which of the candidate mapping locations inthe reference matches best with the input read. As part ofthis step, traceback is performed between the reference andthe input read to find the optimal alignment , which is thealignment with the highest likelihood of being correct (basedon a scoring function [62, 117, 168]). The optimal alignmentis defined using a CIGAR string [103], which shows the se-quence and position of each match, substitution, insertion,and deletion for the read with respect to the selected mappinglocation of the reference.

IndexingSeedingPre-Alignment FilteringRead Alignment

Referencegenome Hash-table based indexPotential mapping locationsOptimal alignmentNon-filtered candidate mapping locationsReadsReferencesegmentQuery read

IndexingSeedingPre-Alignment FilteringRead Alignment

Reference genome

Hash table based index (pre-processed)Candidate mapping locations

Optimal alignment

Remaining candidate mapping locations

Reads from sequenced genome

Indexing SeedingPre-Alignment Filtering Read Alignment

Reference genome

Hash table based index (pre-processed) Candidate mapping locations

Optimal alignment

Remaining candidate mapping locations

Reads from sequenced genome

Figure 1. Four steps of read mapping.

The goal of approximate string matching [125] is to de-tect the differences and similarities between two sequences.Given a query read sequence Q =[ q q . . . q m ], a reference textsequence T =[ t t . . . t n ] (where m = | Q | , n = | T | , n ≥ m ), andan edit distance threshold E , the approximate string matchingproblem is to identify a set of approximate matches of Q in T (allowing for at most E differences). The differences betweentwo sequences of the same species can result from sequenc-ing errors [18, 54] and/or genetic variations [5, 50]. Reads areprone to sequencing errors, which account for about 0.1%of the length of short reads [60, 61, 141] and 10–15% of thelength of long reads [19, 83, 165, 170].The differences, known as edits , can be classified as substi-tutions , deletions , or insertions in one or both sequences [100].Figure 2 shows each possible kind of edit. In ASM, to detect adeleted character or an inserted character, we need to exam-ine all possible prefixes (i.e., substrings that include the firstcharacter of the string) or suffixes (i.e., substrings that includethe last character of the string) of the two input sequences,and keep track of the pairs of prefixes or suffixes that providethe minimum number of edits.Approximate string matching is needed not only to deter-mine the minimum number of edits between two genomic se-quences, but also to provide the location and type of each edit.As two sequences could have a large number of different pos-sible arrangements of the edit operations and matches (andhence different alignments ), the approximate string matchingalgorithm usually involves a traceback step. The alignment A A A AT G T T TA G T G C TA C T T GA A A AT G T T TA G T G C TA C T T G

Reference:Read: insertionsubstitutiondeletion C Figure 2. Three types of errors (i.e., edits). score is the sum of all edit penalties and match scores alongthe alignment, as defined by a user-specified scoring function.This step finds the optimal alignment as the combination ofedit operations to build up the highest alignment score.Approximate string matching is typically implemented as adynamic programming based algorithm. Existing implemen-tations, such as Levenshtein distance [100], Smith-Waterman[154], and Needleman-Wunsch [126], have quadratic timeand space complexity (i.e., O ( m × n ) between two sequenceswith lengths m and n ). Therefore, it is desirable to find lower-complexity algorithms for ASM. One candidate to replace dynamic programming basedalgorithms for ASM is the

Bitap algorithm [21, 174]. Bitaptackles the problem of computing the minimum edit distancebetween a reference text (e.g., reference genome) and a querypattern (e.g., read) with a maximum of k many errors. When k is 0, the algorithm finds the exact matches.Algorithm 1 shows the Bitap algorithm and Figure 3 showsan example for the execution of the algorithm. The algorithmstarts with a pre-processing procedure (Line 4 in Algorithm 1; in Figure 3) that converts the query pattern into m -sizedpattern bitmasks, PM . We generate one pattern bitmask foreach character in the alphabet. Since 0 means match in theBitap algorithm, we set PM [ a ][ i ] = 0 when pattern [ i ] = a ,where a is a character from the alphabet (e.g., A, C, G, T).These pattern bitmasks help us to represent the query patternin a binary format. After the bitmasks are prepared for eachcharacter, every bit of all status bitvectors ( R [ d ], where d isin range [0, k ]) is initialized to 1 (Lines 5–6 in Algorithm 1; in Figure 3). Each R [ d ] bitvector at text iteration i holds thepartial match information between text [ i : ( n –1)] (Line 8) andthe query with maximum of d errors. Since at the beginningof the execution there are no matches, we initialize all statusbitvectors with 1s. The status bitvectors of the previous itera-tion with edit distance d is kept in oldR [ d ] (Lines 10–11) totake partial matches into consideration in the next iterations.The algorithm examines each text character one by one,one per iteration. At each text iteration ( – ), the pat-tern bitmask of the current text character ( PM ) is retrieved(Line 12). After the status bitvector for exact match is com-puted ( R [0]; Line 13), the status bitvectors for each distance( R [ d ]; d = 1... k ) are computed using the rules in Lines 15–19.For a distance d , three intermediate bitvectors for the errorcases (one each for deletion, insertion, substitution; D/I/S inFigure 3) are calculated by using oldR [ d – 1] or R [ d – 1], sincea new error is being added (i.e., the distance is increasing by1), while the intermediate bitvector for the match case (M)is calculated using oldR [ d ]. For a deletion (Line 15), we arelooking for a string match if the current pattern characteris missing, so we copy the partial match information of theprevious character ( oldR [ d – 1]; consuming a text character) without any shifting ( not consuming a pattern character) toserve as the deletion bitvector (labeled as D of R – ). For a substitution (Line 16), we are looking for astring match if the current pattern character and the currenttext character do not match, so we take the partial matchinformation of the previous character ( oldR [ d – 1]; consum-ing a text character) and shift it left by one (consuming apattern character) before saving it as the substitution bitvec-tor (labeled as S of R – ). For an insertion(Line 17), we are looking for a string match if the current lgorithm 1 Bitap Algorithm

Inputs: text (reference), pattern (query), k (edit distance threshold) Outputs: startLoc (matching location), editDist (minimum edit distance)1: n ← length of reference text m ← length of query pattern procedure Pre-Processing4: PM ← generatePatternBitmaskACGT(pattern) . pre-process the pattern5: for d in 0:k do R[d] ← . initialize R bitvectors to 1s7: procedure Edit Distance Calculation8: for i in (n-1):-1:0 do . iterate over each text character9: curChar ← text[i] for d in 0:k do oldR[d] ← R[d] . copy previous iterations’ bitvectors as oldR12: curPM ← PM[curChar] . retrieve the pattern bitmask13: R[0] ← (oldR[0]<< | curPM . status bitvector for exact match14: for d in 1:k do . iterate over each edit distance15: deletion (D) ← oldR[d-1] substitution (S) ← (oldR[d-1]<<1) insertion (I) ← (R[d-1]<<1) match (M) ← (oldR[d]<<1) | curPM R[d] ← D & S & I & M . status bitvector for d errors20: if MSB of R[d] == 0, where 0 ≤ d ≤ k . check if MSB is 021: startLoc ← i . matching location22: editDist ← d . found minimum edit distance PREPROCESSING

Pattern Bitmasks: CTGAPM(A) = 111 = PM(G) = 11 PM(T) = 1 State Vectors:R0 = 1111 R1 = 1111 Text[4]: CGTGA oldR0 = 1111 oldR1 = 1111 R0 = (oldR0 << 1) | PM( A ) = 1110 R1 = = D & S & I & M = 1100 D : oldR0 = 1111 S : oldR0 << 1 = 1110 I : R0 << 1 = 1100 M : (oldR1 << 1) | PM( A ) = 1110 Text[3]: CGTGA oldR0 = 1110 oldR1 = 1100 R0 = (oldR0 << 1) | PM( G ) = 1101 R1 = = D & S & I & M = 1000 D : oldR0 = 1110 S : oldR0 << 1 = 1100 I : R0 << 1 = 1010 M : (oldR1 << 1) | PM( G ) = 1101 Text[2]: CGTGA oldR0 = 1101 oldR1 = 1000 R0 = (oldR0 << 1) | PM( T ) = 1011 R1 = = D & S & I & M = D : oldR0 = 1101 S : oldR0 << 1 = 1010 I : R0 << 1 = M : (oldR1 << 1) | PM( T ) = 1011 Alignment Found @ Location=2

Text[1]: CGTGA oldR0 = 1011 oldR1 = 0000 R0 = (oldR0 << 1) | PM( G ) = 1111 R1 = = D & S & I & M = D : oldR0 = 1011 S : oldR0 << 1 = I : R0 << 1 = 1110 M : (oldR1 << 1) | PM( G ) = 1101 Alignment Found @ Location=1

Text[0]: CGTGA oldR0 = 1111 oldR1 = 0000 R0 = (oldR0 << 1) | PM( C ) = 1111 R1 = = D & S & I & M = D : oldR0 = 1111 S : oldR0 << 1 = 1110 I : R0 << 1 = 1110 M : (oldR1 << 1) | PM( C ) = Alignment Found @ Location=0Text Region:

CGTGA

Query Pattern:

CTGA

Edit Distance Threshold (k): Figure 3. Example for the Bitap algorithm. text character is missing, so we copy the partial match infor-mation of the current character ( R [ d – 1]; not consuming atext character) and shift it left by one (consuming a patterncharacter) before saving it as the insertion bitvector (labeledas I of R – ). For a match (Line 18), weare looking for a string match only if the current patterncharacter matches the current text character, so we take thepartial match information of the previous character ( oldR [ d ];consuming a text character but not increasing the edit dis-tance), shift it left by one (consuming a pattern character),and perform an OR operation with the pattern bitmask ofthe current text character ( curPM ; comparing the text char-acter and the pattern character) before saving the result asthe match bitvector (labeled as R M of R – ).After computing all four intermediate bitvectors, in orderto take all possible partial matches into consideration, we per-form an AND operation (Line 19) with these four bitvectorsto preserve all 0s that exist in any of them (i.e., all potentiallocations for a string match with an edit distance of d upto this point). We save the ANDed result as the R [ d ] statusbitvector for the current iteration. This process is repeatedfor each potential edit distance value from 0 to k . If the mostsignificant bit of the R [ d ] bitvector becomes 0 (Lines 20–22),then there is a match starting at position i of the text with anedit distance d (as shown in – ). The traversal of the textthen continues until all possible text positions are examined.

3. Motivation and Goals

Although the Bitap algorithm is highly suitable for hard-ware acceleration due to the simple nature of its bitwise op-erations, we find that it has five limitations that hinder itsapplicability and efficient hardware acceleration for genomeanalysis. In this section, we discuss each of these limitations. In order to overcome these limitations and design an effec-tive and efficient accelerator, we find that we need to both(1) modify and extend the Bitap algorithm and (2) developspecialized hardware that can exploit the new opportunitiesthat our algorithmic modifications provide.

No Support for Long Reads.

In state-of-the-art imple-mentations of Bitap, the query length is limited by the wordsize of the machine running the algorithm. This is due to(1) the fact that the bitvector length must be equal to the querylength, and (2) the need to perform bitwise operations on thebitvectors. By limiting the bitvector length to a word, eachbitwise operation can be done using a single CPU instruction.Unfortunately, the lack of multi-word queries prevents theseimplementations from working for long reads, whose lengthsare on the order of thousands to millions of base pairs (whichrequire thousands of bits to store).

Data Dependency Between Iterations.

As we show inSection 2.3, the computed bitvectors at each text iteration(i.e., R[d]) of the Bitap algorithm depend on the bitvectorscomputed in the previous text iteration (i.e., oldR[d-1] andoldR[d]; Lines 11, 13, 15, 16, and 18 of Algorithm 1). Fur-thermore, for each text character, there is an inner loop thatiterates for the maximum edit distance number of iterations(Line 14). The bitvectors computed in each of these inneriterations (i.e., R[d]) are also dependent on the previous inneriteration’s computed bitvectors (i.e., R[d-1]; Line 17). Thistwo-level data dependency forces the consecutive iterationsto take place sequentially.

No Support for Traceback.

Although the baseline Bitapalgorithm can find possible matching locations of each queryread within the reference text, this covers only the first step ofapproximate string matching required for genome sequenceanalysis. Since there could be multiple different alignmentsbetween the read and the reference, the traceback opera-tion [14, 51, 62, 63, 117, 120, 154, 163, 168, 169] is needed tofind the optimal alignment , which is the alignment with theminimum edit distance (or with the highest score based ona user-defined scoring function). However, Bitap does notinclude any such support for optimal alignment identification.

Limited Compute Parallelism.

Even after we solve thealgorithmic limitations of Bitap, we find that we cannot ex-tract significant performance benefits with just algorithmicenhancements alone. For example, while Bitap iterates overeach character of the input text sequentially (Line 8), wecan enable text-level parallelism to improve its performance(Section 5). However, the achievable level of parallelism islimited by the number of compute units in existing systems.For example, our studies show that Bitap is bottlenecked bycomputation on CPUs, since the working set fits within theprivate caches but the limited number of cores prevents thefurther speedup of the algorithm.

Limited Memory Bandwidth.

We would expect that aGPU, which has thousands of compute units, can overcomethe limited compute parallelism issues that CPUs experience.However, we find that a GPU implementation of the Bitapalgorithm suffers from the limited amount of memory band-width available for each GPU thread. Even when we run aCUDA implementation of the baseline Bitap algorithm [104],whose bandwidth requirements are significantly lower thanour modified algorithm, the limited memory bandwidth bot-tlenecks the algorithm’s performance. We find that the bot-tleneck is exacerbated after the number of threads per blockreaches 32, as Bitap becomes shared cache-bound (i.e., on-GPU L2 cache-bound). The small number of registers becomesinsufficient to hold the intermediate data required for Bitapexecution. Furthermore, when the working set of a threadoes not fit within the private memory of the thread, destruc-tive interference between threads while accessing the sharedmemory creates bottlenecks in the algorithm on GPUs. Weexpect these issues to worsen when we implement traceback,which requires significantly higher bandwidth than Bitap.

Our goal in this work is to overcome these limitations anduse Bitap in a fast, efficient, and flexible ASM frameworkfor both short and long reads. We find that this goal cannotbe achieved by modifying only the algorithm or only thehardware. We design

GenASM , the first ASM accelerationframework for genome sequence analysis. Through carefulmodification and co-design of the enhanced Bitap algorithmand hardware, GenASM aims to successfully replace the ex-pensive dynamic programming based algorithm used for ASMin genomics with the efficient bitwise-operation-based Bitapalgorithm, which can accelerate multiple steps of genomesequence analysis.

4. GenASM: A High-Level Overview

In GenASM, we co-design our modified Bitap algorithmfor distance calculation (DC) and our new Bitap-compatibletraceback (TB) algorithm with an area- and power-efficienthardware accelerator. GenASM consists of two components,as shown in Figure 4: (1) GenASM-DC (Section 5), which foreach read generates the bitvectors and performs the minimumedit distance calculation (DC); and (2) GenASM-TB (Section 6),which uses the bitvectors to perform traceback (TB) and findthe optimal alignment. GenASM is a flexible framework thatcan be used for different use cases (Section 8).GenASM execution starts when the host CPU issues a taskto GenASM with the reference and the query sequences’ loca-tions ( in Figure 4). GenASM-DC reads the correspondingreference text region and the query pattern from the memory.GenASM-DC then writes these to its dedicated SRAM, whichwe call DC-SRAM ( ). After that, GenASM-DC divides thereference text (e.g., reference genome) and query pattern (e.g.,read) into multiple overlapping windows ( ), and for each sub-text (i.e., the portion of the reference text in one win-dow) and sub-pattern (i.e., the portion of the query patternin one window), GenASM-DC searches for the sub-patternwithin the sub-text and generates the bitvectors ( ). Eachprocessing element (PE) of GenASM-DC writes the gener-ated bitvectors to its own dedicated SRAM, which we callTB-SRAM ( ). Once GenASM-DC completes its search forthe current window, GenASM-TB starts reading the storedbitvectors from TB-SRAMs ( ) and generates the window’straceback output ( ). Once GenASM-TB generates this out-put, GenASM computes the next window and repeats Steps – until all windows are completed.Our hardware accelerators are designed to maximize par-allelism and minimize memory footprint. Our modifiedGenASM-DC algorithm is highly parallelizable, and performsonly simple and regular bitwise operations, so we implementthe GenASM-DC accelerator as a systolic array based accelera-tor. GenASM-TB accelerator requires simple logic operations Host CPU GenASM-TBAcceleratorGenASM-DCAcceleratorMain Memory reference & query locations Write bitvectorsreference text & query pattern

DC-SRAM sub-text & sub-pattern Read bitvectors Find the traceback output

DC-Controller

Generate bitvectors

GenASM-DC GenASM-TB

TB-SRAM TB-SRAM TB-SRAM n ...

21 3 4 5 6 7

Figure 4. Overview of GenASM. to perform the TB-SRAM accesses and the required controlflow to complete the traceback operation. Both of our hard-ware accelerators are highly efficient in terms of area andpower. We discuss them in detail in Section 7.

5. GenASM-DC Algorithm

We modify the baseline Bitap algorithm (Section 2.3) to(1) enable efficient alignment of long reads, (2) remove thedata dependency between the iterations, and (3) provide par-allelism for the large number of iterations.

Long Read Support.

The GenASM-DC algorithm over-comes the word-length limit of Bitap (Section 3.1) by storingthe bitvectors in multiple words when the query is longerthan the word size. Although this modification leads to addi-tional computation when performing shifts, it helps GenASMto support both short and long reads. When shifting word i ofa multi-word bitvector, the bit shifted out (MSB) of word i – 1needs to be stored separately before performing the shift onword i – 1. Then, that saved bit needs to be loaded as the leastsignificant bit (LSB) of word i when the shift occurs. Thiscauses the complexity of the algorithm to be d mw e × n × k ,where m is the query length, w is the word size, n is the textlength, and k is the edit distance. Loop Dependency Removal.

In order to solve the two-level data dependency limitation of the baseline Bitap algo-rithm (Section 3.1), GenASM-DC performs loop unrolling andenables computing non-neighbor (i.e., independent) bitvec-tors in parallel. Figure 5 shows an example for unrolling withfour threads for text characters T0–T3 and status bitvectorsR0–R7. For the iteration where R [ d ] represents T2–R2 (i.e.,the target cell shaded in dark red), R [ d – 1] refers to T2–R1, oldR [ d – 1] refers to T1–R1, and oldR [ d ] refers to T1–R2 (i.e.,cells T2–R2 is dependent on, shaded in light red). Based onthis example, T2–R2 depends on T1–R2, T2–R1, and T1–R1,but it does not depend on T3–R1, T1–R3, or T0–R4. Thus,these independent bitvectors can be computed in parallelwithout waiting for one another. Cycle

Thread R0/4

Thread R1/5

Thread R2/6

Thread R3/7 d )cells target cell depends on (oldR d , R d-1 , oldR d-1 ) data written to memory data read from memory Cycle

Thread R0/1/2/..

Figure 5. Loop unrolling in GenASM-DC.

Text-Level Parallelism.

In addition to the parallelismenabled by removing the loop dependencies, we enableGenASM-DC algorithm to exploit text-level parallelism. Thisparallelism is enabled by dividing the text into overlappingsub-texts and searching the query in each of these sub-textsin parallel. The overlap ensures that we do not miss any pos-sible match that may fall around the edges of a sub-text. Toguarantee this, the overlap needs to be of length m + k , where m is the query length and k is the edit distance threshold.

6. GenASM-TB Algorithm

After finding the matching location of the text and the editdistance with GenASM-DC, our new traceback [14, 51, 62,63, 117, 120, 154, 163, 168, 169] algorithm, GenASM-TB, findsthe sequence of matches, substitutions, insertions and dele-tions, along with their positions (i.e., CIGAR string) for theatched region (i.e., the text region that starts from the loca-tion reported by GenASM-DC and has a length of m + k ), andreports the optimal alignment. Traceback execution (1) startsfrom the first character of the matched region between thereference text and query pattern, (2) examines each char-acter and decides which of the four operations should bepicked in each iteration, and (3) ends when we reach thelast character of the matched region. GenASM-TB uses theintermediate bitvectors generated and saved in each itera-tion of the GenASM-DC algorithm (i.e., match, substitution,deletion and insertion bitvectors generated in Lines 15–18in Algorithm 1). After a value 0 is found at the MSB of oneof the R [ d ] bitvectors (i.e., a string match is found with d errors), GenASM-TB walks through the bitvectors back tothe LSB, following a chain of 0s (which indicate matchesat each location) and reverting the bitwise operations. Ateach position, based on which of the four bitvectors holdsa value 0 in each iteration (starting with an MSB with a 0and ending with an LSB with a 0), the sequence of matches,substitutions, insertions and deletions (i.e., traceback output)is found for each position of the corresponding alignmentfound by GenASM-DC. Unlike GenASM-DC, GenASM-TBhas an irregular control flow within the stored intermediatebitvectors, which depends on the text and the pattern.Algorithm 2 shows the GenASM-TB algorithm and Figure 6shows an example for the execution of the algorithm foreach of the alignments found in – of Figure 3. In Fig-ure 6, < x , y , z > stands for patternI , textI and curError ,respectively (Lines 6–8 in Algorithm 2). patternI repre-sents the position of a 0 currently being processed withina given bitvector (i.e., pattern index), textI represents theouter loop iteration index (i.e., text index; i in Algorithm 1),and curError represents the inner loop iteration index (i.e.,number of remaining errors; d in Algorithm 1).When we find a 0 at match[textI][curError][patternI] (i.e., a match (M) is found for the current position; Line 17),one character each from both text and query is consumed,but the number of remaining errors stays the same. Thus, thepointer moves to the next text character (as the text characteris consumed), and the 0 currently being processed (high-lighted with orange color in Figure 6) is right-shifted by one(as the query character is also consumed). In other words, textI is incremented (Line 28), patternI is decremented(Line 30), but curError remains the same. Thus, < x , y , z >becomes < x – 1, y + 1, z > after we find a match. For example,in Figure 6a, for Text[0], we have <3, 0, 1> for the indices, andafter the match is found, at the next position (Text[1]), wehave <2, 1, 1>.When we find a 0 at subs[textI][curError][patternI] (i.e., a substitution (S) is found for the current position;Line 19), one character each from both text and query is con-sumed, and the number of remaining errors is decremented(Line 26). Thus, < x , y , z > becomes < x – 1, y + 1, z – 1> afterwe find a substitution (e.g., Text[1] in Figure 6b).When we find a 0 at ins[textI][curError][patternI] (i.e.,an insertion (I) is found for the current position; Lines 13and 21), the inserted character does not appear in the text,and only a character from the pattern is consumed. The 0currently being processed is right-shifted by one, but thetext pointer remains the same, and the number of remainingerrors is decremented. Thus, < x , y , z > becomes < x –1, y , z –1>after we find an insertion (e.g., Text[–] in Figure 6c).When we find a 0 at del[textI][curError][patternI] (i.e.,a deletion (D) is found for the current position; Lines 15 and23), the deleted character does not appear in the pattern, andonly a character from the text is consumed. The 0 currentlybeing processed is not right-shifted, but the pointer moves to Algorithm 2

GenASM-TB Algorithm

Inputs: text (reference), n , pattern (query), m , W (window size), O (overlap size) Output:

CIGAR (complete traceback output)1: ← <0,0> . start positions of sub-pattern and sub-text2: while (curPattern < m) & (curText < n) do sub-pattern ← pattern[curPattern:(curPattern+W)] sub-text ← text[curText:(curText+W)] intermediate bitvectors ← GenASM-DC(sub-pattern,sub-text,W) patternI ← W-1 . pattern index (position of 0 being processed)7: textI ← . text index8: curError ← editDist from GenASM-DC . number of remaining errors9: ← <0,0> prev ← "" . output of previous TB iteration11: while textConsumed<(W-O) & patternConsumed<(W-O) do status ← if ins[textI][curError][patternI]=0 & prev=’I’ status ←

3; add "I" to CIGAR; . insertion-extend15: else if del[textI][curError][patternI]=0 & prev=’D’ status ←

4; add "D" to CIGAR; . deletion-extend17: else if match[textI][curError][patternI]=0 status ←

1; add "M" to CIGAR; prev ← "M" . match19: else if subs[textI][curError][patternI]=0 status ←

2; add "S" to CIGAR; prev ← "S" . substitution21: else if ins[textI][curError][patternI]=0 status ←

3; add "I" to CIGAR; prev ← "I" . insertion-open23: else if del[textI][curError][patternI]=0 status ←

4; add "D" to CIGAR; prev ← "D" . deletion-open25: if (status > 1) curError-- . S, D, or I27: if (status > 0) && (status != 3) textI++; textConsumed++ . M, S, or D29: if (status > 0) && (status != 4) patternI--; patternConsumed++ . M, S, or I31: curPattern ← curPattern+patternConsumed curText ← curText+textConsumed Deletion Example (Text Location=0)Text[0]: C Text[1]: G Text[2]: T Text[3]: G Text[4]: AMatch(C) Del(–) Match(T) Match(G) Match(A)<3,0,1> <2,1,1> <2,2,0> <1,3,0> <0,4,0>R0- : ....

R1-M : R0- : ....

R1-D : 1 R0-M : R1- : ....

R0-M : R1- : ....

R0-M : : ....

Substitution Example (Text Location=1)Text[1]: G Text[2]: T Text[3]: G Text[4]: ASubs(C) Match(T) Match(G) Match(A)<3,1,1> <2,2,0> <1,3,0> <0,4,0>R0- : ....

R1-S : R0-M : R1- : ....

R0-M : R1- : ....

R0-M : : ....

Insertion Example (Text Location=2)Text[–] Text[2]: T Text[3]: G Text[4]: AIns(C) Match(T) Match(G) Match(A)<3,2,1> <2,2,0> <1,3,0> <0,4,0>R0- : ....

R1-I : R0-M : R1- : ....

R0-M : R1- : ....

R0-M : : ....

Figure 6. Traceback example with GenASM-TB algorithm. the next text character, and the number of remaining errorsis also decremented. Thus, < x , y , z > becomes < x , y + 1, z – 1>after we find an insertion (e.g., Text[1] in Figure 6a). Divide-and-Conquer Approach.

Since GenASM-DCstores all of the intermediate bitvectors, in the worst case,the length of the text region that the query pattern maps tocan be m + k , assuming all of the errors are deletions fromthe pattern. Since we need to store all of the bitvectors for m + k characters, and compute 4 × k many bitvectors withineach text iteration (each m bits long), for long reads with higherror rates, the memory requirement becomes ~80GB, whenm is 10,000 and k is 1,500.In order to decrease the memory footprint of the algorithm,we follow two key ideas. First, we apply a divide-and-conquerapproach (similar to the tiling approach of Darwin’s align-ment accelerator, GACT [162]). Instead of storing all of thebitvectors for m + k text characters, we divide the text and pat-tern into overlapping windows (i.e., sub-text and sub-pattern;Lines 3–4 in Algorithm 2) and perform the traceback com-putation for each window. After all of the windows’ partialtraceback outputs are generated, we merge them to find theomplete traceback output. This approach helps us to de-crease the memory footprint from (( m + k ) × × k × m )bits to ( W × × W × W ) bits, where W is the window size.This divide-and-conquer approach also helps us to reducethe complexity of the bitvector generation step (Section 5)from d mw e × n × k to d Ww e × W × W . Second, instead ofstoring all 4 bitvectors (i.e., match, substitution, insertion,deletion) separately, we only need to store bitvectors formatch, insertion, and deletion, as the substitution bitvectorcan be obtained easily by left-shifting the deletion bitvectorby 1 (Line 16 in Algorithm 1). This modification helps usto decrease the required write bandwidth and the memoryfootprint to ( W × × W × W ) bits.GenASM-TB restricts the number of consumed charactersfrom the text or the pattern to W-O (Line 11 in Algorithm 2)to ensure that consecutive windows share O characters (i.e.,overlap size), and thus, the traceback output can be generatedaccurately. The sub-text and the sub-pattern correspondingto each window are found using the number of consumed textcharacters ( textConsumed ) and the number of consumed pat-tern characters ( patternConsumed ) in the previous window(Lines 31–32 in Algorithm 2). Partial Support for Complex Scoring Schemes.

Weextend the GenASM-TB algorithm to provide partial sup-port (Section 10.2) for non-unit costs for different edits andthe affine gap penalty model [14, 62, 117, 168]. By changingthe order in which different traceback cases are checked inLines 13–24 in Algorithm 2, we can support different typesof scoring schemes. For example, in order to mimic the be-havior of the affine gap penalty model, we check whetherthe traceback output that has been chosen for the previousposition (i.e., prev ) is an insertion or a deletion. If the pre-vious edit is a gap (insertion or deletion), and there is a 0at the current position of the insertion or deletion bitvector(Lines 13 and 15 in Algorithm 2), then we prioritize extendingthis previously opened gap, and choose insertion-extend ordeletion-extend as the current position’s traceback output, de-pending on the type of the previous gap. As another example,in order to mimic the behavior of non-unit costs for differ-ent edits, we can simply sort three error cases (substitution,insertion-open, deletion-open) from the lowest penalty to thehighest penalty. If substitutions have a lower penalty thangap openings, the order shown in Algorithm 2 should remainthe same. However, if substitutions have a greater penaltythan gap openings, we should check for the substitution caseafter checking the insertion-open and deletion-open cases(i.e., Lines 19–20 should come after Line 24 in Algorithm 2).

7. GenASM Hardware Design

GenASM-DC Hardware.

We implement GenASM-DC asa linear cyclic systolic array [93, 94] based accelerator. Theaccelerator is optimized to reduce both the memory band-width and the memory footprint. Feedback logic enablingcyclic systolic behavior allows us to fix the required numberof memory ports [93] and to reduce memory footprint.A GenASM-DC accelerator consists of a processing block(PB; Figure 7a) along with a control and memory management logic. A PB consists of multiple processing elements (PEs).Each PE contains a single processing core (PC; Figure 7b) andflip-flop-based storage logic. The PC is the primary computeunit, and implements Lines 15–19 of Algorithm 1 to performthe approximate string matching for a w -bit query pattern.The number of PEs in a PB is based on compute, area, memorybandwidth and power requirements. This block also imple-ments the logic to load data from outside of the array (i.e.,DC-SRAM; Figure 7a) or internally for cyclic operations.GenASM-DC uses two types of SRAM buffers (Figure 7a):(1) DC-SRAM, which stores the reference text, the patternbitmasks for the query read, and the intermediate data gener-ated from PEs (i.e., oldR values and MSBs required for shifts;Section 5); and (2) TB-SRAM, which stores the intermediatebitvectors from GenASM-DC for later use by GenASM-TB.For a 64-PE configuration with 64 bits of processing per PE,and for the case where we have a long (10Kbp) read witha high error rate (15%) and a corresponding text region of11.5Kbp, GenASM-DC requires a total of 8KB DC-SRAM stor-age. For each PE, we have a dedicated TB-SRAM, which storesthe match, insertion and deletion bitvectors generated by thecorresponding PE. For the same configuration of GenASM-DC, each PE requires a total of 1.5KB TB-SRAM storage, witha single R/W port. In each cycle, 192 bits of data (24B) iswritten to each TB-SRAM by each PE.When each thread (i.e., each column) in Figure 5 is mappedto a PE, GenASM-DC coordinates the data dependenciesacross DC iterations, with the help of two flip-flops in eachPE. For example, T2–R2 in Figure 5 is generated by PE x in Cycle y , and is mapped to R [ d ]. In order to generate T2–R2,T2–R1 (which maps to R [ d – 1]) needs to be generated by PE x –1 in Cycle y –1 ( in Figure 7), T1–R1 (which maps to oldR [ d – 1]) needs to be generated by PE x –1 in Cycle y –2 ( ),and T1–R2 (which maps to oldR [ d ]) needs to be generatedby PE x in Cycle y –1 ( ), where x is the PE index and y is thecycle index. With this dependency-aware mapping, regard-less of the number of instantiated PEs, we can successfullylimit DC-SRAM traffic for a single PB to only one read andone write per cycle. GenASM-TB Hardware.

After GenASM-DC finisheswriting all of the intermediate bitvectors to TB-SRAMs,GenASM-TB reads them by following an irregular controlflow, which depends on the text and the pattern to find theoptimal alignment (by implementing Algorithm 2).In our GenASM configuration, where we have 64 PEs and64 bits per PE in a GenASM-DC accelerator, and the win-dow size ( W ) is 64 (Section 6), we have one 1.5KB TB-SRAM(which fits our 24B/cycle ×

64 cycles/window output storagerequirement) for each of the 64 PEs. As Figure 8 shows, asingle GenASM-TB accelerator is connected to all of these64 TB-SRAMs (96KB, in total). In each GenASM-TB cycle,we read from only one TB-SRAM. curError provides the Although we use 10Kbp-long reads in our analysis (Section 9), GenASMdoes not have any limitation on the length of reads as a result of our divide-and-conquer approach (Section 6). (a) Processing Block (PB), DC-SRAM and TB-SRAMs for each PE (b) Processing Core (PC)

TB-SRAM p-1

OldRoutPM out OldRoutPMoutOldR in

Intermediate Bitvectors

PM in

PCPE PCPE PCPE p-1

PCPE p TB-SRAM p TB-SRAM TB-SRAM DC-SRAM

OldR[d-1] <<<<<<

R[d-1]OldR[d]PatternMask MatchR[d]SubstitutionInsertionDeletion

12 3

Figure 7. Hardware design of GenASM-DC. ndex of the TB-SRAM that we read from; textI provides thestarting index within this TB-SRAM, which we read the nextset of bitvectors from; and patternI provides the position ofthe 0 being processed (Algorithm 2).We implement the GenASM-TB hardware using very sim-ple logic (Figure 8), which reads the bitvectors from one ofthe TB-SRAMs using the computed address, performs therequired bitwise comparisons to find the CIGAR characterfor the current position, and computes the next TB-SRAMaddress to read the new set of bitvectors. After GenASM-TBfinds the complete CIGAR string, it writes the output to mainmemory and completes its execution. PE PE PE GenASM-DC

RdWrBitwise Comparisons

CIGAR stringLast CIGAR curError << match CIGARout insertiondeletionsubs

64 6464 64 to main memory GenASM-TB patternI textI

Next Rd AddrCompute Figure 8. Hardware design of GenASM-TB.

Overall System.

We design our system to take advantageof modern 3D-stacked memory systems [58, 92], such as theHybrid Memory Cube (HMC) [76] or High-Bandwidth Mem-ory (HBM) [86, 99]. Such memories are made up of multiplelayers of DRAM arrays that are stacked vertically in a singlepackage. These layers are connected via high-bandwidth linkscalled through-silicon vias (TSVs) that provide lower-latencyand more energy-efficient data access to the layers than theexternal DRAM I/O pins [39, 99]. Memories such as HMCand HBM include a dedicated logic layer that connects tothe TSVs and allows processing elements to be implementedin memory to exploit the efficient data access. Due to ther-mal and area constraints, only simple processing elementsthat execute low-complexity operations (e.g., bitwise logic,simple arithmetic, simple cores) can be included in the logiclayer [3, 4, 23, 24, 43, 56, 72, 73, 91, 119, 137].We decide to implement GenASM in the logic layer of 3D-stacked memory, for two reasons. First, we can exploit thenatural subdivision within 3D-stacked memory (e.g., vaultsin HMC [76], pseudo-channels in HBM [86]) to efficiently en-able parallelism across multiple GenASM accelerators. Thissubdivision allows accelerators to work in parallel withoutinterfering with each other. Second, we can reduce the powerconsumed for DRAM accesses by reducing off-chip data move-ment across the memory channel [119]. Both of our hardwareaccelerators are highly efficient in terms of area and power(Section 10.1), and can fit within the logic layer’s constraints.To illustrate how GenASM takes advantage of 3D-stackedmemory, we discuss an example implementation of GenASMinside the logic layer of a 16GB HMC with 32 vaults [76].Within each vault, the logic layer contains a GenASM-DCaccelerator, its associated DC-SRAM (8KB), a GenASM-TBaccelerator, and TB-SRAMs (64 ×

8. GenASM Framework

We demonstrate the efficiency and flexibility of theGenASM acceleration framework by describing three usecases of approximate string matching in genome sequenceanalysis: (1) read alignment step of short and long read map-ping, (2) pre-alignment filtering for short reads, and (3) editdistance calculation between any two sequences. We believethe GenASM framework can be useful for many other usecases, and we discuss some of them briefly in Section 11.

Read Alignment of Short and Long Reads.

As we ex-plain in Section 2.1, read alignment is the last step of shortand long read mapping. In read alignment, all of the remain-ing candidate mapping regions of the reference genome andthe query reads are aligned, in order to identify the mappingthat yields either the lowest total number of errors (if usingedit distance based scoring) or the highest score (if usinga user-defined scoring function). Thus, read alignment canbe a use case for approximate string matching, since errors(i.e., substitutions, insertions, deletions) should be taken intoaccount when aligning the sequences. As part of read align-ment, we also need to generate the traceback output for thebest alignment between the reference region and the read.For read alignment, the whole GenASM pipeline, as ex-plained in Section 4, should be executed, including the trace-back step. In general, read alignment requires more complexscoring schemes, where different types of edits have non-unitcosts. Thus, GenASM-TB should be configured based on thegiven cost of each type of edit (Section 6). As GenASM frame-work can work with arbitrary length sequences, we can useit to accelerate both short read and long read alignment.

Pre-Alignment Filtering for Short Reads.

In the pre-alignment filtering step of short read mapping, the candidatemapping locations, reported by the seeding step, are fur-ther filtered by using different mechanisms. Although theregions of the reference at these candidate mapping loca-tions share common seeds with query reads, they are notnecessarily similar sequences. To avoid examining dissimi-lar sequences at the downstream computationally-expensiveread alignment step, a pre-alignment filter estimates the editdistance between every read and the regions of the referenceat each read’s candidate mapping locations, and uses thisestimation to quickly decide whether or not read alignmentis needed. If the sequences are dissimilar enough, significantamount of time is saved by avoiding the expensive alignmentstep [9, 10, 13, 176, 177].In pre-alignment filtering, since we only need to estimate(approximately) the edit distance and check whether it isabove a user-defined threshold, GenASM-DC can be used asa pre-alignment filter. As GenASM-DC is very efficient whenwe have shorter sequences and a low error threshold (due tothe O ( m × n × k ) complexity of the underlying Bitap algo-rithm, where m is the query length, n is the reference length,and k is the number of allowed errors), GenASM frameworkcan efficiently accelerate the pre-alignment filtering step ofespecially short read mapping. Edit Distance Calculation.

Edit distance, also called Lev-enshtein distance [100], is the minimum number of edits (i.e.,substitutions, insertions and deletions) required to convertone sequence to another. Edit distance calculation is one ofthe fundamental operations in genomics to measure the simi-larity or distance between two sequences [155]. As we explainin Section 2.3, the Bitap algorithm, which is the underlyingalgorithm of GenASM-DC, is originally designed for edit dis-tance calculation. Thus, GenASM framework can accelerate Although we believe that GenASM can also be used as a pre-alignmentfilter for long reads, we leave the evaluation of this use case for future work. dit distance calculation between any two arbitrary-lengthgenomic sequences.Although GenASM-DC can find the edit distance by itselfand traceback is optional for this use case, DC-TB interactionis required in our accelerator to exploit the efficient divide-and-conquer approach GenASM follows. Thus, GenASM-DCand GenASM-TB work together to find the minimum editdistance in a fast and memory-efficient way, but the tracebackoutput is not generated or reported by default (though it canoptionally be enabled).

9. Evaluation Methodology

Area and Power Analysis.

We synthesize and place &route the GenASM-DC and GenASM-TB accelerator data-paths using the Synopsys Design Compiler [156] with a typi-cal 28nm low-power process, with memories generated usingan industry-grade SRAM compiler, to analyze the acceler-ators’ area and power. Our synthesis targets post-routingtiming closure at 1GHz clock frequency. We then use anin-house cycle-accurate simulator parameterized with thesynthesis and memory estimations to drive the performanceand power analysis.We evaluate a 16GB HMC-like 3D-stacked DRAM archi-tecture, with 32 vaults [76] and 256GB/s of internal band-width [23, 76], and a clock frequency of 1.25GHz [76]. Theamount of available area in the logic layer for GenASM isaround 3.5–4.4 mm per vault [23, 43]. The power budget ofour PIM logic per vault is 312mW [43]. Performance Model.

We build a spreadsheet-based ana-lytical model for GenASM-DC and GenASM-TB, which con-siders reference genome (i.e., text) length, query read (i.e.,pattern) length, maximum edit distance, window size, hard-ware design parameters (number of PEs, bit width of each PE)and number of vaults as input parameters and projects com-pute cycles, DRAM read/write bandwidth, SRAM read/writebandwidth, and memory footprint. We verify the analytically-estimated cycle counts for various PE configurations with thecycle counts collected from our RTL simulations.

Read Alignment Comparisons.

For the read alignmentuse case, we compare GenASM with the read alignment stepsof two commonly-used state-of-the-art read mappers: Min-imap2 [102] and BWA-MEM [101], running on an Intel ® Xeon ® Gold 6126 CPU [80] operating at 2.60GHz, with 64GBDDR4 memory. Software baselines are run with a singlethread and with 12 threads. We measure the execution timeand power consumption of the alignment steps in Minimap2and BWA-MEM. We measure the individual power consumedby each tool using Intel’s PCM power utility [81].We also compare GenASM with a state-of-the-art GPU-accelerated short read alignment tool, GASAL2 [2]. We runGASAL2 on an Nvidia Titan V GPU [129] with 12GB HBM2memory [86]. To fully utilize the GPU, we configure thenumber of alignments per batch based on the GPU’s numberof multiprocessors and the maximum number of threads permultiprocessor, as described in the GASAL2 paper [2]. Tobetter analyze the high parallelism that the GPU provides,we replicate our datasets to obtain datasets with 100K, 1Mand 10M reference-read pairs for short reads. We run thedatasets with GASAL2, and collect kernel time and averagepower consumption using nvprof [130].We also compare GenASM with two state-of-the-arthardware-based alignment accelerators, GACT of Darwin[162] and SillaX of GenAx [55]. We synthesize and executethe open-source RTL for GACT [161]. We estimate the perfor-mance of SillaX using data reported by the original work [55].We analyze the alignment accuracy of GenASM by compar-ing the alignment outputs (i.e., alignment score, edit distance,and CIGAR string) of GenASM with the alignment outputs of BWA-MEM and Minimap2, for short reads and long reads,respectively. We obtain the BWA-MEM and Minimap2 align-ments by running the tools with their default settings.

Pre-Alignment Filtering Comparisons.

We compareGenASM with Shouji [9], which is the state-of-the-art FPGA-based pre-alignment filter for short reads. For execution timeand filtering accuracy analyses, we use data reported by theoriginal work [9]. For power analysis, we report the totalpower consumption of Shouji using the power analysis toolin Xilinx Vivado [175], after synthesizing and implementingthe open-source FPGA design of Shouji [149].

Edit Distance Calculation Comparisons.

We compareGenASM with the state-of-the-art software-based read align-ment library, Edlib [155], running on an Intel ® Xeon ® Gold6126 CPU [80] operating at 2.60GHz, with 64GB DDR4 mem-ory. Edlib uses the Myers’ bitvector algorithm [121] to findthe edit distance between two sequences. We use the defaultglobal Needleman-Wunsch (NW) [126] mode of Edlib to per-form our comparisons. We measure the power consumed byEdlib using Intel’s PCM power utility [81].We also compare GenASM with ASAP [22], which is thestate-of-the-art FPGA-based accelerator for computing theedit distance between two short reads. We estimate the perfor-mance of ASAP using data reported by the original work [22].

Datasets.

For the read alignment use case, we evaluateGenASM using the latest major release of the human genomeassembly, GRCh38 [124]. We use the 1–22, X, and Y chromo-somes by filtering the unmapped contigs, unlocalized contigs,and mitochondrial genome. Genome characters are encodedinto 2-bit patterns (A = 00, C = 01, G = 10, T = 11). With thisencoding, the reference genome uses 715 MB of memory.We generate four sets of long reads (i.e., PacBio and ONTdatasets) using PBSIM [131] and three sets of short reads (i.e.,Illumina datasets) using Mason [71]. For the PacBio datasets,we use the default error profile for the continuous long reads(CLR) in PBSIM. For the ONT datasets, we modify the settingsto match the error profile of ONT reads sequenced using R9.0chemistry [84]. Both datasets have 240,000 reads of length10Kbp, each simulated with 10% and 15% error rates. TheIllumina datasets have 200,000 reads of length 100bp, 150bp,and 250bp, each simulated with a 5% error rate.For the pre-alignment filtering use case, we use twodatasets that Shouji [9] provides as test cases: reference-readpairs (1) of length 100bp with an edit distance threshold of 5,and (2) of length 250bp with an edit distance threshold of 15.For the edit distance calculation use case, we use thepublicly-available dataset that Edlib [155] provides. Thedataset includes two real DNA sequences, which are 100Kbpand 1Mbp in length, and artificially-mutated versions of theoriginal DNA sequences with measures of similarity rangingbetween 60%–99%. Evaluating this set of sequences with vary-ing values of similarity and length enables us to demonstratehow these parameters affect performance.

10. Results

Table 1 shows the area and power breakdown of each com-ponent in GenASM, and the total area overhead and powerconsumption of (1) a single GenASM accelerator (in 1 vault)and (2) 32 GenASM accelerators (in 32 vaults). Both GenASM-DC and GenASM-TB operate at 1GHz.The area overhead of one GenASM accelerator is0.334 mm , and the power consumption of one GenASM accel-erator, including the SRAM power, is 101 mW. When we com-pare GenASM with a single core of a modern Intel ® Xeon ® Gold 6126 CPU [80] (which we conservatively estimate touse 10.4 W [80] and 32.2 mm [36] per core), we find thatenASM is significantly more efficient in terms of both areaand power consumption. As we have one GenASM acceler-ator per vault, the total area overhead of GenASM in all 32vaults is 10.69 mm . Similarly, the total power consumptionof 32 GenASM accelerators is 3.23 W. Table 1. Area and power breakdown of GenASM.

Component Area (mm ) Power (W) GenASM-DC (64 PEs)

GenASM-TB

DC-SRAM (8 KB)

TB-SRAMs (64 x 1.5 KB)

Total − 1 vault (32 vaults)

Software Baselines (CPU).

Figure 9 shows the read align-ment throughput (reads/sec) of GenASM and the alignmentsteps of BWA-MEM and Minimap2, when aligning long noisyPacBio and ONT reads against the human reference genome.When comparing with BWA-MEM, we run GenASM with thecandidate locations reported by BWA-MEM’s filtering step.Similarly, when comparing with Minimap2, we run GenASMwith the candidate locations reported by Minimap2’s filteringstep. GenASM’s throughput is determined by the through-put of the execution of GenASM-DC and GenASM-TB withwindow size ( W ) of 64 and overlap size ( O ) of 24.As Figure 9 shows, GenASM provides (1) 7173 × and 648 × throughput improvement over the alignment step of BWA-MEM for its single-thread and 12-thread execution, respec-tively, and (2) 1126 × and 116 × throughput improvementover the alignment step of Minimap2 for its single-thread and12-thread execution, respectively. T h r o u g h p u t ( r e a d s / s e c ) BWA-MEM (t=1) BWA-MEM (t=12) GenASM (w/ BWA-MEM)Minimap2 (t=1) Minimap2 (t=12) GenASM (w/ Minimap2)648 𝑥 𝑥 Figure 9. Throughput comparison of GenASM and the align-ment steps of BWA-MEM and Minimap2 for long reads.

Based on our power analysis with long reads, we find thatpower consumption of BWA-MEM’s alignment step is 58.6 Wand 109.5 W, and power consumption of Minimap2’s readalignment step is 59.8 W and 118.9 W for their single-threadand 12-thread executions, respectively. GenASM consumesonly 3.23W, and thus reduces the power consumption of thealignment steps of BWA-MEM and Minimap2 by 18 × and19 × for single-thread execution, and by 34 × and 37 × for12-thread execution, respectively.Figure 10 compares the read alignment throughput(reads/sec) of GenASM with that of the alignment steps ofBWA-MEM and Minimap2, when aligning short Illuminareads against the human reference genome. GenASM pro-vides (1) 1390 × and 111 × throughput improvement overthe alignment step of BWA-MEM for its single-thread and12-thread execution, respectively, and (2) 1839 × and 158 × T h r o u g h p u t ( r e a d s / s e c ) BWA-MEM (t=1) BWA-MEM (t=12) GenASM (w/ BWA-MEM)Minimap2 (t=1) Minimap2 (t=12) GenASM (w/ Minimap2)111 𝑥 𝑥 Figure 10. Throughput comparison of GenASM and the align-ment steps of BWA-MEM and Minimap2 for short reads. throughput improvement over the alignment step of Min-imap2 for its single-thread and 12-thread execution.Based on our power analysis with short reads, we findthat GenASM reduces the power consumption over the align-ment steps of BWA-MEM and Minimap2 by 16 × and 18 × forsingle-thread execution, and by 33 × and 31 × for 12-threadexecution, respectively.Figure 11 shows the total execution time of the entire BWA-MEM and Minimap2 pipelines, along with the total executiontime when the alignment steps of each pipeline are replacedby GenASM, for the three representative input datasets. AsFigure 11 shows, GenASM provides (1) 2.4 × and 1.9 × speedupfor Illumina reads (250bp); (2) 6.5 × and 3.4 × speedup forPacBio reads (15%); and (3) 4.9 × and 2.1 × speedup for ONTreads (15%), over the entire pipeline executions of BWA-MEMand Minimap2, respectively. E x e c u t i o n t i m e ( s e c ) BWA-MEM GenASM (w/ BWA-MEM) Minimap2 GenASM (w/ Minimap2)2.4 𝑥 𝑥 𝑥 𝑥 𝑥 𝑥 Figure 11. Total execution time of the entire BWA-MEM andMinimap2 pipelines with and without GenASM.

Software Baselines (GPU).

We compare GenASM withthe state-of-the-art GPU aligner, GASAL2 [2], using threedatasets of varying size (100K, 1M, and 10M reference-readpairs). Based on our analysis, we make three findings. First,for 100bp Illumina reads, GenASM provides 9.9 × , 9.2 × , and8.5 × speedup over GASAL2, while reducing the power con-sumption by 15.6 × , 17.3 × and 17.6 × for 100K, 1M, and 10Mdatasets, respectively. Second, for 150bp Illumina reads,GenASM provides 15.8 × , 13.1 × , and 13.4 × speedup overGASAL2, while reducing the power consumption by 15.4 × ,18.0 × , and 18.7 × for 100K, 1M, and 10M datasets, respec-tively. Third, for 250bp Illumina reads, GenASM provides21.5 × , 20.6 × , and 21.1 × speedup over GASAL2, while re-ducing the power consumption by 16.8 × , 20.2 × , and 20.6 × for 100K, 1M, and 10M datasets, respectively. We concludethat GenASM provides significant performance benefits andenergy efficiency over GPU aligners for short reads. Hardware Baselines.

We compare GenASM with twostate-of-the-art hardware accelerators for read alignment:GACT (from Darwin [162]) and SillaX (from GenAx [55]).Darwin is a hardware accelerator designed for long readalignment [162]. Darwin contains components that acceler-ate both the filtering (D-SOFT) and alignment (GACT) stepsof read mapping. The open-source RTL code available forthe GACT accelerator of Darwin allows us to estimate thethroughput, area and power consumption of GACT and com-pare it with GenASM for read alignment. In Darwin, GACTlogic and the associated 128KB SRAM are responsible for fill-ing the dynamic programming matrix, generating the trace-back pointers and finding the maximum score. Thus, webelieve that it is fair to compare the power consumption andthe area of the GACT logic and GenASM logic, along withtheir associated SRAMs.In order to have an iso-bandwidth comparison with Dar-win’s GACT, we compare only a single array of GACTand a single set of GenASM-DC and GenASM-TB, because(1) GenASM utilizes the high memory bandwidth that PIMprovides only to parallelize many sets of GenASM-DC andGenASM-TB, and a single set of GenASM-DC and GenASM-TB does not require high bandwidth, and (2) all internal dataof both GenASM and Darwin is provided by local SRAMs. Wesynthesize both designs (i.e., GenASM and GACT) at an iso-VT (process, voltage, temperature) corner, with the samenumber of PEs, and with their optimum parameters.As Figure 12 shows, for a single GACT array with 64 PEs at1GHz, the throughput of GACT decreases from 55,556 to 6,289alignments per second when the sequence length increasesfrom 1Kbp to 10Kbp, while consuming 277.7 mW of power. Incomparison, for a single GenASM accelerator at 1GHz (with a64-PE configuration), the throughput decreases from 236,686to 23,669 alignments per second when the sequence lengthincreases from 1Kbp to 10Kbp, while consuming 101 mW ofpower. This shows that, on average, GenASM provides 3.9 × better throughput than GACT, while consuming 2.7 × lesspower. Thus, GenASM provides 10.5 × better throughput perunit power for long reads when compared to GACT. T h r o u g h p u t ( r e a d s / s e c ) GACT (Darwin) GenASM 3.9 𝑥 Figure 12. Throughput comparison of GenASM and GACTfrom Darwin for long reads.

As Figure 13 shows, we also compare the throughput ofGenASM and GACT for short read alignment (i.e., 100–300bpreads). We find that GenASM performs 7.4 × better thanGACT when aligning short reads, on average. Thus, GenASMprovides 20.0 × better throughput per unit power for shortreads when compared to GACT. T h r o u g h p u t ( r e a d s / s e c ) GACT (Darwin) GenASM 7.4 𝑥 Figure 13. Throughput comparison of GenASM and GACTfrom Darwin for short reads.

We compare the required area for the GACT logic with128KB of SRAM and the required area for the GenASM logic(GenASM-DC and GenASM-TB) with 8KB of DC-SRAM and96KB of TB-SRAMs, at 28nm. We find that GenASM requires1.7 × less area than GACT. Thus, GenASM provides 6.6 × and12.6 × better throughput per unit area for long reads and forshort reads, respectively, when compared to GACT.The main difference between GenASM and GACT is the un-derlying algorithms. GenASM uses our modified Bitap algo-rithm, which requires only simple and fast bitwise operations.On the other hand, GACT uses the complex and computation-ally more expensive dynamic programming based algorithmfor alignment. This is the main reason why GenASM is moreefficient than GACT of Darwin.GenAx is a hardware accelerator designed for short readalignment [55]. Similar to Darwin, GenAx accelerates boththe filtering and alignment steps of read mapping. UnlikeGenAx, whose design is optimized only for short reads,GenASM is more robust and works with both short andlong reads. While we are unable to reimplement GenAx,the throughput analysis of SillaX (the alignment acceleratorof GenAx) provided by the original work [55] allows us toprovide a performance comparison between GenASM andSillaX for short read alignment.We compare SillaX with GenASM at their optimal oper-ating frequencies (2GHz for SillaX, 1GHz for GenASM), andfind that GenASM provides 1.9 × higher throughput for shortreads (101bp) than SillaX (whose approximate throughputis 50M alignments per second). Using the area and power numbers reported for the computation logic of SillaX, wefind that GenASM requires 63% less logic area (2.08 mm vs.5.64 mm ) and 82% less logic power (1.18 W vs. 6.6 W).In order to compare the total area of SillaX and GenASM,we perform a CACTI-based analysis [172] for the SillaX SRAM(2.02 MB). We find that the SillaX SRAM consumes an areaof 3.47 mm , resulting in a total area of 9.11 mm for Sil-laX. Although GenASM (10.69 mm ) requires 17% more totalarea than SillaX, we find that GenASM provides 1.6 × betterthroughput per unit area for short reads than SillaX. Accuracy Analysis.

We compare the traceback outputs ofGenASM and (1) BWA-MEM for short reads, (2) Minimap2 forlong reads, to assess the accuracy and correctness of GenASM-TB. We find that the optimum ( W , O ) setting (i.e., windowsize and overlap size) for the GenASM-TB algorithm in termsof performance and accuracy is W = 64 and O = 24. Withthis setting, GenASM completes the alignment of all reads ineach dataset, and increasing the window size does not changethe alignment output.For short reads, we use the default scoring setting of BWA-MEM (i.e., match=+1, substitution=-4, gap opening=-6, andgap extension=-1). For 96.6% of the short reads, GenASMfinds an alignment whose score is equal to the score of thealignment reported by BWA-MEM. This fraction increases to99.7% when we consider scores that are within ± ± ± Although GenASM is optimized for unit-cost based scoring(i.e., edit distance) and currently provides only partial supportfor more complex scoring schemes, we show that GenASMframework can still serve as a fast, memory- and power-efficient, and quite accurate alternative for read alignment.

We compare GenASM with the state-of-the-art FPGA-based pre-alignment filter for short reads, Shouji [9], us-ing two datasets provided in [9]. When we compare Shouji(with maximum filtering units) and GenASM for the datasetwith 100bp sequences, we find that GenASM provides 3.7 × speedup over Shouji, while reducing power consumption by1.7 × . When we perform the same analysis with 250bp se-quences, we find that GenASM does not provide speedupover Shouji, but reduces power consumption by 1.6 × . We can add support for different orderings by adding more configura-bility to the GenASM-TB accelerator, which we leave for future work. n pre-alignment filtering for short reads, only GenASM-DC is executed (Section 8). The complexity of GenASM-DCis O ( n × m × k ) whereas the complexity of Shouji is O ( m × k ),where n is the text length, m is the read length, and k is theedit distance threshold. Going from the 100bp dataset to the250bp dataset, all these three parameters increase linearly.Thus, the speedup of GenASM over Shouji for pre-alignmentfiltering decreases for datasets with longer reads.To analyze filtering accuracy, we use Edlib [155] to gener-ate the ground truth edit distance value for each sequencepair in the datasets (similar to Shouji). We evaluate the accu-racy of GenASM as a pre-alignment filter by computing itsfalse accept rate and false reject rate (as defined in [9]).The false accept rate [9] is the ratio of the number of dis-similar sequences that are falsely accepted by the filter (assimilar) and the total number of dissimilar sequences that arerejected by the ground truth. The goal is to minimize the falseaccept rate to maximize the number of dissimilar sequencesthat are eliminated by the filter. For the 100bp dataset withan edit distance threshold of 5, Shouji has a 4% false acceptrate, whereas GenASM has a false accept rate of only 0.02%.For the 250bp dataset with an edit distance threshold of 15,Shouji has a 17% false accept rate, whereas GenASM has afalse accept rate of only 0.002%. Thus, GenASM provides avery low rate of falsely-accepted dissimilar sequences, andsignificantly improves the accuracy of pre-alignment filteringcompared to Shouji.While Shouji approximates the edit distance, GenASM cal-culates the actual distance. Although calculation requiresmore computation than approximation, a computed distanceresults in a near-zero (0.002%) false accept rate. Thus,GenASM filters more false-positive locations out, leavingfewer candidate locations for the expensive alignment stepto process. This greatly reduces the combined execution timeof filtering and alignment. Thus, even though GenASM doesnot provide any speedup over Shouji when filtering the 250bpsequences, its lower false accept rate makes it a better optionfor this step of the pipeline with greater overall benefits.The false reject rate [9] is the ratio of the number of similarsequences that are rejected by the filter (as dissimilar) andthe total number of similar sequences that are accepted bythe ground truth. The false reject rate should always be equalto 0%. We observe that GenASM always provides a 0% falsereject rate, and thus does not filter out similar sequence pairs,as does Shouji.

We compare GenASM with the state-of-the-art edit dis-tance calculation library, Edlib [155]. Figure 14 compares theexecution time of Edlib (with and without finding the trace-back output) and GenASM when finding the edit distancebetween two sequences of length 100Kbp, and also two se-quences of length 1Mbp, which have similarity ranging from60% to 99% (Section 9). Since Edlib is a single-thread editdistance calculation tool, for a fair comparison, we comparethe throughput of only one GenASM accelerator (i.e., in onevault) with a single-thread execution of the Edlib tool.As Figure 14 shows, when performing edit distance cal-culation between two 100Kbp sequences, GenASM provides22–716 × and 146–1458 × speedup over Edlib execution with-out and with traceback, respectively. GenASM has the same The reason for the non-zero false accept rate of GenASM is that whenthere is a deletion in the first character of the query, GenASM does not countthis as an edit, and skips this extra character of the text when computingthe edit distance. Since GenASM reports an edit distance that is one lowerthan the edit distance reported by the ground truth, if GenASM’s reportededit distance is below the threshold but the ground truth’s is not, GenASMleads to a false accept. execution time for both of the cases. When the sequencelength increases from 100Kbp to 1Mbp, the execution time ofGenASM increases linearly (since W is constant, but m + k increases linearly). However, due to its quadratic complexity,Edlib cannot scale linearly. Thus, for the edit distance calcu-lation of 1Mbp sequences, GenASM provides 262–5413 × and627–12501 × speedup over Edlib execution without and withtraceback, respectively.Although both the GenASM algorithm and Edlib’s under-lying Myers’ algorithm [121] use bitwise operations only foredit distance calculation and exploit bit-level parallelism, themain advantages of the GenASM algorithm come from (1) thedivide-and-conquer approach we follow for efficient supportfor longer sequences, and (2) our efficient co-design of theGenASM algorithm with the GenASM hardware accelerator. Figure 14. Execution time comparison of GenASM and Edlibfor edit distance calculation.

Based on our power analysis, we find that power con-sumption of Edlib is 55.3 W and 58.8 W when finding theedit distance between two 100Kbp sequences and two 1Mbpsequences, respectively. Thus, GenASM reduces power con-sumption by 548 × and 582 × over Edlib, respectively.We also compare GenASM with ASAP [22], the state-of-the-art FPGA-based accelerator for edit distance calculation.While we are unable to reimplement ASAP, the executiontime and power consumption analysis of ASAP providedin [22] allows us to provide a comparison between GenASMand ASAP. ASAP is optimized for shorter sequences andreports execution time only for sequences of length 64bp–320bp [22]. Based on [22], the execution time of one ASAPaccelerator increases from 6.8 µs to 18.8 µs when the sequencelength increases from 64bp to 320bp, while consuming 6.8 Wof power. In comparison, we report that the execution time ofone GenASM accelerator increases from 0.017 µs to 2.025 µswhen the sequence length increases from 64bp to 320bp, whileconsuming 0.101 W of power. This shows that GenASM pro-vides 9.3–400 × speedup over ASAP, while consuming 67 × less power. GenASM’s performance improvements come from our al-gorithm/hardware co-design, i.e., both from our modifiedalgorithm and our co-designed architecture for this algo-rithm. The sources of the large improvements in GenASM are(1) the very simple computations it performs; (2) the divide-and-conquer approach we follow, which makes our designefficient for both short and long reads despite their differenterror profiles; and (3) the very high degree of parallelismobtained with the help of specialized compute units, dedi-cated SRAMs for both GenASM-DC and GenASM-TB, andthe vault-level parallelism provided by processing in the logiclayer of 3D-stacked memory.

Algorithm-Level.

Our divide-and-conquer approach al-lows us to decrease the execution time of GenASM-DCfrom ( m × ( m + k ) × kP × w ) cycles to (( W × W × min ( W , k ) P × w ) × m + kW – O ) cycles,where m is the pattern size, k is the edit distance threshold, P is the number of PEs that GenASM-DC has (i.e., 64), w is the number of bits processed by each PE (i.e., 64), W isthe window size (i.e., 64), and O is the overlap size betweenwindows (i.e., 24). Although the total GenASM-TB executionime does not change (( m + k ) cycles vs. (( W – O ) × m + kW – O )cycles), our divide-and-conquer approach helps us decreasethe GenASM-DC execution time by 3662 × for long reads, andby 1.6 – 3.9 × for short reads. Hardware-Level.

GenASM-DC’s systolic-array-based de-sign removes the data dependency limitation of the underly-ing Bitap algorithm, and provides 64 × parallelism by perform-ing 64 iterations of the GenASM-DC algorithm in parallel.Our hardware accelerator for GenASM-TB makes use of spe-cialized per-PE TB-SRAMs, which eliminates the otherwisevery high memory bandwidth consumption of traceback andenables efficient execution. Technology-Level.

With the help of 3D-stacked mem-ory’s vault-level parallelism, we can obtain 32 × parallelismby performing 32 alignments in parallel in different vaults.

11. Other Use Cases of GenASM

We have quantitatively evaluated three use cases of ap-proximate string matching for genome sequence analysis(Section 10). We discuss four other potential use cases ofGenASM, whose evaluation we leave for future work.

Read-to-Read Overlap Finding Step of de Novo As-sembly.

De novo assembly [31] is an alternate genome se-quencing approach that assembles an entire DNA sequencewithout the use of a reference genome. The first step of denovo assembly is to find read-to-read overlaps since the refer-ence genome does not exist [152]. Pairwise read alignment(i.e., read-to-read alignment) is the last step of read-to-readoverlap finding [102, 138]. As sequencing devices can intro-duce errors to the reads, read alignment in overlap findingalso needs to take these errors into account. GenASM can beused for the pairwise read alignment step of overlap finding.

Hash-Table Based Indexing.

In the indexing step of readmapping, the reference genome is indexed and stored as ahash table, whose keys are all possible fixed-length substrings(i.e., seeds) and whose values are the locations of these seedsin the reference genome. This index structure is queried inthe seeding step to find the candidate matching locations ofquery reads. As we need to find the locations of each seed inthe reference text to form the index structure, GenASM canbe used to generate the hash-table based index.

Whole Genome Alignment.

Whole genome alignment[42, 136] is the method of aligning two genomes (from thesame or different species) for predicting evolutionary or famil-ial relationships between these genomes. In whole genomealignment, we need to align two very long sequences. SinceGenASM can operate on arbitrary-length sequences as a re-sult of our divide-and-conquer approach, whole genome align-ment can be accelerated using the GenASM framework.

Generic Text Search.

Although GenASM-DC is opti-mized for genomic sequences (i.e., DNA sequences), which arecomposed of only 4 characters (i.e., A, C, G and T), GenASM-DC can be extended to support larger alphabets, thus enablinggeneric text search. When generating the pattern bitmasksduring the pre-processing step, the only change that is re-quired is to generate bitmasks for the entire alphabet, insteadof for only four characters. There is no change required tothe edit distance calculation step.As special cases of general text search, the alphabet can bedefined as RNA bases (i.e., A, C, G, U) for RNA sequences oras amino acids (i.e., A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P,S, T, W, Y, V) for protein sequences. This enables GenASMto be used for RNA sequence alignment or protein sequencealignment [15, 16, 44, 67, 69, 90, 108, 126, 126, 128, 154, 157, 182].

12. Related Work

To our knowledge, this is the first approximate stringmatching acceleration framework that enhances and acceler- ates the Bitap algorithm, and demonstrates the effectivenessof the framework for multiple use cases in genome sequenceanalysis. Many previous works have attempted to improve (insoftware or in hardware) the performance of a single step ofthe genome sequence analysis pipeline. Recent accelerationworks tend to follow one of two key directions [8].The first approach is to build pre-alignment filters thatuse heuristics to first check the differences between two ge-nomic sequences before using the computationally-expensiveapproximate string matching algorithms. Examples of suchfilters are the Adjacency Filter [177] that is implemented forstandard CPUs, SHD [176] that uses SIMD-capable CPUs, andGRIM-Filter [91] that is built in 3D-stacked memory. Manyworks also exploit the large amounts of parallelism offered byFPGA architectures for pre-alignment filtering, such as Gate-Keeper [10], MAGNET [11], Shouji [9], and SneakySnake [13].A recent work, GenCache [122], proposes an in-cache accel-erator to improve the filtering (i.e., seeding) mechanism ofGenAx (for short reads) by using in-cache operations [1] andsoftware modifications.The second approach is to use hardware accelerators forthe computationally-expensive read alignment step. Ex-amples of such hardware accelerators are RADAR [74],FindeR [181], and AligneR [180], which make use of ReRAMbased designs for faster FM-index search, or RAPID [65] andBioSEAL [88], which target dynamic programming accelera-tion with processing-in-memory. Other read alignment ac-celeration works include SIMD-capable CPUs [38], multicoreCPUs [57, 109], and specialized hardware accelerators suchas GPUs (e.g., GSWABE [109], CUDASW++ 3.0 [110]), FPGAs(e.g., FPGASW [49], ASAP [22]), or ASICs (e.g., Darwin [162]and GenAx [55]).In contrast to GenASM, all of these prior works focus on ac-celerating only a single use case in genome sequence analysis,whereas GenASM is capable of accelerating at least three dif-ferent use cases (i.e., read alignment, pre-alignment filtering,edit distance calculation) where approximate string matchingis required.

13. Conclusion

We propose GenASM, an approximate string matching(ASM) acceleration framework for genome sequence analy-sis built upon our modified and enhanced Bitap algorithm.GenASM performs bitvector-based ASM, which can acceler-ate multiple steps of genome sequence analysis. We co-designour highly-parallel, scalable and memory-efficient algorithmswith low-power and area-efficient hardware accelerators. Weevaluate GenASM for three different use cases of ASM ingenome sequence analysis for both short and long reads:read alignment, pre-alignment filtering, and edit distance cal-culation. We show that GenASM is significantly faster andmore power- and area-efficient than state-of-the-art softwareand hardware tools for each of these use cases. We hope thatGenASM inspires future work in co-designing algorithmsand hardware together to create powerful frameworks thataccelerate other bioinformatics workloads and emerging ap-plications.

Acknowledgments

Part of this work was completed during Damla Senol Cali’sinternship at Intel Labs. This work is supported by fundingfrom Intel, the Semiconductor Research Corporation, theNational Institutes of Health (NIH), the industrial partners ofthe SAFARI Research Group, and partly by EMBO InstallationGrant 2521 awarded to Can Alkan. We thank the anonymousreviewers of MICRO 2019, ASPLOS 2020, ISCA 2020, andMICRO 2020 for their comments. eferences [1] S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, andR. Das, “Compute Caches,” in

HPCA , 2017.[2] N. Ahmed, J. Lévy, S. Ren, H. Mushtaq, K. Bertels, and Z. Al-Ars,“GASAL2: A GPU Accelerated Sequence Alignment Library for High-Throughput NGS Data,”

BMC Bioinformatics , 2019.[3] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing,” in

ISCA , 2015.[4] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: ALow-Overhead, Locality-Aware Processing-in-Memory Architecture,”in

ISCA , 2015.[5] C. Alkan, B. P. Coe, and E. E. Eichler, “Genome Structural VariationDiscovery and Genotyping,”

Nature Reviews Genetics , 2011.[6] C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hor-mozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, S. C. Sahinalp,R. A. Gibbs, and E. E. Eichler, “Personalized Copy Number and Seg-mental Duplication Maps Using Next-Generation Sequencing,”

NatureGenetics , 2009.[7] C. Alkan, S. Sajjadian, and E. E. Eichler, “Limitations of Next-Generation Genome Sequence Assembly,”

Nature Methods , 2011.[8] M. Alser, Z. Bingöl, D. Senol Cali, J. Kim, S. Ghose, C. Alkan, andO. Mutlu, “Accelerating Genome Analysis: A Primer on an OngoingJourney,”

IEEE Micro , 2020.[9] M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, “Shouji: A Fastand Efficient Pre-Alignment Filter for Sequence Alignment,”

Bioinfor-matics , 2019.[10] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “Gate-Keeper: A New Hardware Architecture for Accelerating Pre-Alignmentin DNA Short Read Mapping,”

Bioinformatics , 2017.[11] M. Alser, O. Mutlu, and C. Alkan, “MAGNET: Understanding andImproving the Accuracy of Genome Pre-Alignment Filtering,”

TIR ,2017.[12] M. Alser, J. Rotman, K. Taraszka, H. Shi, P. I. Baykal, H. T. Yang,V. Xue, S. Knyazev, B. D. Singer, B. Balliu, D. Koslicki, P. Skums, A. Ze-likovsky, C. Alkan, O. Mutlu, and S. Mangul, “Technology Dictates Al-gorithms: Recent Developments in Read Alignment,” arXiv:2003.00110[q-bio.GN], 2020.[13] M. Alser, T. Shahroodi, J. Gomez-Luna, C. Alkan, and O. Mutlu,“SneakySnake: A Fast and Accurate Universal Genome Pre-AlignmentFilter for CPUs, GPUs, and FPGAs,” arXiv:1910.09020 [q-bio.GN], 2019.[14] S. F. Altschul and B. W. Erickson, “Optimal Sequence Alignment usingAffine Gap Costs,”

Bulletin of Mathematical Biology , 1986.[15] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “BasicLocal Alignment Search Tool,”

Journal of Molecular Biology , 1990.[16] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller,and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A New Generationof Protein Database Search Programs,”

Nucleic Acids Research , 1997.[17] M. J. Alvarez-Cubero, M. Saiz, B. Martínez-García, S. M. Sayalero,C. Entrala, J. A. Lorente, and L. J. Martinez-Gonzalez, “Next GenerationSequencing: An Application in Forensic Sciences?”

Annals of HumanBiology , 2017.[18] S. L. Amarasinghe, S. Su, X. Dong, L. Zappia, M. E. Ritchie, and Q. Gouil,“Opportunities and Challenges in Long-Read Sequencing Data Analy-sis,”

Genome Biology , 2020.[19] S. Ardui, A. Ameur, J. R. Vermeesch, and M. S. Hestand, “SingleMolecule Real-Time (SMRT) Sequencing Comes of Age: Applicationsand Utilities for Medical Diagnostics,”

Nucleic Acids Research , 2018.[20] E. A. Ashley, “Towards Precision Medicine,”

Nature Reviews Genetics ,2016.[21] R. Baeza-Yates and G. H. Gonnet, “A New Approach to Text Searching,”

CACM , 1992.[22] S. S. Banerjee, M. El-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen, S. S.Lumetta, and R. K. Iyer, “ASAP: Accelerated Short-Read Alignment onProgrammable Hardware,” TC , 2019.[23] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu,R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu,“Google Workloads for Consumer Devices: Mitigating Data MovementBottlenecks,” in ASPLOS , 2018.[24] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarung-nirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng, and O. Mutlu,“CoNDA: Efficient Cache Coherence Support for Near-Data Accelera-tors,” in

ISCA , 2019.[25] C. Børsting and N. Morling, “Next Generation Sequencing and Its Ap-plications in Forensic Genetics,”

Forensic Science International: Genetics ,2015.[26] D. Branton, D. W. Deamer, A. Marziali, H. Bayley, S. A. Benner, T. Butler,M. D. Ventra, S. Garaj, A. Hibbs, X. Huang, S. B. Jovanovich, P. S. Krstic,S. Lindsay, X. S. Ling, C. H. Mastrangelo, A. Meller et al. , “The Potentialand Challenges of Nanopore Sequencing,”

Nature Biotechnology , 2008.[27] N. Bray, I. Dubchak, and L. Pachter, “AVID: A Global Alignment Pro-gram,”

Genome Research , 2003.[28] M. Brudno, C. B. Do, G. M. Cooper, M. F. Kim, E. Davydov, NISC Com-parative Sequencing Program, E. D. Green, A. Sidow, and S. Batzoglou, “LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale MultipleAlignment of Genomic DNA,”

Genome Research , 2003.[29] H. Carrillo and D. Lipman, “The Multiple Sequence Alignment Problemin Biology,”

SIAP , 1988.[30] M. Chaisson, P. Pevzner, and H. Tang, “Fragment Assembly with ShortReads,”

Bioinformatics , 2004.[31] M. J. Chaisson, R. K. Wilson, and E. E. Eichler, “Genetic Variation andthe De Novo Assembly of Human Genomes,”

Nature Reviews Genetics ,2015.[32] E. Check Hayden, “Technology: The 1,000 Genome,”

Nature News ,2014.[33] P. Chen, C. Wang, X. Li, and X. Zhou, “Accelerating the Next Gen-eration Long Read Mapping with the FPGA-Based System,”

TCBB ,2014.[34] L. Chin, J. N. Andersen, and P. A. Futreal, “Cancer Genomics: FromDiscovery Science to Personalized Medicine,”

Nature Medicine , 2011.[35] J. Clarke, H.-C. Wu, L. Jayasinghe, A. Patel, S. Reid, and H. Bayley,“Continuous Base Identification for Single-Molecule Nanopore DNASequencing,”

Nature Nanotechnology et al. , “Evolution and Epidemic Spreadof SARS-CoV-2 in Brazil,”

Science , 2020.[38] J. Daily, “Parasail: SIMD C Library for Global, Semi-Global, and LocalPairwise Sequence Alignments,”

BMC Bioinformatics , 2016.[39] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule,M. Steer, and P. D. Franzon, “Demystifying 3D ICs: The Pros and Consof Going Vertical,”

IEEE Design & Test of Computers , 2005.[40] D. Deamer, M. Akeson, and D. Branton, “Three Decades of NanoporeSequencing,”

Nature Biotechnology , 2016.[41] A. L. Delcher, S. Kasif, R. D. Fleischmann, J. Peterson, O. White, andS. L. Salzberg, “Alignment of Whole Genomes,”

Nucleic Acids Research ,1999.[42] C. N. Dewey, “Whole-Genome Alignment,” in

Evolutionary Genomics ,2019.[43] M. Drumond, A. Daglis, N. Mirzadeh, D. Ustiugov, J. Picorel, B. Falsafi,B. Grot, and D. Pnevmatikatos, “The Mondrian Data Engine,” in

ISCA ,2017.[44] R. C. Edgar, “MUSCLE: Multiple Sequence Alignment with High Accu-racy and High Throughput,”

Nucleic Acids Research , 2004.[45] R. C. Edgar and S. Batzoglou, “Multiple Sequence Alignment,”

COSB ,2006.[46] H. Ellegren, “Genome Sequencing and Population Genomics in Non-Model Organisms,”

Trends in Ecology & Evolution , 2014.[47] A. C. English, S. Richards, Y. Han, M. Wang, V. Vee, J. Qu, X. Qin,D. M. Muzny, J. G. Reid, K. C. Worley, and R. A. Gibbs, “Mind theGap: Upgrading Genomes with Pacific Biosciences RS Long-ReadSequencing Technology,”

PloS One , 2012.[48] N. R. Faria, E. C. Sabino, M. R. Nunes, L. C. J. Alcantara, N. J. Loman,and O. G. Pybus, “Mobile Real-Time Surveillance of Zika Virus inBrazil,”

Genome Medicine , 2016.[49] X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei, “FPGASW: AcceleratingLarge-Scale Smith–Waterman Sequence Alignment Application withBacktracking on FPGA Linear Systolic Array,”

Interdisciplinary Sciences:Computational Life Sciences , 2018.[50] L. Feuk, A. R. Carson, and S. W. Scherer, “Structural Variation in theHuman Genome,”

Nature Reviews Genetics , 2006.[51] J. W. Fickett, “Fast Optimal Alignment,”

Nucleic Acids Research , 1984.[52] C. Firtina and C. Alkan, “On Genomic Repeats and Reproducibility,”

Bioinformatics , 2016.[53] M. Flores, G. Glusman, K. Brogaard, N. D. Price, and L. Hood, “P4Medicine: How Systems Medicine Will Transform the HealthcareSector and Society,”

Personalized Medicine , 2013.[54] E. J. Fox, K. S. Reid-Bayliss, M. J. Emond, and L. A. Loeb, “Accuracy ofNext Generation Sequencing Platforms,”

Next Generation Sequencing& Applications , 2014.[55] D. Fujiki, A. Subramaniyan, T. Zhang, Y. Zeng, R. Das, D. Blaauw, andS. Narayanasamy, “GenAx: A Genome Sequencing Accelerator,” in

ISCA , 2018.[56] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scal-able and Efficient Neural Network Acceleration with 3D Memory,” in

ASPLOS , 2017.[57] E. Georganas, A. Buluc, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick,“merAligner: A Fully Parallel Sequence Aligner,” in

IPDPS , 2015.[58] S. Ghose, T. Li, N. Hajinazar, D. S. Cali, and O. Mutlu, “DemystifyingComplex Workload-DRAM Interactions: An Experimental Study,” in

SIGMETRICS , 2019.[59] G. S. Ginsburg and H. F. Willard, “Genomic and Personalized Medicine:Foundations and Applications,”

Translational Research , 2009.60] T. C. Glenn, “Field Guide to Next-Generation DNA Sequencers,”

Molec-ular Ecology Resources , 2011.[61] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of Age:Ten Years of Next-Generation Sequencing Technologies,”

Nature Re-views Genetics , 2016.[62] O. Gotoh, “An Improved Algorithm for Matching Biological Sequences,”

Journal of Molecular Biology , 1982.[63] O. Gotoh, “Alignment of Three Biological Sequences with an EfficientTraceback Procedure,”

Journal of Theoretical Biology , 1986.[64] A. L. Greninger, S. N. Naccache, S. Federman, G. Yu, P. Mbala, V. Bres,D. Stryke, J. Bouquet, S. Somasekar, J. M. Linnen, R. Dodd, P. Mulem-bakani, B. S. Schneide, J.-J. Muyembe-Tamfum, S. L. Stramer, and C. Y.Chiu, “Rapid Metagenomic Identification of Viral Pathogens in Clini-cal Samples by Real-Time Nanopore Sequencing Analysis,”

GenomeMedicine , 2015.[65] S. Gupta, M. Imani, B. Khaleghi, V. Kumar, and T. Rosing, “RAPID:A ReRAM Processing in-Memory Architecture for DNA SequenceAlignment,” in

ISLPED , 2019.[66] T. J. Ham, D. Bruns-Smith, B. Sweeney, Y. Lee, S. H. Seo, U. G. Song,Y. H. Oh, K. Asanovic, J. W. Lee, and L. W. Wills, “Genesis: A HardwareAcceleration Framework for Genomic Data Analysis,” in

ISCA , 2020.[67] W. Haque, A. Aravind, and B. Reddy, “Pairwise Sequence AlignmentAlgorithms: A Survey,” in

ISTA , 2009.[68] J. Harcourt, A. Tamin, X. Lu, S. Kamili, S. K. Sakthivel, L. Wang, J. Mur-ray, K. Queen, B. Lynch, B. Whitaker, B. Lynch, R. Gautam, C. Schinde-wolf, K. G. Lokugamage, D. Scharton, J. A. Plante et al. , “Isolation andCharacterization of SARS-CoV-2 from the First US COVID-19 Patient,”bioRxiv 2020.03.02.972935, 2020.[69] D. G. Higgins and P. M. Sharp, “CLUSTAL: A Package for PerformingMultiple Sequence Alignment on a Microcomputer,”

Gene , 1988.[70] M. Höhl, S. Kurtz, and E. Ohlebusch, “Efficient Multiple Genome Align-ment,”

Bioinformatics , 2002.[71] M. Holtgrewe, “Mason–A Read Simulator for Second Generation Se-quencing Data,” Free Univ. of Berlin, Dept. of Mathematics and Com-puter Sci., Tech. Rep. TR-B-10-06, 2010.[72] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijayku-mar, O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping(TOM): Enabling Programmer-Transparent Near-Data Processing inGPU Systems,” in

ISCA , 2016.[73] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose,and O. Mutlu, “Accelerating Pointer Chasing in 3D-Stacked Memory:Challenges, Mechanisms, Evaluation,” in

ICCD , 2016.[74] W. Huangfu, S. Li, X. Hu, and Y. Xie, “RADAR: A 3D-ReRAM basedDNA Alignment Accelerator Architecture,” in

DAC , 2018.[75] W. Huangfu, X. Li, S. Li, X. Hu, P. Gu, and Y. Xie, “MEDAL: ScalableDIMM Based Near Data Processing Accelerator for DNA SeedingAlgorithm,” in

MICRO et al. ,“MinION Analysis and Reference Consortium: Phase 1 Data Releaseand Analysis,”

F1000Research , 2015.[83] M. Jain, S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R.Tyson, A. D. Beggs, A. T. Dilthey, I. T. Fiddes, S. Malla, H. Marriott,T. Nieto, J. O’Grady, H. E. Olsen, B. S. Pedersen et al. , “NanoporeSequencing and Assembly of a Human Genome with Ultra-Long Reads,”

Nature Biotechnology , 2018.[84] M. Jain, J. R. Tyson, M. Loose, C. L. Ip, D. A. Eccles, J. O’Grady, S. Malla,R. M. Leggett, O. Wallerman, H. J. Jansen, V. Zulunin, E. Birney, B. L.Brown, T. P. Snutch, H. E. Olsen, and MinION Analysis ReferenceConsortium, “MinION Analysis and Reference Consortium: Phase 2Data Release and Analysis of R9.0 Chemistry,”

F1000Research , 2017.[85] P. James, D. Stoddart, E. D. Harrington, J. Beaulaurier, L. Ly, S. Reid, D. J.Turner, and S. Juul, “LamPORE: Rapid, Accurate and Highly ScalableMolecular Screening for SARS-CoV-2 Infection, Based on NanoporeSequencing,” medRxiv 2020.08.07.20161737, 2020.[86] JEDEC Solid State Technology Assn., “JESD235C: High BandwidthMemory (HBM) DRAM,” January 2020.[87] X. Jiang, X. Liu, L. Xu, P. Zhang, and N. Sun, “A Reconfigurable Accel-erator for Smith–Waterman Algorithm,”

TCAS-II , 2007. [88] R. Kaplan, L. Yavits, and R. Ginosar, “BioSEAL: In-Memory BiologicalSequence Alignment Accelerator for Large-Scale Genomic Data,” in

PACT , 2019.[89] J. J. Kasianowicz, E. Brandin, D. Branton, and D. W. Deamer, “Charac-terization of Individual Polynucleotide Molecules using a MembraneChannel,”

PNAS , 1996.[90] W. J. Kent, “BLAT—The BLAST-Like Alignment Tool,”

Genome Research ,2002.[91] J. S. Kim, D. Senol Cali, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan,O. Ergin, C. Alkan, and O. Mutlu, “GRIM-Filter: Fast Seed LocationFiltering in DNA Read Mapping using Processing-in-Memory Tech-nologies,”

BMC Genomics , 2018.[92] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and ExtensibleDRAM Simulator,”

IEEE CAL , 2016.[93] H. T. Kung, “Why Systolic Architectures?”

IEEE Computer , 1982.[94] H. T. Kung and C. E. Leiserson, “Systolic Arrays (for VLSI),” in

SparseMatrix Proceedings , 1978.[95] S. Kurtz, A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. An-tonescu, and S. L. Salzberg, “Versatile and Open Software for Compar-ing Large Genomes,”

Genome Biology , 2004.[96] B. Langmead and S. L. Salzberg, “Fast Gapped-Read Alignment withBowtie 2,”

Nature Methods , 2012.[97] T. Laver, J. Harrison, P. O’neill, K. Moore, A. Farbos, K. Paszkiewicz, andD. J. Studholme, “Assessing the Performance of the Oxford NanoporeTechnologies MinION,”

Biomolecular Detection and Quantification ,2015.[98] C. Lee, C. Grasso, and M. F. Sharlow, “Multiple Sequence Alignmentusing Partial Order Graphs,”

Bioinformatics , 2002.[99] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “SimultaneousMulti-Layer Access: Improving 3D-Stacked Memory Bandwidth atLow Cost,”

TACO , 2016.[100] V. I. Levenshtein, “Binary Codes Capable of Correcting Deletions,Insertions, and Reversals,” in

Soviet Physics Doklady , 1966.[101] H. Li, “Aligning Sequence Reads, Clone Sequences and Assembly Con-tigs with BWA-MEM,” arXiv:1303.3997 [q-bio.GN], 2013.[102] H. Li, “Minimap2: Pairwise Alignment for Nucleotide Sequences,”

Bioinformatics , 2018.[103] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer,G. Marth, G. Abecasis, and R. Durbin, “The Sequence Alignment/MapFormat and SAMtools,”

Bioinformatics , 2009.[104] H. Li, B. Ni, M.-H. Wong, and K.-S. Leung, “A Fast CUDA Imple-mentation of Agrep Algorithm for Approximate Nucleotide SequenceMatching,” in

SASP , 2011.[105] R. Li, C. Yu, Y. Li, T.-W. Lam, S.-M. Yiu, K. Kristiansen, and J. Wang,“SOAP2: An Improved Ultrafast Tool for Short Read Alignment,”

Bioin-formatics , 2009.[106] H.-N. Lin and W.-L. Hsu, “GSAlign: An Efficient Sequence AlignmentTool for Intra-Species Genomes,”

BMC Genomics , 2020.[107] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, “A Tool for MultipleSequence Alignment,”

PNAS , 1989.[108] D. J. Lipman and W. R. Pearson, “Rapid and Sensitive Protein SimilaritySearches,”

Science , 1985.[109] Y. Liu and B. Schmidt, “GSWABE: Faster GPU-Accelerated SequenceAlignment with Optimal Alignment Retrieval for Short DNA Se-quences,”

Concurrency Computation , 2015.[110] Y. Liu, A. Wirawan, and B. Schmidt, “CUDASW++ 3.0: AcceleratingSmith–Waterman Protein Database Search by Coupling CPU and GPUSIMD Instructions,”

BMC Bioinformatics , 2013.[111] G. A. Logsdon, M. R. Vollger, and E. E. Eichler, “Long-Read HumanGenome Sequencing and Its Applications,”

Nature Reviews Genetics ,2020.[112] H. Lu, F. Giordano, and Z. Ning, “Oxford Nanopore MinION Sequenc-ing and Genome Assembly,”

Genomics, Proteomics & Bioinformatics ,2016.[113] A. Magi, R. Semeraro, A. Mingrino, B. Giusti, and R. D’Aurizio,“Nanopore Sequencing Data Analysis: State of the Art, Applicationsand Challenges,”

Briefings in Bioinformatics , 2017.[114] T. Mantere, S. Kersten, and A. Hoischen, “Long-Read SequencingEmerging in Medical Genetics,”

Frontiers in Genetics , 2019.[115] G. Marçais, A. L. Delcher, A. M. Phillippy, R. Coston, S. L. Salzberg,and A. Zimin, “MUMmer4: A Fast and Versatile Genome AlignmentSystem,”

PLoS Computational Biology , 2018.[116] V. Marx, “Nanopores: A Sequencer in Your Backpack,”

Nature Methods ,2015.[117] W. Miller and E. W. Myers, “Sequence Comparison with ConcaveWeighting Functions,”

Bulletin of Mathematical Biology , 1988.[118] O. Mutlu, “Accelerating Genome Analysis: A Primer on an OngoingJourney,”

Keynote Talk at AACBB , 2019.[119] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, “Process-ing Data Where It Makes Sense: Enabling In-Memory Computation,”

MICPRO , 2019.[120] E. W. Myers and W. Miller, “Optimal Alignments in Linear Space,”

Bioinformatics , 1988.[121] G. Myers, “A Fast Bit-Vector Algorithm for Approximate String Match-ing Based on Dynamic Programming,”

Journal of the ACM , 1999.122] A. Nag, C. Ramachandra, R. Balasubramonian, R. Stutsman, E. Gia-comin, H. Kambalasubramanyam, and P.-E. Gaillardon, “GenCache:Leveraging In-Cache Operators for Efficient Sequence Alignment,” in

MICRO , 2019.[123] K. Nakano, A. Shiroma, M. Shimoji, H. Tamotsu, N. Ashimine, S. Ohki,M. Shinzato, M. Minami, T. Nakanishi, K. Teruya, K. Satou, and T. Hi-rano, “Advantages of Genome Sequencing by Long-Read Sequencerusing SMRT Technology in Medical Area,”

Human Cell

CSUR ,2001.[126] S. B. Needleman and C. D. Wunsch, “A General Method Applicableto the Search for Similarities in the Amino Acid Sequence of TwoProteins,”

Journal of Molecular Biology , 1970.[127] C. Notredame, “Recent Progress in Multiple Sequence Alignment: ASurvey,”

Pharmacogenomics , 2002.[128] C. Notredame, D. G. Higgins, and J. Heringa, “T-Coffee: A NovelMethod for Fast and Accurate Multiple Sequence Alignment,”

JMB

Bioinformatics

Genome Research , 2011.[137] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir,O. Mutlu, and C. R. Das, “Scheduling Techniques for GPU Architectureswith Processing-In-Memory Capabilities,” in

PACT , 2016.[138] P. A. Pevzner, H. Tang, and M. S. Waterman, “An Eulerian Path Ap-proach to DNA Fragment Assembly,”

PNAS , 2001.[139] J. Prado-Martinez, P. H. Sudmant, J. M. Kidd, H. Li, J. L. Kelley,B. Lorente-Galdos, K. R. Veeramah, A. E. Woerner, T. D. O’Connor,G. Santpere, A. Cagan, C. Theunert, F. Casals, H. Laayouni, K. Munch,A. Hobolth et al. , “Great Ape Genetic Diversity and Population History,”

Nature , 2013.[140] A. Prohaska, F. Racimo, A. J. Schork, M. Sikora, A. J. Stern, M. Ilardo,M. E. Allentoft, L. Folkersen, A. Buil, J. V. Moreno-Mayar, T. Ko-rneliussen, D. Geschwind, A. Ingason, T. Werge, R. Nielsen, andE. Willerslev, “Human Disease Variation in the Light of PopulationGenomics,”

Cell , 2019.[141] M. A. Quail, M. Smith, P. Coupland, T. D. Otto, S. R. Harris, T. R.Connor, A. Bertoni, H. P. Swerdlow, and Y. Gu, “A Tale of Three NextGeneration Sequencing Platforms: Comparison of Ion Torrent, PacificBiosciences and Illumina MiSeq Sequencers,”

BMC Genomics , 2012.[142] J. Quick, N. J. Loman, S. Duraffour, J. T. Simpson, E. Severi, L. Cow-ley, J. A. Bore, R. Koundouno, G. Dudas, A. Mikhail, N. Ouédraogo,B. Afrough, A. Bah, J. H. J. Baum, B. Becker-Ziaja, J. P. Boettcher et al. , “Real-Time, Portable Genome Sequencing for Ebola Surveillance,”

Nature , 2016.[143] J. Quick, A. R. Quinlan, and N. J. Loman, “A Reference BacterialGenome Dataset Generated on the MinION Portable Single-MoleculeNanopore Sequencer,”

Gigascience , 2014.[144] J. A. Reuter, D. V. Spacek, and M. P. Snyder, “High-Throughput Se-quencing Technologies,”

Molecular Cell , 2015.[145] A. Rhoads and K. F. Au, “PacBio Sequencing and Its Applications,”

Genomics, Proteomics & Bioinformatics , 2015.[146] R. J. Roberts, M. O. Carneiro, and M. C. Schatz, “The Advantages ofSMRT Sequencing,”

Genome Biology , 2013.[147] E. Rucci, C. Garcia, G. Botella, A. De Giusti, M. Naiouf, and M. Prieto-Matias, “SWIFOLD: Smith–Waterman Implementation on FPGA withOpenCL for Long DNA Sequences,”

BMC Systems Biology , 2018.[148] SAFARI Research Group, “GenASM — GitHub Repository.” https://github.com/CMU-SAFARI/GenASM[149] SAFARI Research Group, “Shouji — GitHub Repository.” https://github.com/CMU-SAFARI/Shouji[150] D. Sankoff, “Minimal Mutation Trees of Sequences,”

SIAP , 1975.[151] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison,D. Haussler, and W. Miller, “Human–Mouse Alignments with BLASTZ,”

Genome Research , 2003.[152] D. Senol Cali, J. S. Kim, S. Ghose, C. Alkan, and O. Mutlu, “NanoporeSequencing Technology and Tools for Genome Assembly: Computa-tional Analysis of the Current State, Bottlenecks and Future Directions,”

Briefings in Bioinformatics , 2018. [153] J. Shendure, S. Balasubramanian, G. M. Church, W. Gilbert, J. Rogers,J. A. Schloss, and R. H. Waterston, “DNA Sequencing at 40: Past,Present and Future,”

Nature , 2017.[154] T. F. Smith and M. S. Waterman, “Identification of Common MolecularSubsequences,”

Journal of Molecular Biology , 1981.[155] M. Šošić and M. Šikić, “Edlib: A C/C++ Library for Fast, Exact SequenceAlignment Using Edit Distance,”

Bioinformatics

Nucleic Acids Research , 1994.[158] C. Trapnell and S. L. Salzberg, “How to Map Billions of Short Readsonto Genomes,”

Nature Biotechnology , 2009.[159] T. J. Treangen and S. L. Salzberg, “Repetitive DNA and Next-GenerationSequencing: Computational Challenges and Solutions,”

Nature ReviewsGenetics , 2011.[160] Y. Turakhia, S. D. Goenka, G. Bejerano, and W. J. Dally, “Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole GenomeAlignments with High Speedup,” in

HPCA , 2019.[161] Y. Turakhia, “Darwin — GitHub Repository.” https://github.com/yatisht/darwin[162] Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A Genomics Co-Processor Provides up to 15,000x Acceleration on Long Read Assembly,”in

ASPLOS , 2018.[163] E. Ukkonen, “Algorithms for Approximate String Matching,”

Informa-tion and Control , 1985.[164] E. L. van Dijk, H. Auger, Y. Jaszczyszyn, and C. Thermes, “Ten Yearsof Next-Generation Sequencing Technology,”

Trends in Genetics , 2014.[165] E. L. van Dijk, Y. Jaszczyszyn, D. Naquin, and C. Thermes, “The ThirdRevolution in Sequencing Technology,”

Trends in Genetics , 2018.[166] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, R. J. Mural, G. G.Sutton, H. O. Smith, M. Yandell, C. A. Evans, R. A. Holt, J. D. Gocayne,P. Amanatides, R. M. Ballew, D. H. Huson, J. R. Wortman, Q. Zhang et al. , “The Sequence of the Human Genome,”

Science , 2001.[167] J. Wang, N. E. Moore, Y.-M. Deng, D. A. Eccles, and R. J. Hall, “MinIONNanopore Sequencing of an Influenza Genome,”

Frontiers in Microbiol-ogy , 2015.[168] M. S. Waterman, “Efficient Sequence Alignment Algorithms,”

Journalof Theoretical Biology , 1984.[169] M. S. Waterman, T. F. Smith, and W. A. Beyer, “Some Biological Se-quence Metrics,”

Advances in Mathematics , 1976.[170] J. L. Weirather, M. de Cesare, Y. Wang, P. Piazza, V. Sebastiano, X.-J.Wang, D. Buck, and K. F. Au, “Comprehensive Comparison of PacificBiosciences and Oxford Nanopore Technologies and Their Applica-tions to Transcriptome Analysis,”

F1000Research , 2017.[171] A. M. Wenger, P. Peluso, W. J. Rowell, P.-C. Chang, R. J. Hall, G. T.Concepcion, J. Ebler, A. Fungtammasan, A. Kolesnikov, N. D. Olson,A. Töpfer, M. Alonge, M. Mahmoud, Y. Qian, C.-S. Chin, A. M. Phillippy et al. , “Accurate Circular Consensus Long-Read Sequencing ImprovesVariant Detection and Assembly of a Human Genome,”

Nature Biotech-nology , 2019.[172] S. J. Wilton and N. P. Jouppi, “CACTI: An Enhanced Cache Access andCycle Time Model,”

JSSC , 1996.[173] F. Wu, S. Zhao, B. Yu, Y.-M. Chen, W. Wang, Z.-G. Song, Y. Hu, Z.-W.Tao, J.-H. Tian, Y.-Y. Pei, M.-L. Yuan, Y.-L. Zhang, F.-H. Dai, Y. Liu,Q.-M. Wang, J.-J. Zheng et al. , “A New Coronavirus Associated withHuman Respiratory Disease in China,”

Nature , 2020.[174] S. Wu and U. Manber, “Fast Text Searching Allowing Errors,”

CACM

Bioinformatics , 2015.[177] H. Xin, D. Lee, F. Hormozdiari, S. Yedkar, O. Mutlu, and C. Alkan,“Accelerating Read Mapping with FastHASH,”

BMC Genomics , 2013.[178] H. Xin, S. Nahar, R. Zhu, J. Emmons, G. Pekhimenko, C. Kingsford,C. Alkan, and O. Mutlu, “Optimal Seed Solver: Optimizing Seed Selec-tion in Read Mapping,”

Bioinformatics , 2016.[179] Y. Yang, B. Xie, and J. Yan, “Application of Next-Generation SequencingTechnology in Forensic Science,”

Genomics, Proteomics & Bioinformat-ics , 2014.[180] F. Zokaee, H. R. Zarandi, and L. Jiang, “AligneR: A Process-in-MemoryArchitecture for Short Read Alignment in ReRAMs,”

IEEE CAL , 2018.[181] F. Zokaee, M. Zhang, and L. Jiang, “FindeR: Accelerating FM-Index-Based Exact Pattern Matching in Genomic Sequences Through ReRAMTechnology,” in

PACT , 2019.[182] Q. Zou, Q. Hu, M. Guo, and G. Wang, “HAlign: Fast Multiple SimilarDNA/RNA Sequence Alignment Based on the Centre Star Strategy,”