EXMA: A Genomics Accelerator for Exact-Matching
Lei Jiang
Indiana University Bloomington [email protected]
Farzaneh Zokaee
Indiana University Bloomington [email protected]
Abstract—Genomics is the foundation of precision medicine, global food security and virus surveillance. Exact-match is one of the most essential operations widely used in almost every step of genomics such as alignment, assembly, annotation, and compression. Modern genomics adopts the Ferragina-Manzini Index (FM-Index), which augments the space-efficient Burrows-Wheeler transform (BWT) with additional data structures to permit ultra-fast exact-match operations. However, FM-Index is notorious for its poor spatial locality and random memory access pattern. Prior works create GPU-, FPGA-, ASIC- and even process-in-memory (PIM)-based accelerators to boost FM-Index search throughput. Though they achieve the state-of-the-art FM-Index search throughput, the same as all prior conventional accelerators, FM-Index PIMs process only one DNA symbol after each DRAM row activation, thereby suffering from poor memory bandwidth utilization. In this paper, we propose a hardware accelerator, EXMA, to enhance FM-Index search throughput. We first create a novel EXMA table with a multi-task-learning (MTL)-based index to process multiple DNA symbols with each DRAM row activation. We then build an accelerator to search over an EXMA table. We propose 2-stage scheduling to increase the cache hit rate of our accelerator. We introduce a dynamic page policy to improve the row buffer hit rate of DRAM main memory. We also present CHAIN compression to reduce the data structure size of EXMA tables. Compared to state-of-the-art FM-Index PIMs, EXMA improves search throughput by . ×, and enhances search throughput per Watt by . ×.

Index Terms—Domain-Specific Hardware Accelerator, Genomics, Exact-Matching
I. INTRODUCTION
Because of the huge advancement of sequencing technologies such as Illumina [1], PacBio SMRT [2], and Oxford Nanopore [3], sequencing an entire human genome requires only < 1 day. Big genomic data has been a cornerstone of enabling personalized healthcare [4] and ensuring global food security [5]. Recently, genome sequencing has become a powerful tool to fight virus outbreaks, e.g., Ebola [6], Zika [7] and COVID-19 [8]. However, it is challenging to process and analyze the huge volumes of genomic data generated by high-throughput sequencers that scale faster than Moore's Law [9]. For instance, thousands of USB-drive-size Oxford Nanopore MinION sequencers are deployed to monitor virus outbreaks [6]–[8] in the wild, generating several terabytes of data per day. Analyzing a single genome may take hundreds of CPU hours [10], [11] even on high-end servers. To overcome the looming crisis of big genomic data, application-specific hardware acceleration has become essential for genomics. (This work was partially supported by the National Science Foundation (NSF) through awards CCF-1908992 and CCF-1909509.)
Fig. 1. Execution time breakdown of human genome analysis (DynPro means dynamic programming; bars for Illumina, PacBio and Nanopore show FM-Index, DynPro and Other time).
A genome sequencing pipeline [4] sequences organic genomes, archives genomic data, analyzes genome sequences, and generates genetic variants that can be used for patient treatment. Therefore, the latency of genome sequencing is a matter of life and death.
Read alignment [12], which aligns reads, i.e., small DNA fragments, against a long genome reference, is identified as one of the most time-consuming steps [10], [11], [13]–[15] in genome analysis. Read alignment adopts the seed-and-extend paradigm [12], [16], and thus includes two major stages, i.e., seeding and seed extension. During seeding, parts of each read are mapped to their exactly matched positions, i.e., seeds, of the long reference by hash tables [10], [11], [17] or the Ferragina-Manzini Index (FM-Index) [12], [16]. Seed extension pieces together a larger sequence with seeds and edit-distance errors, i.e., insertions, deletions (indels) and substitutions, by dynamic programming [18]–[20]. State-of-the-art read alignment applications such as BWA-MEM [12], MA [16] and SOAP [21] use FM-Index to build super-maximal exact matches (SMEMs) during seeding, since it augments the space-efficient Burrows-Wheeler transform (BWT) [22] with accessory data structures that permit ultra-fast exact-match operations. SMEMs generated by FM-Index guarantee that each seed does not overlap other seeds and has the maximal length that cannot be further extended. Compared to hash tables, FM-Index reduces not only the number of errors in output genome mappings but also the duration of seed extension substantially [23]. Besides read alignment, FM-Index is widely used for exact-match operations in other time-consuming steps of genome analysis such as genome assembly [24], annotation [25] and compression [26]. Figure 1 shows the execution time breakdown of various genome analysis applications on a human genome. On average, FM-Index searches cost %∼% of the execution time of these genome analysis applications (the experimental methodology is elaborated in §V). Since Illumina, Nanopore and PacBio genome sequencers generate reads having different lengths and error rates, aligning and assembling these reads require different amounts of time for FM-Index searches.
The reads produced by Illumina machines have lower error rates, so an Illumina dataset invokes FM-Index searches more frequently. However, FM-Index is notorious for its poor spatial locality and random memory access pattern [27]. The kernel of a conventional FM-Index search is pointer chasing. After activating one DRAM row, FM-Index processes only one DNA symbol, thereby greatly decreasing DRAM bandwidth utilization. Although a recent algorithmic work, LISA [28], uses a learned index [29] to search multiple DNA symbols after each row activation, the learned index accuracy is low. LISA has to search many unnecessary entries, and thus achieves only moderate search throughput improvement. Beyond CPUs [12] and GPUs [21], prior work creates FPGA [13], [30]-, ASIC [31]-, and even processing-in-memory (PIM) [14], [15]-based designs to accelerate conventional FM-Index searches that process only one DNA symbol after each row activation. Therefore, these accelerators are fundamentally limited by poor off-chip memory bandwidth utilization. Instead of searching multiple DNA symbols after each row activation, a recent DRAM PIM, MEDAL [15], achieves the state-of-the-art search throughput by enabling DRAM chip-level parallelism, where each chip can independently activate a partial row to process a DNA symbol. However, we observe that there are many conflicts on the DDR4 address bus shared by all chips in a rank. The shared address bus seriously limits the search throughput of MEDAL. In this paper, we propose an algorithm and hardware co-designed accelerator,
EXMA, to process EXact-MAtch operations during genome analysis. Our contributions are summarized as follows:
• An EXMA table with a MTL-based index – We propose a novel data structure, the EXMA table, that can process k DNA symbols, i.e., a k-mer, in a DRAM row in each FM-Index search iteration. We further present a multi-task-learning (MTL)-based index to accelerate searches over an EXMA table. The MTL-based index trained with multiple k-mers uses fewer neural network parameters, but obtains higher accuracy than learning to search each k-mer independently.
• A hardware accelerator – We build an accelerator to search an EXMA table with a MTL-based index. We present 2-stage scheduling to increase the hit rate of the on-chip caches of our accelerator for the table and its index. We also propose a dynamic page policy to improve the row buffer hit rate of DRAM main memory. At last, we introduce CHAIN compression to greatly reduce the data structure size of an EXMA table.
• Search throughput and throughput per Watt – We evaluate and compare EXMA to prior CPU-, GPU-, FPGA-, ASIC-, and PIM-based FM-Index accelerators. Compared to the state-of-the-art DRAM PIM MEDAL, EXMA improves search throughput by . ×, and enhances search throughput per Watt by . ×.

Fig. 2. Read alignment: (a) read alignment; (b) seed-and-extend.

II. BACKGROUND
A. Read Alignment
Seed-and-Extend: As one of the bottlenecks in genome analysis, read alignment may consume hundreds of CPU hours [10], [11], [13]. During read alignment, DNA reads generated by various sequencing machines, e.g., Illumina, PacBio SMRT, and Oxford Nanopore, are mapped to a pre-existing genome reference, as shown in Figure 2(a). Read alignment is complicated by the fact that there are genetic variations in the human population, and sequencing machines also introduce sequencing errors [32]. The overall variation of the human population has been estimated as . [9], while the sequencing error rate of various sequencing machines is . ∼ [32]. To reduce sequencing errors, a sequencing machine produces ∼ reads to cover every position in the genome. As Figure 2(b) shows, read alignment adopts the seed-and-extend paradigm [12], [16], [24], [33]–[35] to accommodate sequencing errors and genetic variations. During seeding, a read is divided into multiple smaller parts that are aligned against the reference. If a part is exactly matched, it becomes a seed. The computationally expensive Smith-Waterman algorithm [10], [11], [13] is invoked only around seeds to handle sequencing errors and genetic variations.

Exact-Match Operation: The alphabet P of DNA includes A, C, G and T. Given a genome reference G ∈ P* of length |G| and a query Q ∈ P* of length |Q|, seeding, aka the exact-match problem, is to find all occurrences of Q in G. A naïve algorithm exhausting all possible positions for Q takes O(|G||Q|) comparisons, which is infeasible for large genomes. It is possible to use a hash table [10], [11], [17] to support exact-match operations with hundreds of gigabytes of DRAM, but hash-table-based seeding not only degrades genome mapping quality but also prolongs seed extension durations [23]. State-of-the-art alignment algorithms [12], [16], [21] use FM-Index for seeding.
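The quadratic cost of the naïve algorithm can be seen in a short sketch (Python; the function name is ours, for illustration only):

```python
def naive_exact_match(G: str, Q: str):
    """Find all occurrences of query Q in reference G by brute force.
    This takes O(|G||Q|) comparisons, which is infeasible at genome scale."""
    return [i for i in range(len(G) - len(Q) + 1) if G[i:i + len(Q)] == Q]

# On the running example reference CATAGA:
# naive_exact_match("CATAGA", "TAG") == [2]
```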
To search a query Q over the reference genome G, FM-Index occupies O(|G| log(|G|)) DRAM space and does O(|Q|) comparisons during a search.

B. FM-Index

1) Burrows-Wheeler Transform
FM-Index is built upon the BWT [22]. To compute the BWT of a genome reference G, we can list all its circularly shifted sequences. For instance, Figure 3(a) shows G = CATAGA$, where $ indicates the end of the sequence and is the lexicographically smallest symbol. The circularly shifted sequences of G = CATAGA$ can be listed as $CATAGA, A$CATAG, ..., and CATAGA$, which can be sorted in lexicographical order to form a Burrows-Wheeler (BW)-matrix. The last column of the BW-matrix is the BWT of G.
Here BWT(G) = AGTC$AA. The sub-sequence ending with $ in each row of the BW-matrix is a suffix of G, which can be denoted by an integer (SA in Figure 3(a)) recording its starting position in the reference. For example, ATAGA$ is SA[3] = 1, which means it starts from position 1 of G.

Fig. 3. FM-Index overview: (a) the BWT of a genome reference G = CATAGA$; (b) an Occ(s, i) table; (c) a Count(s) table; (d) backward search; (e) a search example; and (f) a bucket-based data structure. The backward-search listing of Figure 3(d) is:

// BWT: the genome reference BWT; Q: the query; |Q|: the query length
0: low = 0; high = MAX_OCC;
1: for (i = |Q|-1; i >= 0; i--) {
2:   low  = Count(Q[i]) + Occ(Q[i], low);
3:   high = Count(Q[i]) + Occ(Q[i], high);
4:   if (low >= high) return;
5: }
6: for (i = low; i < high; i++) {
7:   final = SA[i]; // reference positions
8: }
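The BWT construction above can be reproduced in a few lines (a Python sketch; sorting all rotations is only for illustration, as production tools build the suffix array directly):

```python
def bwt(ref: str):
    """Build the BWT and suffix array of ref, which must end with '$'.
    '$' (ASCII 36) sorts before 'A', matching the paper's convention."""
    n = len(ref)
    order = sorted(range(n), key=lambda i: ref[i:] + ref[:i])  # sorted rotations
    sa = order                                  # rotation start = suffix start
    last_col = "".join(ref[(i - 1) % n] for i in order)  # last BW-matrix column
    return last_col, sa

# The running example of Figure 3(a):
b, sa = bwt("CATAGA$")
# b == "AGTC$AA" and sa == [6, 5, 3, 1, 0, 4, 2]
```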
2) FM-Index
The data structure and search algorithm of FM-Index can be summarized as:
• Occ and Count. FM-Index searches are implemented with two functions, Occ(s, i) and Count(s), over the BWT of G. As Figure 3(b) exhibits, Occ(s, i) returns the number of occurrences of symbol s in the BWT from position 0 to position i − 1, e.g., Occ(C, 5) = 1, which means that there is only 1 C from position 0 to position 4 of BWT(G) = AGTC$AA. Count(s), shown in Figure 3(c), computes the number of symbols in the BWT that are lexicographically smaller than the symbol s, e.g., Count(T) = 6, which indicates that there are 6 symbols in BWT(G) = AGTC$AA lexicographically smaller than T.
• Backward Search. An exact-match operation is implemented by backward search, whose algorithm can be viewed in Figure 3(d). The interval (low, high) covers a range of indices in the BW-matrix where the suffixes have the same prefix. The pointer low locates the index in the BW-matrix where the pattern is first found as a prefix, while the pointer high provides the index after the one where the pattern is last found. At first, low and high are initialized to the minimum and maximum indexes of the Occ table, respectively. Then, the search iterates from the last symbol in a query Q to the first. The pointer pos is updated by

Count(Q[i]) + Occ(Q[i], pos)    (1)

where Q[i] indicates the i-th symbol in the query Q. The pointer pos can be low or high, as shown in lines 2 to 3 in Figure 3(d). The computations of low and high are pointer chasing and thus suffer from poor spatial locality [14], [15], [30]. Finally, the interval (low, high) gives the range of indexes in the BW-matrix where the suffixes have the target query as a prefix. These indexes are converted to reference genome positions using the SA. Figure 3(e) illustrates an example of searching a query TAG in the reference G = CATAGA$. Before a search happens, (low, high) is initialized to (0, 7). In iteration 0, the last symbol G is processed, and (low, high) is updated to (5, 6). After three iterations, (low, high) eventually equals (6, 7). By looking up SA[6] = 2 in Figure 3(a), we find that the query TAG in reference G = CATAGA$ starts from position 2.
• Bucket-based Storage. Both
Count(s) and Occ(s, i) can be pre-calculated and stored. However, the storage overhead of Occ(s, i) is proportional to the genome reference length |G|, and thus significant. To keep the storage overhead in check, the Occ(s, i) values are sampled into buckets of width d, shown in Figure 3(f) (d = 4). The Occ(s, i) values are stored every d positions as markers to reduce the storage overhead by a factor of d. The omitted Occ(s, i) values can be reconstructed by summing the previous marker and the number of occurrences of symbol s in the remaining positions of the BWT bucket. To simplify searches, the Count(s) value of each symbol is added to the corresponding markers. Markers and BWT buckets are interleaved to build an FM-Index.
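The listing in Figure 3(d) can be prototyped directly (a Python sketch, without bucketing; Occ here is computed by scanning the BWT, which a real implementation replaces with the sampled markers):

```python
def fm_backward_search(bwt_str: str, sa, query: str):
    """1-step FM-Index backward search; returns reference positions of query."""
    count = {s: sum(c < s for c in bwt_str) for s in set(bwt_str)}  # Count(s)
    occ = lambda s, i: bwt_str[:i].count(s)                          # Occ(s, i)
    low, high = 0, len(bwt_str)          # line 0: low = 0; high = MAX_OCC
    for s in reversed(query):            # lines 1-5: iterate from the last symbol
        low = count[s] + occ(s, low)
        high = count[s] + occ(s, high)
        if low >= high:
            return []                    # query absent from the reference
    return [sa[i] for i in range(low, high)]  # lines 6-8: SA lookup

# Searching TAG in G = CATAGA$ (Figure 3(e)):
# fm_backward_search("AGTC$AA", [6, 5, 3, 1, 0, 4, 2], "TAG") == [2]
```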
3) Multi-step FM-Index

Fig. 4. A 2-step table.
During an iteration of the FM-Index backward search, two memory accesses, for low and high, are issued for each symbol in a query Q. In total, 2|Q| memory accesses are required for an exact-match operation on a query. The FM-Index backward search performance is seriously limited by random memory accesses [14], [15], [30], since each access opens a DRAM row but fetches only a few bytes. k-step FM-Index [36] is proposed to reduce the number of memory accesses to 2|Q|/k by updating a k-mer, i.e., k DNA symbols, in each search iteration. The idea of k-step FM-Index is to enlarge the alphabet size from |Σ| to |Σ|^k. For instance, if k = 2, instead of single DNA symbols, as Figure 4 shows, the enlarged alphabet includes 16 2-mers: AA, AC, ..., TT. We can construct a BWT with the enlarged alphabet and its corresponding FM-Index to perform k-step backward search in the same way. The trade-off for k-step FM-Index is the increase in its size, which is calculated as

F = ⌈log(|G|)⌉ · |G| · |Σ|^k / d + |G| · ⌈log(|Σ|^k + 1)⌉    (2)

where k is the number of DNA symbols updated in each search iteration. The size of multi-step FM-Index increases exponentially with an enlarging k.

Fig. 5. LISA: (a) an IP-BWT array; (b) a search example; (c) a learned index.
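Equation (2) can be evaluated to see this exponential growth (a Python sketch; the bucket width d = 64 is our assumption rather than a value from the paper, and log is taken base 2):

```python
from math import ceil, log2

def kstep_fm_index_bits(G: int, k: int, sigma: int = 4, d: int = 64) -> int:
    """Size in bits of a k-step FM-Index per Equation (2):
    marker storage plus BWT storage over the enlarged alphabet of size sigma**k."""
    markers = ceil(log2(G)) * G * sigma**k // d
    bwt = G * ceil(log2(sigma**k + 1))
    return markers + bwt

# For a 3G-base genome, each +1 in k roughly quadruples the marker term,
# matching the exponential curve the paper reports for Figure 6(b).
```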
4) Learned Indexes for Sequence Analysis
To support multi-step searches with smaller DRAM overhead, a recent work proposes Learned Indexes for Sequence Analysis (LISA) [28], consisting of an Index-Paired BWT (IP-BWT) array and a learned index.
• IP-BWT. In Figure 5(a), each entry of the IP-BWT is a pair of [k-mer, N], where the k-mer is the first k symbols of the corresponding BW-matrix row, and N is the row number of the sequence with the first k and the last |G| − k symbols swapped in the BW-matrix. For example, if k = 2, the k-mer of row 0 of the IP-BWT can be derived from row 0 of the BW-matrix, $CATAGA, shown in Figure 3(a), using only the first 2 symbols $C. By swapping the first 2 symbols and the other 5 symbols of $CATAGA, we can have ATAGA$C, which is row 3 of the BW-matrix. So row 0 of the IP-BWT is [$C, 3].
• Backward Search. The backward search of LISA finds the lower-bound position of a [k-mer, N] pair in the IP-BWT. Since the IP-BWT is sorted, LISA adopts binary search for backward searches. As Figure 5(b) shows, to search the query TAG, we first break it into TA and G, since each iteration can process a 2-mer. In the first iteration, we start with G. The padding algorithm [28] of LISA converts G to G$ for low and GT for high. low and high are initialized to 0 and 6, respectively. ❶ To search [G$, 0], a binary search is performed over the IP-BWT. ❷ During the binary search, G$ is first compared against AT, i.e., row 3 of the IP-BWT. ❸ Because G$ > AT, the binary search goes to row 5, i.e., GA, of the IP-BWT. ❹ Finally, it ends with row 4 of the IP-BWT, i.e., CA. ❺ Since G$ > CA, the new low is calculated as 4 + 1 = 5. high can be computed in the same way. Each search iteration requires log(|G|) comparisons due to binary search.
• Learned Index. To reduce the number of comparisons during binary searches, LISA adopts a learned index [29], i.e., a model hierarchy consisting of multiple neural network models, as shown in Figure 5(c), where m_i is neural network model i. The learned index enables LISA to do only one comparison during each iteration in the best case. To search [G$, 0] by the learned index, we traverse down lower-level neural network models based on the outputs of the higher-level neural network models. Finally, a leaf neural network model predicts the position of [G$, 0] in the IP-BWT. However, if the predicted position does not contain [G$, 0], a linear search over the IP-BWT starts from the predicted position to find its actual position.

III. RELATED WORK AND DESIGN MOTIVATION
It is challenging for state-of-the-art FM-Index algorithms and accelerators to achieve high-throughput and power-efficient FM-Index searches.
A. Algorithm Inefficiency
Intractable Size of k-step FM-Index. We recorded the row IDs of 200 consecutive 1-step FM-Index search iterations in Figure 6(a), where 197 different rows are accessed. Because 1-step FM-Index processes only 1 DNA symbol in each iteration, in most cases, searching a DNA symbol by 1-step FM-Index requires one row activation. Prior accelerators [14], [15] for 1-step FM-Index expect no row buffer hit and thus adopt a close-page policy. Though k-step FM-Index can search k DNA symbols by activating a row, as Figure 6(b) shows, its data structure size exponentially increases with the step number k. For instance, 5-step FM-Index costs 105GB, while 6-step FM-Index occupies 374GB. As a result, 5-step FM-Index (FM-5) improves search throughput by only . × over 1-step FM-Index, as shown in Figure 6(d). Further enlarging the step number of k-step FM-Index (FM-6) decreases its search throughput improvement, due to more TLB misses.
Weakness of LISA Learned Index. LISA can search k DNA symbols after each row activation by its IP-BWT. Moreover, as Figure 6(b) shows, the size of LISA linearly increases with the step number k. However, the learned index of LISA suffers from low accuracy and a low cache hit rate. First, the LISA learned index has to index |G| IP-BWT entries, where |G| is the length of the reference genome. For a human genome, |G| = 3G. When the learned index of LISA predicts a wrong position, LISA has to linearly search up to |G| IP-BWT entries. As Figure 6(c) describes, on average, LISA has to search ∼K extra IP-BWT entries during each iteration, due to the low accuracy of its learned index. Consequently, compared to 1-step FM-Index, 21-step LISA (LISA-21) improves search throughput by only . × in Figure 6(d). But if LISA had a perfect learned index (LISA-21P) which always predicts the right position, LISA-21P would improve search throughput by . × over 1-step FM-Index. Further increasing the step number of LISA (LISA-32) also introduces more TLB misses, thereby achieving a smaller search throughput improvement. Second, traversing the learned index's hierarchical models is also pointer chasing. The LISA learned index consumes ∼ . GB. If we assume a perfect cache (100% hit) for LISA-21P (LISA-21PC), LISA-21PC improves search throughput by . × over 1-step FM-Index.

Algorithmic Takeaways. (1) Implementing k-step search with a moderately enlarged data structure is important for FM-Index. However, a larger step number, i.e., k, may not necessarily lead to better search throughput. (2) A more
accurate learned index reduces linear search overhead to improve search throughput. (3) A higher cache hit rate for the learned index also improves search throughput by reducing redundant memory accesses to the learned index.

Fig. 6. The inefficiency of prior FM-Index algorithms (a human genome dataset is used): (a) random FM-Index accesses; (b) DRAM overhead v.s. step; (c) inaccurate LISA-21; (d) throughput on the CPU baseline. In Figure 6(c), we present the maximal and minimal errors, the mean of errors, and the percentiles of errors. In Figure 6(d), FM-X means X-step conventional FM-Index; LISA-X means X-step LISA; LISA-XP means X-step LISA with a 100% accurate index; and LISA-XPC denotes X-step LISA with a 100% accurate index and a 100% hit cache.

Fig. 7. The address bus bottleneck of MEDAL.

B. Hardware Incapacity
FPGAs and ASICs. Most prior FM-Index hardware artifacts such as CPUs [12], GPUs [21], FPGAs [13], and ASICs [37] can accelerate only 1-step FM-Index searches. A recent FPGA design [30] supports 2-step FM-Index searches, while k-step LISA is built merely on CPUs. Existing FM-Index application-specific accelerators, including FPGAs [13], [30] and ASICs [37], can search only 1 or 2 DNA symbols after each DRAM row activation. Therefore, they cannot fully exploit the maximal DRAM bandwidth.

Processing-In-Memories. Though recent works [14], [15], [38], [39] propose PIM accelerators to process FM-Index searches, they cannot fully utilize the available DRAM bandwidth either. Most NVM-based PIMs [14], [38], [39] focus on processing simple arithmetic computations of FM-Index by NVM arrays, but ignore optimizing external memory accesses. For instance, a ReRAM-based FM-Index PIM, FindeR [14], has only 2.6GB memory arrays, so it still suffers from low DRAM bandwidth utilization when fetching FM-Index buckets that cannot fit into its internal arrays from external DRAMs. Only one DRAM PIM, MEDAL [15], modifies its DRAM main memory for higher FM-Index search throughput. Instead of processing multiple DNA symbols during a DRAM row activation, MEDAL enables chip-level parallelism, where each chip in a rank can independently process a DNA symbol by opening its partial row. In the ideal case, 16 chips in a rank can enlarge FM-Index search throughput by 16×, and each chip has only 1/16 row size. However, we find the DDR4 address bus shared by all chips in a rank seriously limits the search throughput of MEDAL. The DDR4 address bus is only 17-bit wide [40]. During each access, the row activation and the column access serially pass their addresses via the same 17-bit bus. As Figure 7 shows, MEDAL can sequentially activate a 1/16 partial row in one chip after another. But when activating a 1/16 partial row in a later chip, its row address (R-A) and the column address (C-A) of an earlier chip compete for the same address bus. The activation has to be delayed, and idle bubbles appear in the DDR4 data bus. Because of address bus conflicts, although MEDAL claims a × search throughput improvement over a multi-core CPU baseline, we observe it improves search throughput by only ×.

IV. EXMA

We first create a row-buffer-friendly alternative to FM-Index, the EXMA table, to process multiple DNA symbols in each search iteration. Then, we present a Multi-task-Learning (MTL)-based index to accelerate searches over an EXMA table. Compared to the LISA learned index, the MTL-based index is more accurate, although it has fewer parameters.

Fig. 8. A 2-step EXMA table (MAX means the end of the increments of a k-mer; and f_i indicates the number of increments of the i-th k-mer).

A. EXMA Table
Data Structure. The major reason why the learned index of LISA is not accurate is that it has to index |G| IP-BWT entries. To reduce the problem size for a learned index, we propose a novel data structure, the EXMA table. In each row of the Occ table, only one k-mer can increase its value by one. For instance, in the 2-step Occ table shown in Figure 4, only "AC" increases its value from 0 to 1 in row 1. Based on this observation, as Figure 8 shows, for each k-mer, an EXMA table stores only the row numbers of the Occ table where its value increases. For example, the value of "AA" increases in rows 2, 3, 6, and others. We store these row numbers as the increments of "AA". In total, we have f_0 "AA"s, f_1 "AC"s, ..., and f_15 "TT"s in the Occ table, where f_0, ..., f_15 are non-negative integers and Σ_i f_i = |G|. So in the EXMA table, each 2-mer has f_i increments, where 0 ≤ i ≤ 15. We add a symbol MAX to indicate the end of the increments of a k-mer, where MAX = |G| + 1. The increments of an EXMA table have the space complexity of O(|G| log(|G|)). For a 3G-base human genome, the increments occupy 12GB. We can consecutively store the increments of all k-mers to take advantage of row buffer locality. Each k-mer needs a base to point to its first increment. For instance, the base of "AC" is f_0 + 1, indicating its first increment is in position f_0 + 1. In total, a k-step EXMA table stores 4^k bases, each of which is related to a k-mer. Even if a k-mer has no increment, e.g., "TC", its base is set to MAX = |G| + 1. The bases of a k-step EXMA table have the space complexity of O(4^k log(|G|)).

Fig. 9. MTL-based EXMA index: (a) learned index; (b) MTL index.

Backward Search. Each iteration in a backward search computes Equation 1. The entire Count table costs only several bytes, so the bottleneck is the Occ table lookups. To compute Occ(k-mer, pos) in each search iteration, we compare pos against all increments of the k-mer and find the first increment larger than pos, where pos can be low or high. For instance, to compute Occ(AA, 4), we first read the base of "AA", which is 0. Then, we initialize a counter and start a linear search from position 0 of "AA" toward MAX. If an increment is smaller than 4, we increase the counter by 1. When 6 is found, we stop, since 6 > 4. At last, the counter value is 2, i.e., Occ(AA, 4) = 2.

Naïve Adoption of Learned Index. Since all increments of each k-mer are sorted, similar to LISA, we can adopt a learned index [29] to perform only one comparison to compute Occ(k-mer, pos) in the best case. We build a learned index model hierarchy for each k-mer that has more than a threshold number of increments. As Figure 9(a) shows, similar to LISA [28], to build learned indexes, we use a fixed ratio between the number of parameters of a learned index model and the number of increments that need to be indexed. As a result, if a k-mer has more increments, its learned index model has more parameters. For instance, the model of "TT" (m_TT) owns more weights and biases than that of "AA" (m_AA), since "TT" has more increments. To compute Occ(AA, pos) in a search iteration, we first read all the parameters of m_AA, and input only pos to the model. If the prediction is not correct, we start a linear search from the predicted position to find the correct position. However, since an EXMA table has to index |G| increments in total, the learned index of EXMA does not have higher accuracy than that of LISA.

Step. We tuned the step number k of an EXMA table to balance the DRAM overhead and search throughput. For a genome reference G, the size of the increments in an EXMA table is constant, while the size of the bases in the table grows exponentially with k. Although a small k has few bases, the search throughput is low. For instance, for a 3G-base human genome, if k = 2, the bases require only 32 bytes. But each time, only a 2-mer can be processed by a search
SA index incre base (a) DRAM overhead v.s. step
L I S A - 2 1E X M A - 1 4E X M A - 1 5E X M A - 1 6E X M A - 1 7E X M A - 1 5 M
Norm. throughput (b) Throughput on CPU baselineFig. 10. The trade-off of an EXMA table (EXMA-X means X-step EXMA).(a) 15 “A”s (cid:2)(cid:3)(cid:1)(cid:3) (cid:2)(cid:1) (cid:1)(cid:2) (cid:2)(cid:1) (cid:1)(cid:2) (cid:1)(cid:3) (cid:1)(cid:2) (cid:1)(cid:2) (cid:2)(cid:3) (cid:1)(cid:2) (cid:1)(cid:2) (cid:1)(cid:2) (cid:1)(cid:2) (cid:2)(cid:3) (cid:1) (cid:1)(cid:8)(cid:4)(cid:9)(cid:5)(cid:7)(cid:5)(cid:8)(cid:10)(cid:12)(cid:2)(cid:3)(cid:6)(cid:11)(cid:5)(cid:12) (b) 7 “AC”s and “A” (c) 7 “AT”s and “G”Fig. 11. Increment distributions of 15-mers. iteration. In contrast, a large k improves the search throughputbut significantly increases the number of bases and thus thesize of an EXMA table. For a human genome, as Figure 10(a)shows, a 15-step EXMA table ( ) costs 29.5GB, while a 16-step EXMA ( ) occupies 41.5GB. Increasing k from 15 to 16increases 12GB DRAM overhead. As Figure 10(b) describes,EXMA ( EXMA-15 ) achieves its best search throughput with15-step. Compared to
LISA-21 , EXMA-15 degrades searchthroughput by 7.3%, since it has a smaller step number.
B. Multi-task-Learning Index for EXMA
MTL-based Index . To more accurately approximate thecumulative distribution function of increments of each k -mer in an EXMA table with less parameters, we proposea Multi-Task-Learning [41]–[43] (MTL)-based index for theincrements of an EXMA table. As Figure 11 shows, theincrements of various k -mers in exhibit similarrandom distributions. Based on Stein’s paradox [44], it is moreaccurate to approximate many independent random distribu-tions using samples from all of them rather than approximatingthem separately. The MTL-based index is trained with theincrements of multiple k -mers to obtain superior accuracyover learning the increments of each k -mer independently.We adopt the hard-parameter-sharing MTL [43] that sharesa subset of parameters between the learned index models of k -mers having different numbers of increments. For instance,as Figure 9(b) shows, “AA” uses the smallest model ( m AA ) toindex its increments, since it has the fewest increments amongall 2-mers. “TT” has more increments, and thus a larger model( m T T ) which contains the entire m AA . Besides m AA , m T T comprises more levels of nodes to index its increments moreaccurately. Compared to the na¨ıve learned index, we add moreneurons in the hidden layers of each non-leaf node of a MTL-based index to accommodate two inputs, i.e., pos and k -mer.But the size of a MTL-based index is smaller that of the na¨ıvelearned index, since the k -mers share most parameters of aMTL-based index. Inference . The MTL-based index of an EXMA table effec-tively approximates
Occ(k-mer, pos) as

p = F(k-mer, pos) × f_{k-mer}    (3)

Fig. 12. Profiling EXMA-15: (a) increment percentage, (b) search time breakdown.

where p is the predicted position of pos in the increments of the k-mer; F(·, ·) is the neural network model hierarchy estimating the probability of observing an increment ≤ pos; and f_{k-mer} is the number of increments of the k-mer, which can be stored along with the k-mer base. After each inference, we read both p and p+1 to check whether the prediction is correct. If not, we start a linear search to find the correct position. Therefore, the accuracy of a MTL-based index decides the search throughput of FM-Index, but has no impact on the quality of the final DNA mapping. A MTL-based index model hierarchy is a tree structure. To keep the index size in check, we deploy simple linear regression models [45] as leaf nodes in the model hierarchy of the MTL-based index. A linear regression model contains only one weight and one bias. Each non-leaf node is a neural network with a fully-connected layer of 10 neurons with sigmoid activation.

Training. The MTL-based index is built for the increments of p k-mers {T_i}, i = 1..p. For a k-mer T_i, its training dataset includes f_i increments {x_{i,j}} and their positions {y_{i,j}}, j = 1..f_i. The learning function for the k-mer T_i is defined as w_i^T x + b_i. Based on [42], [43], [46], we formulate the loss function to learn the relations between k-mers as

min_{W, b} Σ_{i=1}^{p} (β_i / f_i) Σ_{j=1}^{f_i} l(w_i^T x_{i,j} + b_i, y_{i,j})    (4)

where W = (w_1, . . . , w_p); b = (b_1, . . . , b_p)^T; l(·, ·) is the cross-entropy loss function; and β_i is the importance of k-mer T_i. We train the MTL-based index to minimize this loss with an Adam optimizer. Similar to LISA [28], the training and testing of a MTL-based index use the same dataset, the EXMA table of a reference. Training a MTL-based index for a reference typically takes ∼ days. Once a MTL-based index is trained, billions of exact-match operations can happen on its reference.
Fig. 13. Prediction errors of learned and MTL indexes.
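As a concrete illustration, the predict-then-verify procedure built around Equation (3) can be sketched in software as follows. This is a minimal sketch under stated assumptions: `predict_cdf` is a hypothetical stand-in for the MTL model hierarchy F(k-mer, pos), and the sorted increment list is toy data, not the paper's implementation.

```python
# Sketch of Equation (3) plus the verification step described above.
# `predict_cdf` stands in for F(k-mer, pos): it returns an estimate in
# [0, 1] of the fraction of the k-mer's (sorted) increments that are <= pos.

def occ(predict_cdf, increments, kmer, pos):
    """Occ(k-mer, pos): how many increments of `kmer` are <= pos."""
    f = len(increments)                      # f_{k-mer}: increment count
    c = int(predict_cdf(kmer, pos) * f)      # prediction, Equation (3)
    c = max(0, min(c, f))                    # clamp into the valid range

    def correct(c):
        # Reading the entries around the predicted position verifies the
        # prediction, mirroring the "read both p and p+1" check above.
        return (c == 0 or increments[c - 1] <= pos) and \
               (c == f or increments[c] > pos)

    if correct(c):
        return c
    # Misprediction: fall back to a linear search. Its cost grows with
    # the prediction error, which is why index accuracy decides search
    # throughput (but never the correctness of the final mapping).
    step = 1 if (c < f and increments[c] <= pos) else -1
    while not correct(c):
        c += step
    return c
```

With a perfectly accurate model the linear search never runs; the profiling in Figures 12 and 13 quantifies how far a naive learned index strays from that ideal.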
MTL-based Index Performance. We profiled the performance of EXMA-15 equipped with a naive learned index in Figure 12. As Figure 12(a) shows, only 2.5E-5% and 4E-6% of 15-mers have 64K∼256K and >1M increments, respectively. However, searching the 15-mers having 64K∼256K and >1M increments consumes 36% and 20.5% of the search time, respectively, as shown in Figure 12(b). This is because the naive learned index predicts many wrong positions, and the linear search overhead is large. The prediction errors of the naive learned index for the 15-mers having 64K∼256K (learn-256K) and >1M (learn-1M) increments are shown in Figure 13. On average, we have to search 917 and 2133 more increments to find the correct one for the 15-mers having 64K∼256K and >1M increments, respectively. The MTL-based index greatly improves index prediction accuracy by simultaneously learning from multiple 15-mers having similar amounts of increments. The MTL-based index (MTL) further reduces the mean prediction error to 45 and 182 for the 15-mers having 64K∼256K and >1M increments, respectively. As a result, the MTL-based index (EXMA-15M) improves search throughput over LISA-21 with only half the number of parameters, as shown in Figure 10(b).

Fig. 14. The architecture of the EXMA accelerator.
C. EXMA Accelerator
We propose a hardware accelerator to process search operations over an EXMA table using a MTL-based index. We integrate two caches for the bases and the MTL-based index of an EXMA table to reduce unnecessary DRAM accesses. We then present EXMA scheduling to improve the cache hit rates. We also introduce a dynamic page policy to improve the DRAM row buffer hit rate while searching the increments of an EXMA table. Finally, we create CHAIN compression to reduce the EXMA table size.
1) Accelerator Architecture
The architecture of our EXMA accelerator is shown in Figure 14. The kernel of the EXMA accelerator is an inference engine computing predictions of a MTL-based index. We adopt the state-of-the-art Tangram neural network accelerator [47] as the inference engine. The Tangram accelerator consists of a number of processing elements (PEs) organized in a 2D array. Each PE includes a simple ALU for multiply-accumulate (MAC) operations and a small 32B register file. A larger SRAM buffer is shared by all PEs. We add two small caches for the bases and the MTL-based index of an EXMA table. Data fetched and stored by the accelerator go through a de/compression unit (§IV-C4) that de/compresses the bases and increments of an EXMA table. We integrate a scheduling queue into the EXMA accelerator to implement 2-stage EXMA scheduling (§IV-C2). Finally, the dynamic page management (§IV-C3) switching between open- and close-page policies is implemented in the CPU memory controller.
2) EXMA Scheduling
Poor Locality of MTL-based Index. The conventional first-ready first-come-first-serve (FR-FCFS) policy is adopted by almost all accelerators [13]-[15], [30], [37], [38] to schedule FM-Index searches. However, FR-FCFS significantly degrades the hit rates of our base cache and index cache.

Fig. 15. The low cache hit rate of the MTL-based index: (a) base array, (b) MTL index, (c) FR-FCFS.
Fig. 16. The 2-stage EXMA scheduling: (a) the 1st stage, (b) the 2nd stage.
We show an example of FR-FCFS in Figure 15, where there are 4 FM-Index requests in the form of [k-mer, pos]: R1 = [TTTT, pos1], R2 = [AAAA, pos2], R3 = [TTTG, pos3], and R4 = [AAAC, pos4]. As Figure 15(a) shows, all bases are stored consecutively in the lexicographical order of k-mers. Each base occupies 4 bytes, so the bases of AAAA and AAAC (TTTG and TTTT) are stored in the same 64-byte line. The MTL-based index used by the four requests is shown in Figure 15(b); predicting the increments of a pos value requires a path of three index nodes. Assume the base cache can contain only one 64-byte line, while the index cache can store three index nodes. The four requests are scheduled by FR-FCFS in Figure 15(c). When R1 arrives, both caches are empty and thus miss. The bases of "TTTG" and "TTTT" are then stored in the base cache, while the three index nodes used by R1 are stored in the index cache. For R2, both caches also miss: the bases of "AAAA" and "AAAC" are fetched into the base cache, while the three index nodes used by R2 are installed into the index cache. For R3, the base cache misses, but the index cache hits. Finally, both caches miss for R4. In total, four misses happen in the base cache, and three misses occur in the index cache.

2-Stage EXMA Scheduling. We propose a 2-stage EXMA scheduling technique to enhance the hit rates of the base and index caches. Unlike FR-FCFS, which schedules requests based on their addresses and arrival order, EXMA re-orders the requests according to their data, i.e., their k-mers and positions (pos). In the first stage, EXMA sorts the FM-Index requests by their k-mers. By consecutively issuing FM-Index requests in the lexicographic order of their k-mers, the hit rate of the base cache increases, since the bases of k-mers sorted in lexicographic order tend to have stronger spatial locality. As Figure 16(a) shows, the first stage of EXMA scheduling issues the four requests in the order R2, R4, R3, and R1. The base cache has two hits and two misses, but the index cache has all four misses. This is why EXMA needs the second stage of scheduling. During the second stage of EXMA scheduling, the four requests are ranked according to their pos values.
Through consecutively computing inferences of the MTL indexes of requests having small differences between their pos values, the index cache can expect more hits: the smaller the difference between the pos values of two requests, the more likely the two requests use the same MTL index nodes during their searches. As Figure 16(b) shows, the index cache now has two misses and two hits. In total, the 2-stage EXMA scheduling incurs 4 misses and 4 hits.

Implementation. Our 2-stage EXMA scheduling is implemented with the scheduling queue, a content-addressable memory (CAM). A CAM can perform sorting operations [48]. The k-mer and pos of a request are stored in one row of the CAM. Each DNA symbol in a k-mer is denoted by 3 bits, since we need to encode $, A, C, T and G. Requests can be sorted in the CAM based on their k-mers or pos values.
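In software terms, the two scheduling stages amount to two sorts over the pending requests. The sketch below is illustrative only: the per-batch grouping and all names are assumptions for exposition, not the CAM's exact behavior.

```python
# Toy model of 2-stage EXMA scheduling: requests are (k-mer, pos) pairs.

def two_stage_schedule(requests, batch=4):
    # Stage 1: sort by k-mer, so requests whose bases sit in the same
    # 64-byte line are issued back-to-back (base-cache locality).
    stage1 = sorted(requests, key=lambda r: r[0])
    # Stage 2: rank requests by pos, so inferences that reuse the same
    # MTL index nodes run consecutively (index-cache locality).
    return [sorted(stage1[i:i + batch], key=lambda r: r[1])
            for i in range(0, len(stage1), batch)]

reqs = [("TTTT", 90), ("AAAA", 10), ("TTTG", 85), ("AAAC", 12)]
batches = two_stage_schedule(reqs)
# Stage 1 orders the k-mers lexicographically: AAAA, AAAC, TTTG, TTTT,
# matching the lexicographic issue order of Figure 16(a).
```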
3) Dynamic Page Policy
Dynamic Page Policy. Prior FM-Index accelerators [13]-[15], [30], [37], [38] adopt close-page policy in their DRAM main memories: they always pre-charge a DRAM row after each access, since conventional 1-step FM-Index searches have little spatial locality, as shown in Figure 6(a). On the contrary, our EXMA table stores the increments of a k-mer consecutively in DRAM rows. As the FM-Index backward search algorithm shows, the two requests of each search iteration target the same k-mer but with different position values: in a search iteration, we compute Occ(k-mer, low) and Occ(k-mer, high), and both search the increments of the same k-mer, which are very likely stored in the same row. So our accelerator asks the CPU memory controller (MC) to keep the DRAM row open after the first request in a search iteration is processed, since we expect the second request to hit in the row buffer. The row is pre-charged after the second request is processed.

Implementation. The dynamic page policy can be implemented in the CPU MC. When searching a request, if another request pending in the scheduling queue of the accelerator searches the same k-mer, the CPU MC keeps the DRAM row open after the ongoing request completes. Otherwise, it pre-charges that DRAM row. For each bank, the CPU MC maintains one register to indicate whether all rows are closed, and another register to record which row is open.
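To make the access pattern concrete, one k-step backward-search iteration looks roughly as follows. This is a hedged sketch: `occ` and `counts` are hypothetical stand-ins for the EXMA table lookups, not the paper's exact interfaces.

```python
# Each iteration performs two Occ lookups on the SAME k-mer (for low and
# high), so the second lookup very likely hits the DRAM row that the
# first one opened, which is exactly the case the dynamic page policy
# exploits before pre-charging the row.

def backward_search_iteration(occ, counts, kmer, low, high):
    low = counts[kmer] + occ(kmer, low)     # first access: row opened
    high = counts[kmer] + occ(kmer, high)   # second access: row-buffer hit
    return low, high                        # row can be pre-charged now
```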
4) CHAIN Compression

BΔI. The state-of-the-art cache compression technique, BΔI [49], breaks a 64-byte cache line into eight 8-byte data sections. As Figure 17(a) shows, to compress the cache line, BΔI records the first data section (data_0) and stores only the difference (Δ_i) between each data section (data_i) and data_0. To decompress a data section, BΔI simply calculates data_i = data_0 + Δ_i. BΔI typically reduces the data size of SPEC-CPU2006 applications by ∼ . In these applications, the data sections in a cache line are not sorted.

Fig. 17. The CHAIN compression: (a) BΔI, (b) compression and decompression of CHAIN.

CHAIN. Since both the increments and the bases of each k-mer are sorted and stored consecutively in a DRAM row, we believe they are more compressible than the data of SPEC-CPU2006 applications. Therefore, we propose a novel compression technique, CHAIN, to more aggressively compress an EXMA table. As Figure 17(b) describes, to compress the increments in a 64B memory line, CHAIN first stores the first increment (incr_0), and then stores the value difference (Δ_i) between every two consecutive increments, i.e., Δ_i = incr_i − incr_{i−1}. To decompress an increment incr_i in a 64B memory line, CHAIN simply computes incr_i = incr_0 + Σ_{j=1}^{i} Δ_j. The bases of an EXMA table can be (de)compressed in the same way.

Implementation. The CHAIN compression and decompression require only 64-bit adders. Multiple adders operate concurrently for CHAIN compression, while CHAIN decompression requires only one adder for accumulations. CHAIN decompression slightly prolongs FM-Index search latency but greatly increases FM-Index search throughput.
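The compression and decompression formulas above can be modeled in a few lines. This is an illustrative software sketch of one line of sorted increments; the hardware operates with plain 64-bit adders rather than Python lists.

```python
from itertools import accumulate

def chain_compress(incrs):
    # Keep incr_0, then store delta_i = incr_i - incr_{i-1}. Because the
    # increments in a line are sorted, every delta is small and non-negative.
    return [incrs[0]] + [b - a for a, b in zip(incrs, incrs[1:])]

def chain_decompress(line):
    # incr_i = incr_0 + sum_{j=1..i} delta_j, i.e., a running prefix sum.
    return list(accumulate(line))

line = [1000, 1002, 1007, 1007, 1019]   # sorted increments of one k-mer
assert chain_decompress(chain_compress(line)) == line
```

Unlike BΔI, which stores each section's delta from the first section, CHAIN chains each delta to its immediate predecessor, which keeps the deltas of sorted data even smaller.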
5) Putting all together
As Figure 14 describes, ❶ after receiving FM-Index requests from the CPU, the accelerator stores them in its scheduling queue and performs the first-stage scheduling. ❷ Based on their k-mers, the accelerator checks whether the bases of the requests stored in the queue are in the base cache. ❸ If misses occur, DRAM accesses are issued to fetch the bases; otherwise, the accelerator performs the second-stage scheduling. ❹ The accelerator checks whether the MTL index nodes required by the requests in the queue are in the index cache, according to both their k-mer and pos values. ❺ If misses happen, DRAM accesses are issued to fetch MTL index nodes; otherwise, the inference engine computes with the MTL index nodes. ❻ Once the inference of a leaf node is finished, the accelerator issues a DRAM access to read the increment at the predicted position. If the increment is correct, the computation of Occ(k-mer, pos) is completed; otherwise, it linearly searches DRAM rows to find the correct increment. ❼ All DRAM accesses from the EXMA accelerator are managed by its DMA controller, which communicates with the CPU MC; the MC decides whether to pre-charge opened rows based on the requests in the scheduling queue of the accelerator.
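The seven-step flow can be condensed into a small control loop. The model below is purely illustrative: the caches are modeled as sets, DRAM lookups as dictionaries, and `occ` as a black box, all hypothetical stand-ins for the hardware blocks in Figure 14.

```python
def process_requests(requests, base_lines, index_nodes, occ):
    """Toy model of steps 1-6: schedule, probe caches, infer, verify."""
    base_cache, index_cache, dram_reads = set(), set(), 0
    requests.sort(key=lambda r: r[0])            # step 1: sort by k-mer
    for kmer, _ in requests:                     # steps 2-3: base cache
        line = base_lines[kmer]                  # which 64B line holds it
        if line not in base_cache:
            dram_reads += 1                      # miss -> DRAM access
            base_cache.add(line)
    requests.sort(key=lambda r: r[1])            # stage-2: sort by pos
    results = []
    for kmer, pos in requests:                   # steps 4-6
        node = index_nodes[(kmer, pos)]          # MTL node(s) for this pos
        if node not in index_cache:
            dram_reads += 1                      # miss -> fetch index nodes
            index_cache.add(node)
        results.append(occ(kmer, pos))           # inference + verification
    return results, dram_reads
```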
TABLE I
THE HARDWARE CONFIGURATION OF EXMA.

Component     | Description              | Area (mm²) | Energy/Op (pJ)
Infer. engine | 4 × PE arrays            | 0.512      | 0.25
Sch. queue    | SRAM, 128-bit × 256      | 0.023      | 1.9
Index cache   | SRAM, 32KB, 16-way       | 0.084      | 2.62
Base cache    | eDRAM, 1MB, 8-way        | 0.667      | 17.2
De/compress   | 32 64-bit adders         | 0.091      | 0.21
Sch. & row    | 2-stage sch. & dyn. page | 0.035      | 1.02
DMA ctrl      | adopted from [52]        | 0.21       | 3.42
EXMA accelerator: area 1.62 mm², leakage 223.8 mW

CPU: GHz, 16-core, MB LLC, 64 LLC MSHRs
DRAM main memory system: DDR4-2400, 384GB, 4 channels, 3 DIMMs/channel, 4 ranks/DIMM, 2 bank groups/rank, 2 banks/bank group, 16 chips/rank, 2 chips/data buffer, FR-FCFS, close-page, row size 2KB, tRCD-tCAS-tRP: 16-16-16
6) System Integration
Our EXMA accelerator is connected to a CPU processor as a loosely-coupled non-coherent accelerator [50], [51] by a Network-on-Chip (NoC). EXMA accesses DRAM via two DMA-dedicated planes of the NoC, bypassing the cache hierarchy of the CPU. The EXMA data region must be flushed from the CPU cache hierarchy before FM-Index searches start. We chose the non-coherent model [50] for better performance, since the memory footprint of FM-Index searches is always larger than the CPU LLC capacity.
D. Design Overhead
The training overhead of a MTL-based index is discussed under Training in §IV-B. The hardware overhead of the EXMA accelerator is summarized in Table I. From [47], we adopted the inference engine, which runs at MHz and is synthesized with a Synopsys nm generic library. We quantized the model hierarchy of the MTL index to 8-bit without accuracy degradation. A PE has an 8-bit MAC ALU and a 32B register file. The inference engine contains 4 PE arrays, each of which has a 16KB shared SRAM buffer and an activation unit. The EXMA accelerator also includes a scheduling queue (SRAM CAM) with 512 128-bit entries, a 32KB 16-way SRAM index cache, and a 1MB 8-way eDRAM base cache. We modeled the area and power of memory components including registers, buffers and caches with CACTI [53]. We used the same DDR4-2400 DRAM main memory configuration as the recent FM-Index PIM MEDAL [15]. The EXMA accelerator connects to four DRAM channels with a total capacity of 384GB. We adopted DRAMPower [54] to model the power consumption of our DDR4 main memory.

V. EXPERIMENTAL METHODOLOGY
Simulation. We used gem5-aladdin [55] to model our CPU baseline and our EXMA accelerator. The configuration of our CPU baseline is shown in Table I. We used McPAT [56] to calculate the power consumption of the CPU processor. We implemented the main memory system using DRAMsim2 [57].
Accelerator Baselines. Besides CPU, we also ran the FM-Index search kernel on an Nvidia Tesla P100 GPU [58], a Stratix-V FPGA [30], and a nm ASIC design [37]. We compared EXMA against the recent FM-Index PIM designs MEDAL [15] and FindeR [14]. We used gem5-gpu [59] to simulate the GPU, and gem5-aladdin to model the computing units of the FPGA, ASIC and PIMs. All their DRAM main memories are implemented by DRAMsim2. The power data of the GPU, FPGA, ASIC and PIMs are adopted from [14], [15], [30], [37], [58]. The power of DRAM main memories is also modeled by DRAMPower.

Fig. 18. The search throughput of EXMA (norm. to CPU).
Workloads. To evaluate EXMA, we adopted FM-Index-based genome analysis applications: BWA-MEM [12] for short read alignment, MA [16] for long read alignment, SGA [24] for read assembly, ExactWordMatch [25] for annotation, and a reference-based compression algorithm [26]. SGA for long reads uses an FM-Index-based error correction scheme [33] to reduce errors.
Datasets. For alignment, annotation and compression, we used human (human, 3G-bp), picea glauca (picea, 20G-bp), and pinus lambertiana (pinus, 31G-bp) genomes as reference genomes. To study short reads, we adopted DWGSim [60] to generate 101-bp short reads with × coverage. To evaluate long reads, we created 1K-bp long reads with PBSIM [61]. The error profiles of the reads are summarized in the format of (name, mismatch%, insertion%, deletion%, total error%), i.e., (Illumina, 0.18%, 0.01%, 0.01%, 0.2%) [62], (PacBio, 1.50%, 9.02%, 4.49%, 15.01%), and (ONT 2D, 16.50%, 5.10%, 8.40%, 30.0%) [10].

Schemes. The schemes we studied can be summarized as:
• CPU: We ran LISA-21 for FM-Index searches in genome applications on our CPU baseline. We also applied BΔI [49] compression on the LISA data for the three datasets.
• EXMA-15: EXMA-15 with the MTL-based index and CHAIN compression is used to replace LISA-21 in CPU.
• EX-acc: We ran EXMA-15 on the EXMA accelerator.
• EX-2stage: 2-stage scheduling is added to EX-acc.
• EXMA: Dynamic page policy is enabled on EX-2stage.

VI. RESULTS AND ANALYSIS
Throughput Comparison against CPU. As Figure 18 shows, we compare the FM-Index search throughput of EXMA and CPU by running the seeding of short read alignment, since FM-Index searches consume 99% of the seeding time in short read alignment. Compared to CPU, on average, EXMA-15 improves search throughput by 80%. Our MTL-based index achieves high accuracy on picea, since the increment distributions of its different k-mers are more similar to each other. EX-acc improves search throughput by . × over CPU: our EXMA accelerator can support more concurrent search operations, while CPU has only a limited number of LLC MSHRs. Compared to CPU, EX-2stage increases search throughput by ×. Pinus with EX-2stage has the smallest throughput improvement, because the size of the pinus MTL-based index is the largest among the 3 datasets, and thus its index cache has the lowest hit rate. On average, EXMA increases search throughput by . × over CPU.

Fig. 19. The speedup of EXMA in genome analysis (norm. to CPU).

Performance Comparison against CPU. We report and compare the speedup achieved by
EXMA in various genome applications in Figure 19, where we list 3 sets of "alignment and assembly" for reads generated by Illumina, Nanopore, and PacBio respectively. For each application, although EXMA obtains a smaller FM-Index search throughput improvement on larger datasets (Figure 18), e.g., pinus, EXMA improves the application performance more significantly on larger datasets. This is because CPU consumes a larger portion of the execution time of a genome analysis application performing FM-Index searches when processing larger datasets, which introduce more TLB and data cache misses. EXMA achieves larger performance improvements on alignment and assembly for Illumina, annotation, and compression, since FM-Index searches dominate the execution of these applications. On average, EXMA improves the performance of genome applications by . × ∼ . × when processing various datasets.

Fig. 20. The energy reduction of EXMA in genome analysis (norm. to CPU; DRAM-chip/IO indicates the energy of DRAM chips/DDR4 interface; EXMA-leak/dyn is the static/dynamic energy of the EXMA accelerator).
Energy Comparison against CPU. We compare the energy reduction obtained by EXMA in various genome applications in Figure 20, where we list 3 sets of "align(ment) and asse(mbly)" for reads generated by Illumina, Nanopore, and PacBio respectively. On average, EXMA reduces the total energy consumption of genome analysis by ∼ when processing different datasets. The major part of the energy reduction comes from avoiding the CPU processor during FM-Index searches. The more time FM-Index searches consume in a genome analysis application, the more energy reduction EXMA can achieve in that application. On average, the EXMA accelerator consumes only < of the total energy consumption of various genome applications; the vast majority of energy is consumed by the DRAM main memory and by the CPU handling the non-FM-Index-search parts of genome analysis applications.

Comparison against Accelerators. We evaluated EXMA and compared it against various hardware accelerators, including a GPU [58], an FPGA [30], an ASIC [37], and two PIMs [14], [15], when processing pinus in Table II. Not all accelerators can run whole genome applications, so we use "million bases per second" (Mbase/s) as the performance metric to evaluate only FM-Index search throughput.

Fig. 21. Bandwidth utilization.
Fig. 22. Design space exploration (norm. to EXMA).
Fig. 23. CHAIN on pinus.

TABLE II
THE COMPARISON OF ACCELERATORS WHEN PROCESSING PINUS.

              | GPU     | FPGA [30] | ASIC [37] | MEDAL [15] | FindeR [14] | EXMA
Algorithm     | LISA-21 | FM-2      | FM-1      | FM-1       | FM-1        | EXMA-15
Mem (GB)      | 384     | 384       | 384       | 384        | 384         | 384
Acc Power (W) | 182     | 11        | 9.4       | 0.011      | 0.28        | 0.89
Mem Power (W) | 72      | 72        | 72        | 72         | 72          | 72
Mbase/s       | 157     | 96        | 34        | 102        | 93          |
Mbase/s/Watt  | 0.61    | 0.6       | 0.42      | 1.42       | 1.28        |

Different accelerators adopt different search algorithms. We implemented
LISA-21 on the Tesla P100 GPU. The FPGA design [30] supports
FM-2 (conventional 2-step) searches. EXMA performs
EXMA-15 searches. The other accelerators can conduct only
FM-1 searches. Since the capacity of the internal memories of the GPU (16GB HBM) and the PIM FindeR (2.6GB ReRAM) is smaller than the pinus data size, we provide the same DDR4 main memory configuration shown in Table I to all accelerators. The power values of both the accelerator (Acc Power) and the DDR4 main memory (Mem Power) are listed in Table II. The search performance is decided by two factors: the memory bandwidth utilization, and the number of DNA symbols searched during each iteration. For bandwidth utilization, only EXMA supports dynamic page policy, while the others use only close-page policy. MEDAL can support chip-level parallelism, but its search throughput is limited by the address bus. So EXMA achieves the highest bandwidth utilization. For the number of DNA symbols searched during each iteration, only the GPU and EXMA can process more than one DNA symbol per iteration, but the low accuracy of the learned index makes the GPU search many unnecessary IP-BWT entries to find the correct one. Therefore, EXMA obtains the best search throughput. The throughput per Watt of the GPU and FPGA designs is low, since the power consumption of their computing parts is not trivial. For the PIMs and EXMA, the power consumption is dominated by the DRAM main memory. Compared to the PIM MEDAL, EXMA improves search throughput per Watt by . ×.

Bandwidth Utilization. Figure 21 shows the comparison of bandwidth utilization, which is defined as the ratio between the data fetched from DRAM and the total DRAM bandwidth. The ASIC using FM-1 utilizes only 26% of the total DRAM bandwidth, since it uses close-page policy and fetches only 64B after activating a 2KB row. The GPU implementing LISA-21 fetches entire rows from host memory, so it achieves higher bandwidth utilization. MEDAL increases bandwidth utilization by activating individual chips; however, due to conflicts on the address bus, MEDAL uses only 67% of the DRAM bandwidth. In contrast, EXMA obtains 91% bandwidth utilization by switching between close-page and open-page policies.
DIMM number. We studied the sensitivity of EXMA and MEDAL to the DIMM number in Figure 22. With 2 DIMMs, EXMA improves search throughput by 29% over MEDAL. With 3 DIMMs, EXMA linearly scales its throughput up by 40%, since a single EXMA accelerator can maintain all the DIMMs. However, MEDAL increases its throughput by only 14.5% with 3 DIMMs: each MEDAL PIM accelerator sits on a DIMM, more DIMMs bring more ranks, and MEDAL suffers from non-trivial inter-DIMM communication overhead. When the number of DIMMs increases to 4, the search throughput of neither EXMA nor MEDAL increases significantly; the data bus (bandwidth utilization) of EXMA is saturated, while the address bus of MEDAL is saturated.
PE Array number. We show the search throughput of EXMA with a varying number of PE arrays in Figure 22. Two PE arrays already achieve 89% of the search throughput of the configuration with four PE arrays, because the computations of MTL-based indexes are not intensive. Further increasing the PE array number to 8 increases search throughput by only 3% over the configuration with four PE arrays. So we selected 4 PE arrays in our baseline configuration.
CAM & Cache. We explored the search throughput of EXMA with varying CAM and base cache sizes in Figure 22. We used CAMs of 256, 512, and 1024 entries as the scheduling queue of EXMA. A 256-entry CAM cannot fully satisfy 2-stage scheduling and the scheduling for dynamic page policy, and thus achieves only 77% of the search throughput of the configuration with a 512-entry CAM. Further increasing the CAM entry number to 1K improves search throughput by only 2% over the configuration with a 512-entry CAM. Compared to the index cache, the search throughput is more sensitive to the capacity of the base cache, since the global buffer and register files of the PE arrays can temporarily store MTL-based index nodes. We tried 512KB, 1MB and 2MB for the base cache; the search throughput stops increasing significantly when the base cache capacity reaches 1MB. So we selected a 512-entry CAM and a 1MB base cache in our baseline configuration.
CHAIN. We show the compression results of CHAIN on pinus in Figure 23. Since the size of the IP-BWT table of LISA-21 is proportional to its step number, the total data size of LISA-21 (original) is . × larger than that of EXMA-15 (original). BΔI compresses the data size of LISA-21 to 50%, i.e., 152GB. In contrast, CHAIN compresses the data size of EXMA-15 to only 25%, i.e., 40GB. We observed similar compression rates of BΔI and CHAIN on the other genome datasets.

VII. CONCLUSION
Though state-of-the-art genome analysis adopts FM-Index to process exact-matches, FM-Index is notorious for its random memory access patterns. In this paper, we first present a row-buffer-friendly and highly-compressible EXMA table with a MTL-based index to process multiple DNA symbols with each DRAM row activation during a search iteration. We then build a hardware accelerator to process FM-Index searches over an EXMA table. Compared to the state-of-the-art FM-Index PIM MEDAL, EXMA improves search throughput by . ×, and enhances search throughput per Watt by . ×.

REFERENCES

[1] M. Schirmer, U. Z. Ijaz, R. D'Amore, N. Hall, W. T. Sloan, and C. Quince, "Insight into biases and sequencing errors for amplicon sequencing with the illumina miseq platform,"
Nucleic acids research ,vol. 43, no. 6, pp. e37–e37, 2015.[2] J. J. Mosher, B. Bowman, E. L. Bernberg, O. Shevchenko, J. Kan,J. Korlach, and L. A. Kaplan, “Improved performance of the pacbiosmrt technology for 16s rdna sequencing,”
Journal of microbiologicalmethods , vol. 104, pp. 59–60, 2014.[3] M. Eisenstein, “Oxford nanopore announcement sets sequencing sectorabuzz,”
Nature Biotechnology , vol. 30, no. 4, pp. 295–297, 2012.[4] J. D. Merker, A. M. Wenger, T. Sneddon, M. Grove, Z. Zappala,L. Fresard, D. Waggott, S. Utiramerur, Y. Hou, K. S. Smith et al. ,“Long-read genome sequencing identifies causal structural variation in amendelian disease,”
Genetics in Medicine , vol. 20, no. 1, p. 159, 2018.[5] X. Ma, M. Mau, and T. F. Sharbel, “Genome editing for global foodsecurity,”
Trends in biotechnology , 2017.[6] T. Hoenen, A. Groseth, K. Rosenke, R. J. Fischer, A. Hoenen, S. D.Judson, C. Martellaro, D. Falzarano, A. Marzi, R. B. Squires et al. ,“Nanopore sequencing as a rapidly deployable ebola outbreak tool,”
Emerging infectious diseases , vol. 22, no. 2, p. 331, 2016.[7] J. Quick, N. D. Grubaugh, S. T. Pullan, I. M. Claro, A. D. Smith,K. Gangavarapu, G. Oliveira, R. Robles-Sikisaka, T. F. Rogers, N. A.Beutler et al. , “Multiplex pcr method for minion and illumina sequencingof zika and other virus genomes directly from clinical samples,” natureprotocols , vol. 12, no. 6, p. 1261, 2017.[8] N. Zhu, D. Zhang, W. Wang, X. Li, B. Yang, J. Song, X. Zhao, B. Huang,W. Shi, R. Lu et al. , “A novel coronavirus from patients with pneumoniain china, 2019,”
New England Journal of Medicine , 2020.[9] S. Canzar and S. L. Salzberg, “Short read mapping: An algorithmictour,”
Proceedings of the IEEE , vol. 105, no. 3, pp. 436–458, 2017.[10] Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A genomics co-processor provides up to 15,000x acceleration on long read assembly,” in
ACM Architectural Support for Programming Languages and OperatingSystems , 2018.[11] D. Fuijiki, A. Subramaniyan, T. Zhang, Y. Zheng, R. Das, D. Blaauw,and S. Narayanasamy, “Genax: A genome sequencing accelerator,” in
IEEE/ACM International Symposium on Computer Architecture , 2018.[12] H. Li, “Aligning sequence reads, clone sequences and assembly contigswith bwa-mem,” arXiv preprint arXiv:1303.3997 , 2013.[13] M. C. F. Chang, Y. T. Chen, J. Cong, P. T. Huang, C. L. Kuo,and C. H. Yu, “The smem seeding acceleration for dna sequencealignment,” in
IEEE International Symposium on Field-ProgrammableCustom Computing Machines FCCM , 2016, pp. 32–39.[14] F. Zokaee, M. Zhang, and L. Jiang, “Finder: Accelerating fm-index-based exact pattern matching in genomic sequences through reramtechnology,” in
International Conference on Parallel Architectures andCompilation Techniques , 2019, pp. 284–295.[15] W. Huangfu, X. Li, S. Li, X. Hu, P. Gu, and Y. Xie, “Medal: Scalabledimm based near data processing accelerator for dna seeding algorithm,”in
IEEE/ACM International Symposium on Microarchitecture , 2019, p.587–599.[16] M. Schmidt, K. Heese, and A. Kutzner, “Accurate high throughput align-ment via line sweep-based seed processing,”
Nature Communications, vol. 10, no. 1, pp. 1–10, 2019.
[17] A. Nag, C. Ramachandra, R. Balasubramonian, R. Stutsman, E. Giacomin, H. Kambalasubramanyam, and P.-E. Gaillardon, "GenCache: Leveraging in-cache operators for efficient sequence alignment," in IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 334–346.
[18] R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, "A resistive CAM processing-in-storage architecture for DNA sequence alignment," IEEE Micro, 2017.
[19] A. Madhavan, T. Sherwood, and D. Strukov, "Race logic: A hardware acceleration for dynamic programming algorithms," in IEEE/ACM International Symposium on Computer Architecture, 2014.
[20] E. Rucci, C. Garcia, G. Botella, A. De Giusti, M. Naiouf, and M. Prieto-Matias, "Accelerating Smith-Waterman alignment of long DNA sequences with OpenCL on FPGA," in International Conference on Bioinformatics and Biomedical Engineering. Springer, 2017, pp. 500–511.
[21] R. Luo, T. Wong, J. Zhu, C.-M. Liu, X. Zhu, E. Wu, L.-K. Lee, H. Lin, W. Zhu, D. W. Cheung et al., "SOAP3-dp: Fast, accurate and sensitive GPU-based short read aligner," PLoS ONE, vol. 8, no. 5, p. e65632, 2013.
[22] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," 1994.
[23] N. Ahmed, K. Bertels, and Z. Al-Ars, "A comparison of seed-and-extend techniques in modern DNA read alignment algorithms," in IEEE International Conference on Bioinformatics and Biomedicine, 2016, pp. 1421–1428.
[24] J. T. Simpson and R. Durbin, "Efficient de novo assembly of large genomes using compressed data structures," Genome Research, vol. 22, no. 3, 2012.
[25] J. Healy, E. E. Thomas, J. T. Schwartz, and M. Wigler, "Annotating large genomes with exact word matches," Genome Research, vol. 13, no. 10, pp. 2306–2315, 2003.
[26] P. Prochazka and J. Holub, "Compressing similar biological sequences using FM-index," in Data Compression Conference, 2014, pp. 312–321.
[27] A. Chacón, S. Marco-Sola, A. Espinosa, P. Ribeca, and J. C. Moure, "Boosting the FM-index on the GPU: Effective techniques to mitigate random memory access," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 5, pp. 1048–1059, 2015.
[28] D. Ho, J. Ding, S. Misra, N. Tatbul, V. Nathan, V. Md, and T. Kraska, "LISA: Towards learned DNA sequence search," in Workshop on Systems for ML at NeurIPS 2019, 2019.
[29] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in
ACM International Conference on Management of Data, 2018, pp. 489–504.
[30] J. Arram, T. Kaplan, W. Luk, and P. Jiang, "Leveraging FPGAs for accelerating short read alignment," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2017.
[31] Y. C. Wu, C. H. Chang, J. H. Hung, and C. H. Yang, "A 135-mW fully integrated data processor for next-generation sequencing," IEEE Transactions on Biomedical Circuits and Systems, 2017.
[32] M. A. Quail, M. Smith, P. Coupland, T. D. Otto, S. R. Harris, T. R. Connor, A. Bertoni, H. P. Swerdlow, and Y. Gu, "A tale of three next generation sequencing platforms: Comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers," BMC Genomics, vol. 13, no. 1, p. 341, 2012.
[33] J. R. Wang, J. Holt, L. McMillan, and C. D. Jones, "FMLRC: Hybrid long read error correction using an FM-index," BMC Bioinformatics, vol. 19, no. 1, p. 50, 2018.
[34] Y.-T. Huang and Y.-W. Huang, "An efficient error correction algorithm using FM-index," BMC Bioinformatics, vol. 18, no. 1, p. 524, 2017.
[35] H. Li, "Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly," Bioinformatics, vol. 28, no. 14, pp. 1838–1844, 2012.
[36] A. Chacón, J. C. Moure, A. Espinosa, and P. Hernández, "n-step FM-index for faster pattern matching," Procedia Computer Science, vol. 18, pp. 70–79, 2013.
[37] Y. Wang, X. Li, D. Zang, G. Tan, and N. Sun, "Accelerating FM-index search for genomic data processing," in International Conference on Parallel Processing, 2018.
[38] S. Angizi, J. Sun, W. Zhang, and D. Fan, "AlignS: A processing-in-memory accelerator for DNA short read alignment leveraging SOT-MRAM," in ACM/IEEE Design Automation Conference, 2019, pp. 1–6.
[39] S. Angizi, J. Sun, W. Zhang, and D. Fan, "PIM-Aligner: A processing-in-MRAM platform for biological sequence alignment," in Design, Automation & Test in Europe Conference & Exhibition
[41] CoRR, vol. abs/1706.05098, 2017.
[42] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
[43] H. Bilen and A. Vedaldi, "Integrated perception with recurrent multi-task neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 235–243.
[44] C. Stein, "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution," Stanford University, Tech. Rep., 1956.
[45] M. H. Kutner, C. J. Nachtsheim, J. Neter, W. Li et al., Applied Linear Statistical Models. McGraw-Hill Irwin, New York, 2005, vol. 5.
[46] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Advances in Neural Information Processing Systems, 2007, pp. 41–48.
[47] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "Tangram: Optimized coarse-grained dataflow for scalable NN accelerators," in ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 807–820.
[48] I. Okabayashi, H. Kotani, and H. Kadota, "A proposed structure of a 4-Mbit content-addressable and sorting memory," in Symposium on VLSI Circuits Digest of Technical Papers, 1990, pp. 109–110.
[49] G. Pekhimenko, V. Seshadri, O. Mutlu, M. A. Kozuch, P. B. Gibbons, and T. C. Mowry, "Base-delta-immediate compression: Practical data compression for on-chip caches," in IEEE International Conference on Parallel Architectures and Compilation Techniques, 2012.
[50] K. Bhardwaj, M. Havasi, Y. Yao, D. M. Brooks, J. M. H. Lobato, and G. Wei, "Determining optimal coherency interface for many-accelerator SoCs using Bayesian optimization," IEEE Computer Architecture Letters, vol. 18, no. 2, pp. 119–123, 2019.
[51] D. Giri, P. Mantovani, and L. P. Carloni, "Accelerators and coherence: An SoC perspective," IEEE Micro, vol. 38, no. 6, pp. 36–45, 2018.
[52] G. Ma and H. He, "Design and implementation of an advanced DMA controller on AMBA-based SoC," in IEEE International Conference on ASIC, 2009, pp. 419–422.
[53] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, "CACTI-IO: CACTI with off-chip power-area-timing models,"
IEEE Transactions on Very Large Scale Integration Systems, vol. 23, no. 7, pp. 1254–1267, 2014.
[54] K. Chandrasekar, C. Weis, Y. Li, B. Akesson, N. Wehn, and K. Goossens, "DRAMPower: Open-source DRAM power & energy estimation tool," 2012. [Online]. Available: https://github.com/tukl-msd/DRAMPower
[55] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing accelerators and SoC interfaces using gem5-Aladdin," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[56] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 469–480.
[57] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," IEEE Computer Architecture Letters, vol. 10, no. 1, pp. 16–19, 2011.
[58] NVIDIA, "NVIDIA Tesla P100," 2016. [Online]. Available: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
[59] J. Power, J. Hestness, M. Orr, M. Hill, and D. Wood, "gem5-gpu: A heterogeneous CPU-GPU simulator," IEEE Computer Architecture Letters, vol. 13, no. 1, Jan. 2014.
[60] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, and R. Durbin, "The sequence alignment/map format and SAMtools," Bioinformatics, vol. 25, no. 16, pp. 2078–2079, 2009.
[61] Y. Ono, K. Asai, and M. Hamada, "PBSIM: PacBio reads simulator—toward accurate genome assembly," Bioinformatics, vol. 29, no. 1, pp. 119–121, 2012.
[62] M. Schirmer, R. D'Amore, U. Z. Ijaz, N. Hall, and C. Quince, "Illumina error profiles: Resolving fine-scale variation in metagenomic sequencing data,"