[PDF] A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs

Abstract

FM-index is an efficient data structure for string search and is widely used in next-generation sequencing (NGS) applications such as sequence alignment and de novo assembly. Recently, FM-indexing is even performed down to the read level, raising a demand of an efficient algorithm for FM-index construction. In this work, we propose a hardware-compatible Self-Aided Incremental Indexing (SAII) algorithm and its hardware architecture. This novel algorithm builds FM-index with no memory overhead, and the hardware system for realizing the algorithm can be very compact. Parallel architecture and a special prefetch controller is designed to enhance computational efficiency. An SAII-based FM-index constructor is implemented on an Altera Stratix V FPGA board. The presented constructor can support DNA sequences of sizes up to 131,072-bp, which is enough for small-scale references and reads obtained from current major platforms. Because the proposed constructor needs very few hardware resource, it can be easily integrated into different hardware accelerators designed for FM-index-based applications.

Full PDF

AA Memory-Efﬁcient FM-Index Constructor forNext-Generation Sequencing Applications onFPGAs

Nae-Chyun Chen, Yu-Cheng Li and Yi-Chang LuGraduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, 10617Email: [email protected], [email protected], [email protected]

Abstract —FM-index is an efﬁcient data structure for stringsearch and is widely used in next-generation sequencing (NGS)applications such as sequence alignment and de novo assembly.Recently, FM-indexing is even performed down to the readlevel, raising a demand of an efﬁcient algorithm for FM-indexconstruction. In this work, we propose a hardware-compatibleSelf-Aided Incremental Indexing (SAII) algorithm and its hard-ware architecture. This novel algorithm builds FM-index withno memory overhead, and the hardware system for realizingthe algorithm can be very compact. Parallel architecture and aspecial prefetch controller is designed to enhance computationalefﬁciency. An SAII-based FM-index constructor is implementedon an Altera Stratix V FPGA board. The presented constructorcan support DNA sequences of sizes up to 131,072-bp, whichis enough for small-scale references and reads obtained fromcurrent major platforms. Because the proposed constructor needsvery few hardware resource, it can be easily integrated intodifferent hardware accelerators designed for FM-index-basedapplications.

Keywords -next-generation sequencing, FM-index, Burrows-Wheeler transform, self-aided index construction

I. I

NTRODUCTION

Burrows-Wheeler transform (BWT) [1] is a string rearrange-ment algorithm ﬁrst proposed for data compression. Makinguse of the properties between BWT and its original string,FM-index [2] is designed for efﬁcient string searching. Fora certain string, the data structure of its FM-index containsthe BWT, the sufﬁx array (SA) and two auxiliary tables ofthe target string. With its high efﬁciency in both time andmemory, FM-index is widely adopted by many next-generationsequencing (NGS) applications [3], [4], [5], [6].For FM-index-based sequence aligners such as BWA-backtrack [3] and Bowtie2 [5], the reference sequence isindexed for fast read-locating processes. For this kind ofaligners, because only the reference sequence needs to beindexed, the hardware accelerators ([7], [8]) usually buildthe FM-index externally using CPUs. Since the length of areference sequence can be as large as three billion base pairs(bps), the memory usage of naive index constructing methodis unaffordable. Therefore, how to reduce memory usage inFM-indexing becomes an important issue [9], [10].Recently, some de novo assemblers ([6], for example) applyFM-index for overlap ﬁnding of reads. Also, the referenceand reads are both indexed in new sequence aligners suchas BWA-SW [4]. For these applications, external construction of FM-index has much higher time overhead in comparisonwith traditional algorithms. Therefore, on-chip indexing be-comes necessary and important. Our previous research [11]has demonstrated an FM-index constructor with a lightweightiterative algorithm [10] with an ASIC, but the cost of theindexer is still high when compared to the work proposedhere.In this work, we propose a novel memory efﬁcient FM-index construction algorithm, Self-Aided Incremental Indexing(SAII), which is suitable for hardware realization. This algo-rithm builds the FM-index incrementally. In each iteration, itutilizes a meta-index to construct the complete index. SinceSAII makes use of the FM-index itself for construction, it hasno memory overhead for FM-index-based applications. Onlyfew computational logic units are needed. In our hardwaresystem, the processing speed is accelerated by a specialprefetch mechanism and a parallel architecture. We chooseAltera Stratix V FPGA as the evaluation platform. The SAIIFM-index constructor is very compact in terms of logic usage,so it can be integrated with other functional blocks to form acomplete hardware pipeline in emerging NGS applications.II. B

ACKGROUND

A. Burrows-Wheeler Transform

To construct the BWT of target sequence X , a simplemethod is via the translation of sufﬁx array ( SA ) with Eq. 1.Also, BWT can be obtained by collecting all characters in thelast column of sorted sufﬁxes. Fig. 1 shows an example of theBWT of sequence X =ACGATTG$, where character $ is theend-of-string character. The lexical order is $ < A < C < G < T. BW T [ i ] = (cid:40) X [ SA [ i ] − if SA [ i ] > if SA [ i ] = 0 (1) B. FM-index and Backward Search Algorithm

FM-index is extended from BWT and sufﬁx array, with twoauxiliary tables— C array and O table. The deﬁnitions for C array and O table are shown in Eq. 2 and 3, where n is thelength of target sequence X . C ( a ) ≡ size { ≤ j ≤ n − X [ j ] < a } (2) a r X i v : . [ c s . A R ] F e b BWT O table

10 2 7654370136254 5310 TGCA C arraySorted sufﬁxes table SA Fig. 1. The FM-index of target string ACGCTTG$. This data structureincludes

SA, BW T, C array and O table. BW T is the last column of thesorted sufﬁxes table. O ( a, i ) ≡ (cid:40) size { ≤ j ≤ i : BW T [ j ] = a } , i ≥ . , otherwise.(3)With C array and O table, the position of query aW canbe efﬁciently located within a lower bound R ( aW ) and upperbound R ( aW ) as shown in Eq. 4 and 5. R ( aW ) = C ( a ) + O ( a, R ( W ) −

1) + 1 (4) R ( aW ) = C ( a ) + O ( a, R ( W )) (5)In [2], Ferragina and Manzini have proved that R ( aW ) ≤ R ( aW ) if and only if aW is a substring of X . This searchingalgorithm starts from the end of the query sequence andextends iteratively. Therefore it is also called backward searchalgorithm.III. S ELF -A IDED I NCREMENTAL I NDEXING (SAII)A

LGORITHM

An example of backward search algorithm is shown inFig. 2. The initial values { R ( ∅ ) , R ( ∅ ) } are set at { , n − } .Here we discuss the mathematical insights of the lower boundin backward search algorithm. R ( aW ) is the sum of C ( a ) , O ( a, R ( W ) − and 1. C ( a ) records the total number ofcharacters lexically smaller than a in target sequence X . The O ( a, R ( W ) − term is the occurrence of aW in X . With anadditional offset, R ( aW ) represents the sufﬁx array index oflexically smallest aW sequence. Similar concept can be usedto account for the upper bound. If aW only occurs once in X , R ( aW ) is equal to R ( aW ) . It is also of interest that whatwould happen if aW is not a substring of X . Since R ( aW ) measures the occurrences of lexically smaller substrings in X ,the lower bound guarantees the following inequality: suf f ix SA [ R ( aW ) − < aW < suf f ix SA [ R ( aW )] (6)SAII utilizes Eq. 6 to build FM-index incrementally. Fora target sequence X , if the FM-index of its substring L = a i a i +1 ...a n − $ has been constructed, the sufﬁx array indexof the query string a i − L = a i − a i a i +1 ...a n − $ can bedetermined by calculating R ( a i − L ) . Since a i − L could notbe a substring of L , the sufﬁx array index obtained is unique $CA A$ GGC ATC G$G TG C CT G TT T C BWT

SA index

C(T)=50+1

R(T)R(T)

O(T,7)=2O(T,0)=02 (a) $CA A$ GGC ATC G$G TG C CT G TT T C

BWT

SA index

C(C)=11+1

R(CT)

O(C,7)=2O(C,5)=12

R(CT) (b)Fig. 2. An example of searching query sequence CT on indexed target string X =ACGCTTG$. (a) and (b) show the ﬁrst and second iteration respectively. and follows Eq. 6. Therefore, we can insert a i − into the FM-index of L , generating the new FM-index of a i − L withoutsorting the whole string all over again.The algorithm of SAII is shown in Alg. 1, and an exampleof a target sequence ACGCT$ is provided in Fig. 3. In theﬁrst iteration, the initial character is $ and the BWT is also $ .The corresponding O table and R are calculated. In the seconditeration, a new character T is added to the target sequence.Then we use O and C obtained in the previous iteration tocalculate the new R , and the updated $ is inserted to this R position to form a new BW T . With this updated

BW T , werecalculate O and C for this iteration. Then SAII is readyto move on to next iteration. After all the characters in thetarget sequence are read, the corresponding FM-index of thereference is correctly built.Because the construction process is entirely based on FM-index itself, nearly no extra computational resources areneeded and the memory overhead is zero for FM-index-based applications. The time complexity of SAII algorithmis O ( n k − ) , where k is completeness of the O table. Thedetails of k is given in Sec. IV-B. Algorithm 1

Self-Aided Incremental Indexing Algorithm

Require: target sequence X Ensure:

BW T , C array and O table Initialize C array and O table n ← length ( X ) BW T ← $ q ← for i from n − to do BW T [ q ] ← X [ i ] L ← X [ i : n − q ← C ( X [ i ]) + O ( X [ i ] , R ( L ) −

1) + 1 BW T ← BW T [0 : q −

1] + $ +

BW T [ q : n − Update C array with L Update O table with BW T end for

IV. H

ARDWARE I MPLEMENTATION AND D ISCUSSION

A. Overall Architecture

The hardware system includes a ﬁnite-state machine con-troller and a combinational computing logic. The ﬁnite-state /0/0/ C O R X T$CT$ 0+0+1=1GCT$ 1+0+1=2CGCT$ 0+0+1=1 T$

BWT

T$CTG$CT$GCCA/C/G/T A/C/G/T$ 0/ /0/10/0/0/1$T0/0/ /1 0/0/ /10/0/0/10/1/0/1$TC0/ /1/2 0/0/0/10/ /1/10/0/1/10/1/1/1$TCGinit: $ 0/ /0/0 init: $ t i m e li ne init: 0 /0/2/3 /0/0/10/0/0/10/0/1/10/1/1/10/2/1/1$TCGC TCTGCTTGCC UpdateSearch InsertUpdateSearch InsertUpdateSearch InsertUpdateSearch Insert

ACGCT$ 0+0+1=1 T$AGCC0/1/3/4 0/0/0/10/0/0/11/0/1/11/1/1/11/2/1/1$TCGC TAGCC

UpdateSearch Insert

Fig. 3. An example of the FM-indexing of a target sequence ACGCT$ withSAII algorithm. machine of the proposed hardware is shown in Fig. 4. Thereare four states in this system: Initial, Search, Update and Finishstates.Initial and Finish states control the initial and ﬁnishingconditions of the hardware. Search state computes the lowerbound R ( aW ) with Eq. 4. A two-stage parallel pop counter isused in Search state for fast computing. Update State updatesthe latest incoming symbol to the $ position in previousiteration. Also, it inserts the $ symbol to the updated indexbased on the position calculated in Search state. The wholeFM-index data structure including BW T , C array and O tableall need to be updated in this state. The ﬁnite-state machinerepeats between Search and Update states to construct the FM-index and moves to Finish state after the last character of thetarget sequence is processed. B. Data Structure

For DNA aligners, the size of alphabets ( Σ ) is ﬁve (in-cluding $). Since symbol $ occurs only once, in our hardwaresystem only A,C,G, and T are encoded. End-of-string character$ is encoded the same as the symbol, A, and an additionalspecial pointer is designed to store the position of $. Thisdesign uses only two bits for each character, which is only67% in comparison to that of naive 3-bit encoding. InitialSearchUpdate & InsertFinish if all inputs processed

Fig. 4. The ﬁnite-state machine controlling the hardware system.

The memory usage of the three main components of FM-index, C array, O table and BW T , are n , n log n and n , respectively. However, the length of genome sequence datais sometimes very large. The length of a human chromosomecan exceed 200 Mbp and the whole human genome is morethan 3 Gbp. With this scale of data, the n log n memory usageof a complete O table is very expensive in hardware systems.It should be noted that O table is a hash table obtained from BW T designed for fast computation of R and R . The correctbounds can still be calculated even without O table at the costof searching efﬁciency.In our hardware system, incomplete O table is used toachieve balance between memory usage and computing speed.An incomplete O table stores the occurrence values at every k characters. It is k times smaller than a complete table. Withan incomplete O table, the calculation of O ( a, i ) is split intotwo parts. First, O ( a, k (cid:4) ik (cid:5) ) is stored in the incomplete table.Second, the occurrences of a from k (cid:4) ik (cid:5) to i is calculatedwith a pop counter. In our hardware implementation, k isset at 2,048. The pop counter is designed with a two-stagearchitecture, in which the ﬁrst stage has 32 parallel addersand the second stage has 64 parallel adders. The incomplete O table can be adjusted for different applications with simplemodiﬁcation of parameters. C. Constructing BWT with Prefetch Mechanism

In our hardware system, Search and Update states are incharge of the construction of FM-index. To save computingresource,

BW T and O table are both segmented and stored inthe BRAM. Though this saves lots of area, it takes longer timeto update the BRAM due to the ﬁxed word length. Therefore,how to make use of the limited bandwidth is very important.To address this issue, we design a prefetch mechanism thatsaves 50% time of BRAM updating. The timing diagramwithout prefetching is shown in Fig. 5(a). In Search state, R is calculated; in Update state, the incoming character isupdated to the FM-index as shown in Line 6 in Alg. 1; inInsert state, the $ is inserted to the FM-index, as shown inLine 9 in Alg. 1. In both Update and Insert states, BW T and O table in the BRAM have to be refreshed, so the processingtime is long.Prefetch mechanism is designed to reduce the runtime to Initialize Insert $ Update 2 character 1 character 2

S Insert $ Update 3 character 3

S Insert $ (a)

Insert $S’ Update 2Initialize character 1

Insert $S’ Update 2 Insert $S’ Update 2 Insert $S’ Update 2 S’ character 2 character 3 character 4 character 5 (b)Fig. 5. Timing diagrams for: (a) constructing BWT without prefetch mechanism, and (b)constructing BWT with prefetch mechanism. The blue boxes (S- and S’-boxes) denote Searchstate. The S’-boxes contains the additional monitor. N u m b e r o f c y c l e s theoreticalactual Fig. 6. Processing cycle count of SAII hardware system. refresh BRAM. As shown in Fig. 3, SAII algorithm ﬁrstreplaces the $ in BW T with the new character, calculates thenew insertion position of $ and inserts new $ to the BW T .With prefetch mechanism, the insertion of $ is combined withnext Update state and executed after the hardware systemsees the next character. This does not generate the completelycorrect FM-index yet because of its early update design.Therefore, the early updated position and character have totracked with an additional monitor to make sure the calculationof R is correct in the next iteration. In the last iteration, sincethere is no more character for prefetching, the ﬁnal FM-indexis correct. Fig. 5(b) shows the timing diagram of the hardwaresystem with prefetching.Assume that one update takes q cycles and one backwardsearch takes m cycles ( q >> m ), prefetch mechanism reducesthe computing time from q + m to q + m for each iteration.The computing time is nearly half of the original design. Theexpected runtime ( T ) of the SAII hardware system is shownin Eq. 7, where m stands for search time and is set to 3 cyclesin our implementation. T = k nk (cid:88) i =1 ( m + i (7) D. Discussion

We implement our SAII FM-index constructor on an AlteraStratix V FPGA (5SGXEA7N2F45C2N) board. The hardwaresystem is synthesized using Altera Quartus (v.15.0) tool. Itonly uses 21,944 ALMs (9%) and 266,496 BRAMs ( < ◦ C). Sequences with differentlengths, from 16,384-bp to 131,072-bp, are tested and theresults are shown in Fig. 6. It takes about 21 ms to ﬁnishthe indexing of a 131,072-bp sequence. The runtime is veryclose to our theoretical estimation given in Eq. 7. Even forgenomes with several million base pairs, it is estimated thatour system can construct the FM-index in seconds. V. C

ONCLUSIONS

With many emerging applications based on FM-index, anefﬁcient index construction algorithm is needed. Previous al-gorithms ([9], [10]) need additional working space to build theindex, raising the costs of hardware systems. In this paper, wepropose a novel hardware-compatible Self-Aided IncrementalIndexing (SAII) algorithm to construct FM-index with nomemory overhead. This algorithm is accelerated with a parallelpop counter and a special prefetch mechanism. Its realizationon FPGA needs very few hardware resources and can be easilyintegrated in different FM-index-based applications.A

CKNOWLEDGMENT

This work is supported by the Ministry of Science andTechnology, Taiwan, under Grant numbers MOST 105-2221-E-002-090 and 106-2221-E-002-055. Nae-Chyun Chen wouldlike to thank NOVATEK for providing fellowship.R

EFERENCES[1] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compres-sion algorithm,” 1994.[2] P. Ferragina and G. Manzini, “Opportunistic data structures with appli-cations,” in

Foundations of Computer Science, 2000. Proceedings. 41stAnnual Symposium on . IEEE, 2000, pp. 390–398.[3] H. Li and R. Durbin, “Fast and accurate short read alignment withburrows–wheeler transform,”

Bioinformatics , vol. 25, no. 14, pp. 1754–1760, 2009.[4] ——, “Fast and accurate long-read alignment with burrows–wheelertransform,”

Bioinformatics , vol. 26, no. 5, pp. 589–595, 2010.[5] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment withbowtie 2,”

Nature methods , vol. 9, no. 4, pp. 357–359, 2012.[6] J. T. Simpson and R. Durbin, “Efﬁcient construction of an assemblystring graph using the fm-index,”

Bioinformatics , vol. 26, no. 12, pp.i367–i373, 2010.[7] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, andW. L. Ruzzo, “Hardware acceleration of short read mapping,” in

Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20thAnnual International Symposium on . IEEE, 2012, pp. 161–168.[8] H. M. Waidyasooriya and M. Hariyama, “Hardware-acceleration ofshort-read alignment based on the burrows-wheeler transform,”

IEEETransactions on Parallel and Distributed Systems , vol. 27, no. 5, pp.1358–1372, 2016.[9] D. Okanohara and K. Sadakane, “A linear-time burrows-wheeler trans-form using induced sorting.” in

SPIRE , vol. 5721. Springer, 2009, pp.90–101.[10] P. Ferragina, T. Gagie, and G. Manzini, “Lightweight data indexing andcompression in external memory,”

Algorithmica , vol. 63, no. 3, pp. 707–730, 2012.[11] N.-C. Chen, T.-Y. Chiu, Y.-C. Li, Y.-C. Chien, and Y.-C. Lu, “Powerefﬁcient special processor design for burrows-wheeler-transform-basedshort read sequence alignment,” in