A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs
AA Memory-Efficient FM-Index Constructor forNext-Generation Sequencing Applications onFPGAs
Nae-Chyun Chen, Yu-Cheng Li and Yi-Chang LuGraduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, 10617Email: [email protected], [email protected], [email protected]
Abstract —FM-index is an efficient data structure for stringsearch and is widely used in next-generation sequencing (NGS)applications such as sequence alignment and de novo assembly.Recently, FM-indexing is even performed down to the readlevel, raising a demand of an efficient algorithm for FM-indexconstruction. In this work, we propose a hardware-compatibleSelf-Aided Incremental Indexing (SAII) algorithm and its hard-ware architecture. This novel algorithm builds FM-index withno memory overhead, and the hardware system for realizingthe algorithm can be very compact. Parallel architecture and aspecial prefetch controller is designed to enhance computationalefficiency. An SAII-based FM-index constructor is implementedon an Altera Stratix V FPGA board. The presented constructorcan support DNA sequences of sizes up to 131,072-bp, whichis enough for small-scale references and reads obtained fromcurrent major platforms. Because the proposed constructor needsvery few hardware resource, it can be easily integrated intodifferent hardware accelerators designed for FM-index-basedapplications.
Keywords -next-generation sequencing, FM-index, Burrows-Wheeler transform, self-aided index construction
I. I
NTRODUCTION
Burrows-Wheeler transform (BWT) [1] is a string rearrange-ment algorithm first proposed for data compression. Makinguse of the properties between BWT and its original string,FM-index [2] is designed for efficient string searching. Fora certain string, the data structure of its FM-index containsthe BWT, the suffix array (SA) and two auxiliary tables ofthe target string. With its high efficiency in both time andmemory, FM-index is widely adopted by many next-generationsequencing (NGS) applications [3], [4], [5], [6].For FM-index-based sequence aligners such as BWA-backtrack [3] and Bowtie2 [5], the reference sequence isindexed for fast read-locating processes. For this kind ofaligners, because only the reference sequence needs to beindexed, the hardware accelerators ([7], [8]) usually buildthe FM-index externally using CPUs. Since the length of areference sequence can be as large as three billion base pairs(bps), the memory usage of naive index constructing methodis unaffordable. Therefore, how to reduce memory usage inFM-indexing becomes an important issue [9], [10].Recently, some de novo assemblers ([6], for example) applyFM-index for overlap finding of reads. Also, the referenceand reads are both indexed in new sequence aligners suchas BWA-SW [4]. For these applications, external construction of FM-index has much higher time overhead in comparisonwith traditional algorithms. Therefore, on-chip indexing be-comes necessary and important. Our previous research [11]has demonstrated an FM-index constructor with a lightweightiterative algorithm [10] with an ASIC, but the cost of theindexer is still high when compared to the work proposedhere.In this work, we propose a novel memory efficient FM-index construction algorithm, Self-Aided Incremental Indexing(SAII), which is suitable for hardware realization. This algo-rithm builds the FM-index incrementally. In each iteration, itutilizes a meta-index to construct the complete index. SinceSAII makes use of the FM-index itself for construction, it hasno memory overhead for FM-index-based applications. Onlyfew computational logic units are needed. In our hardwaresystem, the processing speed is accelerated by a specialprefetch mechanism and a parallel architecture. We chooseAltera Stratix V FPGA as the evaluation platform. The SAIIFM-index constructor is very compact in terms of logic usage,so it can be integrated with other functional blocks to form acomplete hardware pipeline in emerging NGS applications.II. B
ACKGROUND
A. Burrows-Wheeler Transform
To construct the BWT of target sequence X , a simplemethod is via the translation of suffix array ( SA ) with Eq. 1.Also, BWT can be obtained by collecting all characters in thelast column of sorted suffixes. Fig. 1 shows an example of theBWT of sequence X =ACGATTG$, where character $ is theend-of-string character. The lexical order is $ < A < C < G < T. BW T [ i ] = (cid:40) X [ SA [ i ] − if SA [ i ] > if SA [ i ] = 0 (1) B. FM-index and Backward Search Algorithm
FM-index is extended from BWT and suffix array, with twoauxiliary tables— C array and O table. The definitions for C array and O table are shown in Eq. 2 and 3, where n is thelength of target sequence X . C ( a ) ≡ size { ≤ j ≤ n − X [ j ] < a } (2) a r X i v : . [ c s . A R ] F e b BWT O table
10 2 7654370136254 5310 TGCA C arraySorted suffixes table SA Fig. 1. The FM-index of target string ACGCTTG$. This data structureincludes
SA, BW T, C array and O table. BW T is the last column of thesorted suffixes table. O ( a, i ) ≡ (cid:40) size { ≤ j ≤ i : BW T [ j ] = a } , i ≥ . , otherwise.(3)With C array and O table, the position of query aW canbe efficiently located within a lower bound R ( aW ) and upperbound R ( aW ) as shown in Eq. 4 and 5. R ( aW ) = C ( a ) + O ( a, R ( W ) −
1) + 1 (4) R ( aW ) = C ( a ) + O ( a, R ( W )) (5)In [2], Ferragina and Manzini have proved that R ( aW ) ≤ R ( aW ) if and only if aW is a substring of X . This searchingalgorithm starts from the end of the query sequence andextends iteratively. Therefore it is also called backward searchalgorithm.III. S ELF -A IDED I NCREMENTAL I NDEXING (SAII)A
LGORITHM
An example of backward search algorithm is shown inFig. 2. The initial values { R ( ∅ ) , R ( ∅ ) } are set at { , n − } .Here we discuss the mathematical insights of the lower boundin backward search algorithm. R ( aW ) is the sum of C ( a ) , O ( a, R ( W ) − and 1. C ( a ) records the total number ofcharacters lexically smaller than a in target sequence X . The O ( a, R ( W ) − term is the occurrence of aW in X . With anadditional offset, R ( aW ) represents the suffix array index oflexically smallest aW sequence. Similar concept can be usedto account for the upper bound. If aW only occurs once in X , R ( aW ) is equal to R ( aW ) . It is also of interest that whatwould happen if aW is not a substring of X . Since R ( aW ) measures the occurrences of lexically smaller substrings in X ,the lower bound guarantees the following inequality: suf f ix SA [ R ( aW ) − < aW < suf f ix SA [ R ( aW )] (6)SAII utilizes Eq. 6 to build FM-index incrementally. Fora target sequence X , if the FM-index of its substring L = a i a i +1 ...a n − $ has been constructed, the suffix array indexof the query string a i − L = a i − a i a i +1 ...a n − $ can bedetermined by calculating R ( a i − L ) . Since a i − L could notbe a substring of L , the suffix array index obtained is unique $CA A$ GGC ATC G$G TG C CT G TT T C BWT
SA index
C(T)=50+1
R(T)R(T)
O(T,7)=2O(T,0)=02 (a) $CA A$ GGC ATC G$G TG C CT G TT T C
BWT
SA index
C(C)=11+1
R(CT)
O(C,7)=2O(C,5)=12
R(CT) (b)Fig. 2. An example of searching query sequence CT on indexed target string X =ACGCTTG$. (a) and (b) show the first and second iteration respectively. and follows Eq. 6. Therefore, we can insert a i − into the FM-index of L , generating the new FM-index of a i − L withoutsorting the whole string all over again.The algorithm of SAII is shown in Alg. 1, and an exampleof a target sequence ACGCT$ is provided in Fig. 3. In thefirst iteration, the initial character is $ and the BWT is also $ .The corresponding O table and R are calculated. In the seconditeration, a new character T is added to the target sequence.Then we use O and C obtained in the previous iteration tocalculate the new R , and the updated $ is inserted to this R position to form a new BW T . With this updated
BW T , werecalculate O and C for this iteration. Then SAII is readyto move on to next iteration. After all the characters in thetarget sequence are read, the corresponding FM-index of thereference is correctly built.Because the construction process is entirely based on FM-index itself, nearly no extra computational resources areneeded and the memory overhead is zero for FM-index-based applications. The time complexity of SAII algorithmis O ( n k − ) , where k is completeness of the O table. Thedetails of k is given in Sec. IV-B. Algorithm 1
Self-Aided Incremental Indexing Algorithm
Require: target sequence X Ensure:
BW T , C array and O table Initialize C array and O table n ← length ( X ) BW T ← $ q ← for i from n − to do BW T [ q ] ← X [ i ] L ← X [ i : n − q ← C ( X [ i ]) + O ( X [ i ] , R ( L ) −
1) + 1 BW T ← BW T [0 : q −
1] + $ +
BW T [ q : n − Update C array with L Update O table with BW T end for
IV. H
ARDWARE I MPLEMENTATION AND D ISCUSSION
A. Overall Architecture
The hardware system includes a finite-state machine con-troller and a combinational computing logic. The finite-state /0/0/ C O R X T$CT$ 0+0+1=1GCT$ 1+0+1=2CGCT$ 0+0+1=1 T$
BWT
T$CTG$CT$GCCA/C/G/T A/C/G/T$ 0/ /0/10/0/0/1$T0/0/ /1 0/0/ /10/0/0/10/1/0/1$TC0/ /1/2 0/0/0/10/ /1/10/0/1/10/1/1/1$TCGinit: $ 0/ /0/0 init: $ t i m e li ne init: 0 /0/2/3 /0/0/10/0/0/10/0/1/10/1/1/10/2/1/1$TCGC TCTGCTTGCC UpdateSearch InsertUpdateSearch InsertUpdateSearch InsertUpdateSearch Insert
ACGCT$ 0+0+1=1 T$AGCC0/1/3/4 0/0/0/10/0/0/11/0/1/11/1/1/11/2/1/1$TCGC TAGCC
UpdateSearch Insert
Fig. 3. An example of the FM-indexing of a target sequence ACGCT$ withSAII algorithm. machine of the proposed hardware is shown in Fig. 4. Thereare four states in this system: Initial, Search, Update and Finishstates.Initial and Finish states control the initial and finishingconditions of the hardware. Search state computes the lowerbound R ( aW ) with Eq. 4. A two-stage parallel pop counter isused in Search state for fast computing. Update State updatesthe latest incoming symbol to the $ position in previousiteration. Also, it inserts the $ symbol to the updated indexbased on the position calculated in Search state. The wholeFM-index data structure including BW T , C array and O tableall need to be updated in this state. The finite-state machinerepeats between Search and Update states to construct the FM-index and moves to Finish state after the last character of thetarget sequence is processed. B. Data Structure
For DNA aligners, the size of alphabets ( Σ ) is five (in-cluding $). Since symbol $ occurs only once, in our hardwaresystem only A,C,G, and T are encoded. End-of-string character$ is encoded the same as the symbol, A, and an additionalspecial pointer is designed to store the position of $. Thisdesign uses only two bits for each character, which is only67% in comparison to that of naive 3-bit encoding. InitialSearchUpdate & InsertFinish if all inputs processed
Fig. 4. The finite-state machine controlling the hardware system.
The memory usage of the three main components of FM-index, C array, O table and BW T , are n , n log n and n , respectively. However, the length of genome sequence datais sometimes very large. The length of a human chromosomecan exceed 200 Mbp and the whole human genome is morethan 3 Gbp. With this scale of data, the n log n memory usageof a complete O table is very expensive in hardware systems.It should be noted that O table is a hash table obtained from BW T designed for fast computation of R and R . The correctbounds can still be calculated even without O table at the costof searching efficiency.In our hardware system, incomplete O table is used toachieve balance between memory usage and computing speed.An incomplete O table stores the occurrence values at every k characters. It is k times smaller than a complete table. Withan incomplete O table, the calculation of O ( a, i ) is split intotwo parts. First, O ( a, k (cid:4) ik (cid:5) ) is stored in the incomplete table.Second, the occurrences of a from k (cid:4) ik (cid:5) to i is calculatedwith a pop counter. In our hardware implementation, k isset at 2,048. The pop counter is designed with a two-stagearchitecture, in which the first stage has 32 parallel addersand the second stage has 64 parallel adders. The incomplete O table can be adjusted for different applications with simplemodification of parameters. C. Constructing BWT with Prefetch Mechanism
In our hardware system, Search and Update states are incharge of the construction of FM-index. To save computingresource,
BW T and O table are both segmented and stored inthe BRAM. Though this saves lots of area, it takes longer timeto update the BRAM due to the fixed word length. Therefore,how to make use of the limited bandwidth is very important.To address this issue, we design a prefetch mechanism thatsaves 50% time of BRAM updating. The timing diagramwithout prefetching is shown in Fig. 5(a). In Search state, R is calculated; in Update state, the incoming character isupdated to the FM-index as shown in Line 6 in Alg. 1; inInsert state, the $ is inserted to the FM-index, as shown inLine 9 in Alg. 1. In both Update and Insert states, BW T and O table in the BRAM have to be refreshed, so the processingtime is long.Prefetch mechanism is designed to reduce the runtime to Initialize Insert $ Update 2 character 1 character 2
S Insert $ Update 3 character 3
S Insert $ (a)
Insert $S’ Update 2Initialize character 1
Insert $S’ Update 2 Insert $S’ Update 2 Insert $S’ Update 2 S’ character 2 character 3 character 4 character 5 (b)Fig. 5. Timing diagrams for: (a) constructing BWT without prefetch mechanism, and (b)constructing BWT with prefetch mechanism. The blue boxes (S- and S’-boxes) denote Searchstate. The S’-boxes contains the additional monitor. N u m b e r o f c y c l e s theoreticalactual Fig. 6. Processing cycle count of SAII hardware system. refresh BRAM. As shown in Fig. 3, SAII algorithm firstreplaces the $ in BW T with the new character, calculates thenew insertion position of $ and inserts new $ to the BW T .With prefetch mechanism, the insertion of $ is combined withnext Update state and executed after the hardware systemsees the next character. This does not generate the completelycorrect FM-index yet because of its early update design.Therefore, the early updated position and character have totracked with an additional monitor to make sure the calculationof R is correct in the next iteration. In the last iteration, sincethere is no more character for prefetching, the final FM-indexis correct. Fig. 5(b) shows the timing diagram of the hardwaresystem with prefetching.Assume that one update takes q cycles and one backwardsearch takes m cycles ( q >> m ), prefetch mechanism reducesthe computing time from q + m to q + m for each iteration.The computing time is nearly half of the original design. Theexpected runtime ( T ) of the SAII hardware system is shownin Eq. 7, where m stands for search time and is set to 3 cyclesin our implementation. T = k nk (cid:88) i =1 ( m + i (7) D. Discussion
We implement our SAII FM-index constructor on an AlteraStratix V FPGA (5SGXEA7N2F45C2N) board. The hardwaresystem is synthesized using Altera Quartus (v.15.0) tool. Itonly uses 21,944 ALMs (9%) and 266,496 BRAMs ( < ◦ C). Sequences with differentlengths, from 16,384-bp to 131,072-bp, are tested and theresults are shown in Fig. 6. It takes about 21 ms to finishthe indexing of a 131,072-bp sequence. The runtime is veryclose to our theoretical estimation given in Eq. 7. Even forgenomes with several million base pairs, it is estimated thatour system can construct the FM-index in seconds. V. C
ONCLUSIONS
With many emerging applications based on FM-index, anefficient index construction algorithm is needed. Previous al-gorithms ([9], [10]) need additional working space to build theindex, raising the costs of hardware systems. In this paper, wepropose a novel hardware-compatible Self-Aided IncrementalIndexing (SAII) algorithm to construct FM-index with nomemory overhead. This algorithm is accelerated with a parallelpop counter and a special prefetch mechanism. Its realizationon FPGA needs very few hardware resources and can be easilyintegrated in different FM-index-based applications.A
CKNOWLEDGMENT
This work is supported by the Ministry of Science andTechnology, Taiwan, under Grant numbers MOST 105-2221-E-002-090 and 106-2221-E-002-055. Nae-Chyun Chen wouldlike to thank NOVATEK for providing fellowship.R
EFERENCES[1] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compres-sion algorithm,” 1994.[2] P. Ferragina and G. Manzini, “Opportunistic data structures with appli-cations,” in
Foundations of Computer Science, 2000. Proceedings. 41stAnnual Symposium on . IEEE, 2000, pp. 390–398.[3] H. Li and R. Durbin, “Fast and accurate short read alignment withburrows–wheeler transform,”
Bioinformatics , vol. 25, no. 14, pp. 1754–1760, 2009.[4] ——, “Fast and accurate long-read alignment with burrows–wheelertransform,”
Bioinformatics , vol. 26, no. 5, pp. 589–595, 2010.[5] B. Langmead and S. L. Salzberg, “Fast gapped-read alignment withbowtie 2,”
Nature methods , vol. 9, no. 4, pp. 357–359, 2012.[6] J. T. Simpson and R. Durbin, “Efficient construction of an assemblystring graph using the fm-index,”
Bioinformatics , vol. 26, no. 12, pp.i367–i373, 2010.[7] C. B. Olson, M. Kim, C. Clauson, B. Kogon, C. Ebeling, S. Hauck, andW. L. Ruzzo, “Hardware acceleration of short read mapping,” in
Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20thAnnual International Symposium on . IEEE, 2012, pp. 161–168.[8] H. M. Waidyasooriya and M. Hariyama, “Hardware-acceleration ofshort-read alignment based on the burrows-wheeler transform,”
IEEETransactions on Parallel and Distributed Systems , vol. 27, no. 5, pp.1358–1372, 2016.[9] D. Okanohara and K. Sadakane, “A linear-time burrows-wheeler trans-form using induced sorting.” in
SPIRE , vol. 5721. Springer, 2009, pp.90–101.[10] P. Ferragina, T. Gagie, and G. Manzini, “Lightweight data indexing andcompression in external memory,”
Algorithmica , vol. 63, no. 3, pp. 707–730, 2012.[11] N.-C. Chen, T.-Y. Chiu, Y.-C. Li, Y.-C. Chien, and Y.-C. Lu, “Powerefficient special processor design for burrows-wheeler-transform-basedshort read sequence alignment,” in