A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform
Mario Randazzo, DMI, University of Palermo
Simona E. Rombo, DMI, University of Palermo
Abstract
Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics" data have to be collected and analyzed daily in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for the computation of the Burrows-Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation, and not only the input dataset, allowing it to fully benefit from the available cloud resources.
Introduction

Precision Medicine aims to design individualized strategies for diagnostic or therapeutic decision-making, based on both genotypic and phenotypic information. It allows scientists and clinicians to understand which therapeutic and preventive approaches to a specific illness can work effectively in subgroups of patients, based on their genetic make-up, lifestyle, and environmental factors [15]. The diffusion of high-throughput assays, such as next-generation sequencing (NGS) and mass spectrometry (MS), has led to a fast accumulation of sequences and other omics data which can be used to enable Precision Medicine in practice. As an example, specific disease biomarkers may be identified by cleaning up raw data generated by NGS or MS, and then experimentally validated in the laboratory.

An important problem in this context is the indexing of NGS data [8]. In particular, an index is a data structure that enables efficient retrieval of stored objects. Indexing strategies used in NGS allow space-efficient storage of biological sequences in a full-text index that enables fast querying, in order to return exact or approximate string matches. Popular full-text index data structures include variants of suffix arrays [1], the FM-index, based on the Burrows-Wheeler transform (BWT) and some auxiliary tables [4], and hash tables [9]. The choice of a specific index structure is often a trade-off between query speed and memory consumption. For example, hash tables can be very fast, but their memory footprint is sometimes prohibitive for large string collections [14].

Here we address the problem of computing the BWT in a distributed setting, exploiting Big Data technologies such as Apache Spark [16]. In particular, previous research has proposed the BWT computation in a MapReduce [3] fashion based on Apache Hadoop [12]. The combined use of Spark and Hadoop, as proposed here, has been shown to notably improve performance in several application contexts, due to the optimal exploitation of both memory and cloud resources.
Another available tool relying on Hadoop and BWT computation is BigBWA [2]. However, the BigBWA parallelism is intended only to split the input sequences and then apply another existing framework, i.e., BWA [10], in order to align them via BWT. Therefore, in BigBWA the BWT computation itself is not based on Big Data technologies.

We propose an algorithm for BWT computation that fully exploits the parallelism afforded by a cloud computing environment, combining the advantages of the MapReduce paradigm and of Spark Resilient Distributed Datasets (RDD). The presented strategy is based on the distributed computation of Suffix Arrays, revisiting the idea of prefix doubling presented in [6]. Validation results obtained on real biological datasets, including genomic and proteomic data, are provided, showing that our approach improves the performance of BWT computation with respect to its competitors.
All rotations of S    Lexicographic sorting
BANANA$               ANANA$B
ANANA$B               ANA$BAN
NANA$BA               A$BANAN
ANA$BAN               BANANA$
NA$BANA               NANA$BA
A$BANAN               NA$BANA
$BANANA               $BANANA

Figure 1: Example of BWT.

i   Suffixes   Sorted Suffixes   SA[i]
1   BANANA$    ANANA$            2
2   ANANA$     ANA$              4
3   NANA$      A$                6
4   ANA$       BANANA$           1
5   NA$        NANA$             3
6   A$         NA$               5
7   $          $                 7

Figure 2: Example of Suffix Array.
Let S be a string of n characters defined on the alphabet Σ. We denote by S(i) the i-th character in S and by S_i its i-th suffix. We recall the following basic notions.
The Burrows-Wheeler transform of S is useful in order to rearrange it into runs of similar characters. This may have advantages both for indexing and for compressing S more efficiently. The BWT applied to S returns:

• a permutation bwt(S) of S, obtained by sorting all its circular shifts in lexicographic order, and then extracting the last column;

• the index (0-based) I of the row containing the original string S.

Among the most important properties of the BWT is that it is reversible. Figure 1 shows an example of BWT for the string S = BANANA$. In particular, bwt(S) = BNN$AAA, and I = 3.
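The transform just described can be reproduced with a short sequential sketch (our illustration, not the distributed algorithm presented in this paper; the function name is ours, and the sentinel $ is assumed to sort after every other character, as in the example above):

```python
def bwt_by_rotations(s, sentinel='$'):
    """Burrows-Wheeler transform via explicit rotation sorting.

    Sequential reference sketch only; the sentinel is assumed to sort
    after every other character, matching the BANANA$ example.
    """
    # Map the sentinel to a high code point so that it sorts last.
    key = lambda rot: rot.replace(sentinel, '\x7f')
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))), key=key)
    last_column = ''.join(rot[-1] for rot in rotations)
    return last_column, rotations.index(s)
```

On the running example, `bwt_by_rotations("BANANA$")` returns `("BNN$AAA", 3)`, matching Figure 1.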
The suffix array SA of S is defined as an array of integers providing the starting positions of the suffixes of S in lexicographical order. Therefore, an entry SA[i] contains the starting position of the i-th suffix in S among those in lexicographic order. Figure 2 shows the Suffix Array for the same example of BWT.
The Inverse Suffix Array ISA of S is defined so that ISA[i] = j means that the rank of the suffix i is j, i.e., SA[j] = i.

The most crucial aspect of the BWT computation considered here is the calculation of the Suffix Array of the input string. Indeed, the BWT can be calculated from the Suffix Array in a MapReduce fashion via a join operation. Therefore, the algorithm proposed for the computation of the Suffix Array, based on the idea of prefix doubling inspired by [6], is described below.
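The three structures are closely related; the following sequential sketch (our illustration, with 1-based positions and the sentinel sorted last, as in Figure 2) also shows how the BWT can be read off the Suffix Array, since the BWT character of rank j is the character preceding the suffix starting at position SA[j]:

```python
def suffix_array(s, sentinel='$'):
    # 1-based starting positions of the lexicographically sorted suffixes
    # (the sentinel is treated as the largest character, as in Figure 2).
    key = lambda i: s[i - 1:].replace(sentinel, '\x7f')
    return sorted(range(1, len(s) + 1), key=key)

def inverse_suffix_array(sa):
    # ISA[i] = j  <=>  SA[j] = i (returned as a list indexed by position).
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa, start=1):
        isa[pos - 1] = rank
    return isa

def bwt_from_sa(s, sa):
    # The BWT character of rank j is the character preceding suffix SA[j];
    # when SA[j] = 1, Python's negative index wraps around to the last one.
    return ''.join(s[sa_j - 2] for sa_j in sa)
```

For S = BANANA$ this yields SA = [2, 4, 6, 1, 3, 5, 7] as in Figure 2, and `bwt_from_sa` recovers BNN$AAA.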
Input: Let S be a string of length n; the input tuple set is:

Input = {(null, S(i)) : i = 1, ..., n}

In the following, we assume that the input string S ends with a sentinel character $. This is typically used in the inverse transformation; however, it is unimportant for the purpose of the algorithm.

Output:
A set of tuples of the form (i, r), where i is the index of a suffix in S and r is its rank (i.e., its position in the list of the sorted suffixes). In the literature this is referred to as the ISA. For our purpose, the resulting output is inverted in order to obtain the Suffix Array of S and, then, its BWT.

Initialization:
The first step starts from the Input set and initializes the set of tuples (i, r), as described in the previous paragraph. In this phase, the rank is calculated from the first character of the suffix. In particular, let Occ(c) be the number of occurrences of characters lexicographically smaller than c in the string S; then the rank of the suffix i can be determined as Occ(S(i)).

In a MapReduce fashion, this can be accomplished by first counting the occurrences of each character in S, and then computing the cumulative sum Occ on the sorted counts. The map and reduce steps are:

map: (null, S(i)) → (S(i), 1)
reduce: (c, list[1, 1, ..., 1]) → (c, sum of ones)

From this, Occ is calculated locally by collecting the result. The ISA set can then be initialized with the following map step:

map: (null, S(i)) → (i, Occ(S(i)))
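The initialization just described can be sketched sequentially as follows (our Python illustration of the counting and cumulative-sum steps; in the actual algorithm the counting is a distributed map/reduce and only Occ is collected locally; 0-based indices and the sentinel-last ordering of the worked example are assumed):

```python
from collections import Counter
from itertools import accumulate

def initial_isa(s, sentinel='$'):
    """Rank each suffix i by Occ(S(i)), the number of characters in S
    strictly smaller than S(i).  Sequential sketch of the map/reduce
    counting step described above."""
    counts = Counter(s)                                  # per-character totals
    alphabet = sorted(counts, key=lambda c: c.replace(sentinel, '\x7f'))
    sizes = [counts[c] for c in alphabet]
    cumulative = [0] + list(accumulate(sizes))[:-1]      # exclusive prefix sum
    occ = dict(zip(alphabet, cumulative))
    return {i: occ[c] for i, c in enumerate(s)}          # (i, Occ(S(i)))
```

For S = BANANA$ this gives the initial ISA {(0, 3), (1, 0), (2, 4), (3, 0), (4, 4), (5, 0), (6, 6)} used in the worked example below.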
ISA Extending: The next step is to extend each rank contained in the initialized ISA to the whole suffix. Here we use a technique called Prefix Doubling, which can be summarized as follows: given that the suffixes of a string are already sorted by their prefix of length h, we can deduce their ordering by their prefix of length 2h. Indeed, given two suffixes S_i and S_j with an identical prefix of length h, we can deduce their relative order by comparing the order of the suffixes S_{i+h} and S_{j+h}. Thus the idea is to pair, for each suffix S_i, its rank with the rank of the suffix S_{i+h} (i.e., (ISA[i], ISA[i+h])) and to sort all these pairs in order to obtain the sorting by the prefix of length 2h. Each iteration doubles the prefix length; since the longest suffix has size n, all suffixes will be sorted after at most log₂(n) iterations.

Shifting and Pairing:
To implement the above idea in a MapReduce fashion, we apply two map steps to the latest computed ISA, obtaining two different sets:

map: (i, r) → (i, (r, 0))
map: (i, r) → (i − 2^k, (−r, 0))

where k is the number of iterations minus one (so that the shift amount is h = 2^k). The indices of the ranks are shifted in this way, and then the ranks are paired by a reduce step. It is worth noticing that a negative number is used to denote a shifted rank, and that each value is mapped as a tuple with a zero term, in order to handle the shifted ranks that overflow the string length.

The union of the two obtained sets is considered, and all tuples with a negative key are discarded (the corresponding ranks do not pair with any other rank in the set). The following reduce step is applied to the union:

reduce: (i, list[(r, 0), (r', 0)]) → (i, (r, −r'))

where r' is the shifted (negative) rank. Some ranks may not be reduced, due to their key being unique; these ranks overflow the length of S and remain paired with zero. We denote the final set derived from this phase by Pairs.

c        A   B   N   $
Occ(c)   0   3   4   6

Table 1: Occ(c) for the example string S = BANANA$.
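Ignoring the negative-key encoding, which only serves the distributed join, the net effect of the shifting and pairing phase can be sketched as follows (our illustration; h denotes the current shift 2^k):

```python
def shift_and_pair(isa, h):
    """Pair each suffix's rank with the rank of the suffix h positions
    ahead (one prefix-doubling step).  Sequential sketch of the two map
    steps and the reduce described above: ranks whose shifted index falls
    outside the string stay paired with zero."""
    pairs = {}
    for i, r in isa.items():
        ahead = isa.get(i + h, 0)   # overflowing ranks pair with 0
        pairs[i] = (r, ahead)
    return pairs
```

On the initial ISA of the worked example with h = 1, this produces exactly the Pairs set shown below.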
Our purpose is to extend the previous rank to a new rank, obtained by considering the doubled prefix. Therefore, we compute the new rank according to the tuples in Pairs as follows: first we sort all tuples by value, then we compare each tuple at position i (after sorting) with the one at position i − 1. If they are equal, the new rank is equal to the rank of the previous tuple; otherwise the new rank is i. Finally, a new ISA set with extended ranks is obtained, and the procedure is iterated on it again. All the operations described above can also be achieved in a distributed manner:

• For the sorting operation, a certain number of partitions can be identified by splitting the elements of the set into roughly equal ranges (the ranges can be determined by sampling the data). Then, for each partition, a sorting algorithm is applied that sorts the partition locally. This is provided out of the box by the Apache Spark framework.

• In order to compute the new rank, the partitions identified previously are considered and the procedure above is applied locally, as described before, using the length of the partition and its offset (i.e., the number of elements in the previous partitions) for computing the positions of the tuples.
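The re-ranking step, together with the overall prefix-doubling loop, can be sketched sequentially as follows (our single-machine illustration; in the distributed algorithm the sort is partitioned by sampled ranges, and the comparison with the previous tuple uses the partition lengths and offsets):

```python
def rerank(pairs):
    # Sort the (index, pair) items by pair value; a pair equal to its
    # predecessor reuses the predecessor's rank, otherwise the new rank
    # is the sorted position (sequential sketch of the partitioned sort).
    ordered = sorted(pairs.items(), key=lambda kv: kv[1])
    isa, rank = {}, 0
    for pos, (i, value) in enumerate(ordered):
        if pos > 0 and value != ordered[pos - 1][1]:
            rank = pos
        isa[i] = rank
    return isa

def prefix_doubling_isa(isa):
    # Repeat pair-and-rerank with doubling shift h until ranks are unique.
    h = 1
    while len(set(isa.values())) < len(isa):
        pairs = {i: (r, isa.get(i + h, 0)) for i, r in isa.items()}
        isa = rerank(pairs)
        h *= 2
    return isa
```

Starting from the initial ISA of the running example, {(0, 3), (1, 0), (2, 4), (3, 0), (4, 4), (5, 0), (6, 6)}, the loop converges in two iterations to {(0, 3), (1, 0), (2, 4), (3, 1), (4, 5), (5, 2), (6, 6)}; inverting it yields the (0-based) Suffix Array [1, 3, 5, 0, 2, 4, 6], from which the BWT follows.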
Let S = BANANA$ be the input string of length n = 7. The input pairs are:

Input = {(null, B), (null, A), (null, N), (null, A), (null, N), (null, A), (null, $)}

As for Occ(c), it is shown in Table 1. After the initialization, the initial ISA set is:

ISA = {(0, 3), (1, 0), (2, 4), (3, 0), (4, 4), (5, 0), (6, 6)}

After the first iteration, the shifted tuples are:

Shifted = {(−1, (−3, 0)), (0, (0, 0)), (1, (−4, 0)), (2, (0, 0)), (3, (−4, 0)), (4, (0, 0)), (5, (−6, 0))}

After the pairing we obtain the set:

Pairs = {(0, (3, 0)), (1, (0, 4)), (2, (4, 0)), (3, (0, 4)), (4, (4, 0)), (5, (0, 6)), (6, (6, 0))}

Finally, we sort by value and we re-rank the indices. Then the new ISA is:

ISA = {(0, 3), (1, 0), (2, 4), (3, 0), (4, 4), (5, 2), (6, 6)}

We observe that the only rank updated in this iteration is the one with index 5; indeed, shifting by 1 makes it possible to distinguish among the prefixes AN, AN and A$ corresponding to the suffixes S_1, S_3 and S_5.

The presented algorithm has been evaluated on real datasets taken from the Pizza & Chili website [11], where a set of text collections of various types and sizes is available to experimentally test compressed indexes. In particular, the text collections stored on this website have been selected to form a representative sample of different applications where indexed text searching might be useful. From this collection, we have chosen the following three datasets:
• PROTEINS, containing a sequence of newline-separated protein sequences obtained from the Swissprot database.

• DNA, a sequence of newline-separated gene DNA sequences obtained from files of the Gutenberg Project.

• ENGLISH, the concatenation of English text files selected from collections of the Gutenberg Project.

Input                     Time (Competitor)   Time (Proposed Algorithm)
Proteins.200MB (25 MB)    4.25 minutes        3.86 minutes
Proteins.200MB (100 MB)   3.05 hours          11.36 minutes
Proteins.200MB (Full)     -                   22.03 minutes
Dna.200MB                 -                   26.78 minutes
English.1024MB            -                   3.4 hours

Table 2: Performance comparison between our algorithm and its competitor.

The main competitor of our approach is the algorithm proposed in [12], which is based on MapReduce as well, and divides the suffix array construction into multiple independent ranges that are then independently solved. In more detail, the partition points are selected and sorted, the ranges are balanced across processing cores, and the final output forms a total order over the array.

We have implemented in Apache Spark both the algorithm described here and the approach proposed in [12], in order to provide a suitable comparison, and we have run them on the GARR Cloud Platform. In particular, we have configured the cluster with master and slave nodes, each node provided with VCores, RAM and disk storage, and we have used Apache Hadoop and Spark.

Results are shown in Table 2 (when the running time was excessively large, it has not been reported). For the PROTEINS dataset, we have considered the first 25 MB only, the first 100 MB only, and the full dataset.

It is evident that our approach outperforms its competitor on all the considered datasets. This is what we expected since, as already discussed in the Introduction, we have presented the first method which introduces parallelism in the computation of the BWT itself, allowing to fully benefit from cloud computing. It is worth noticing that the algorithm in [12] has been implemented here in Apache Spark; therefore, it has been compared on equal terms with ours (i.e., the difference in performance cannot be attributed to the fact that Spark allows optimizations with respect to Hadoop for a more efficient use of memory).
Conclusions

We have proposed a MapReduce algorithm for the implementation of a full-text index, that is, the Burrows-Wheeler transform. We have implemented our approach in Apache Spark, and we have shown by experimental evaluation that it is more efficient than its competitors already proposed in the literature.

Among the various applications where an efficient and distributed implementation of the BWT may be useful (e.g., data compression, pattern matching, etc.), and with special attention to the Precision Medicine context, we mention that searching for suitable combinations of indexing and Machine Learning techniques has recently emerged as a promising line of research [7, 13, 5]. Therefore, we plan to focus our future studies in this direction.
Acknowledgments

Part of the research presented here has been funded by the MIUR-PRIN research project "Multicriteria Data Structures and Algorithms: from compressed to learned indexes, and beyond", grant n. 2017WR7SHH.
References

[1] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53–86, 2004.
[2] José Manuel Abuín, Juan Carlos Pichel, Tomás F. Pena, and Jorge Amigo. BigBWA: approaching the Burrows-Wheeler aligner to Big Data technologies. Bioinformatics, 31(24):4003–4005, 2015.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72–77, 2010.
[4] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005.
[5] Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proc. VLDB Endow., 13(8):1162–1175, 2020.
[6] Patrick Flick and Srinivas Aluru. Parallel distributed memory construction of suffix and longest common prefix arrays. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, pages 16:1–16:10, 2015.
[7] Daniel J. Graham and Brian P. Robinson. On the internal correlations of protein sequences probed by non-alignment methods: Novel signatures for drug and antibody targets via the Burrows-Wheeler transform. Chemometrics and Intelligent Laboratory Systems, 193:103809, 2019.
[8] Vahid Jalili, Matteo Matteucci, Marco Masseroli, and Stefano Ceri. Indexing next-generation sequencing data. Inf. Sci., 384:90–109, 2017.
[9] Wan-Ping Lee, Michael P. Stromberg, Alistair Ward, Chip Stewart, Erik P. Garrison, and Gabor T. Marth. MOSAIK: A hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One, 9(3):e90581, 2014.
[10] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[11] Giovanni Manzini and Gonzalo Navarro. The Pizza & Chili corpus home page. Web site: http://pizzachili.dcc.uchile.cl, 2007.
[12] Rohith K. Menon, Goutham P. Bhat, and Michael C. Schatz. Rapid parallel genome indexing with MapReduce. In Proceedings of the Second International Workshop on MapReduce and Its Applications, MapReduce '11, pages 51–58, New York, NY, USA, 2011. Association for Computing Machinery.
[13] Edward Raff, Charles Nicholas, and Mark McLean. A new Burrows Wheeler transform Markov distance, 2019.
[14] Bertil Schmidt and Andreas Hildebrandt. Next-generation sequencing: big data meets high performance computing. Drug Discovery Today, 22(4):712–717, 2017.
[15] Jason L. Vassy, Bruce R. Korf, and Robert C. Green. How to know when physicians are ready for genomic medicine. Science Translational Medicine, 7(287):287fs19, 2015.
[16] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012.