Efficient construction of the extended BWT from grammar-compressed DNA sequencing reads
Diego Díaz-Domínguez, CeBiB – Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile
Gonzalo Navarro, CeBiB – Centre for Biotechnology and Bioengineering, Department of Computer Science, University of Chile, Santiago, Chile
Abstract
We present an algorithm for building the extended BWT (eBWT) of a string collection from its grammar-compressed representation. Our technique exploits the string repetitions captured by the grammar to boost the computation of the eBWT. Thus, the more repetitive the collection is, the lower are the resources we use per input symbol. We rely on a new grammar recently proposed at DCC'21 whose nonterminals serve as building blocks for inducing the eBWT. A relevant application for this idea is the construction of self-indexes for analyzing sequencing reads, which are massive and repetitive string collections of raw genomic data. Self-indexes have become increasingly popular in Bioinformatics as they can encode more information in less space. Our efficient eBWT construction opens the door to perform accurate bioinformatic analyses on more massive sequence datasets, which are not tractable with current eBWT construction techniques.
Theory of computation → Data compression
Keywords and phrases
BWT, grammar compression, DNA sequencing reads
Digital Object Identifier
Funding
Diego Díaz-Domínguez: ANID Basal Funds FB0001 and Ph.D. Scholarship 21171332, Chile
Gonzalo Navarro: ANID Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile
DNA sequencing reads, or just reads, are massive collections of short strings that encode overlapping segments of a genome. In recent years, they have become more accessible to the general public, and nowadays, they are the most common input for genomic analyses. This poses the challenge of the high computational cost of extracting information from them. Most bioinformatic pipelines must resort to lossy data structures or heuristics, because reads are too massive to fit in main memory [27]. The FM-index's [11] success in compressing and indexing DNA sequences [18, 17] opened the door to a new family of techniques that store the full read data compactly in main memory. This representation is appealing because it retains more information in less space, enabling more accurate biological results [16]. One popular FM-index variant for read sets is based on the extended Burrows-Wheeler Transform (eBWT) [22, 1, 23]. The bioinformatics and stringology communities are aware of the potential of the eBWT and have been proposing genomic analysis algorithms on top of eBWT-based self-indexes for years [7, 9, 14, 25].

A bottleneck for the application of those analyses on massive read collections is the high computational cost and memory requirements to build the eBWT. Some authors have proposed semi-external construction algorithms to solve this problem [1, 10, 2]. Still, these techniques slow down rapidly as the input read collection grows.

© Diego Díaz and Gonzalo Navarro; licensed under Creative Commons License CC-BY 4.0

The most expensive aspect of computing the eBWT is the lexicographical sorting of the strings' suffixes.
In this regard, a recent work [3] proposed to speed up the process by exploiting the fact that read sets with high coverage exhibit a good degree of repetitiveness. The goal is to first parse the input to capture long repetitive phrases in the text so as to carry out the suffix comparisons only on the distinct phrases, and then extrapolate the results to the rest of the text. This idea works well for sets of assembled genomes, where repetitiveness is extremely high, but not for reads, where the repetitive phrases are much shorter and scattered through the strings. Read sets are much more frequent than fully assembled genomes in applications.

Recently, we presented a fast and space-efficient grammar compressor aimed at read collections [8]. Apart from showing that grammar compression yields significant space reductions on those datasets, we sketched the potential of the resulting grammar to compute the eBWT directly from it. Similarly to the idea of Boucher et al. [3], this method takes advantage of the short repetitive patterns in the reads to reduce the computational requirements. To the best of our knowledge, there are no other repetition-aware algorithms to build the eBWT. An efficient realization of this idea would reduce the computational requirements for indexing reads, thus allowing more accurate genomic analyses on massive datasets.
Our contribution. In this paper we give a detailed description and a parallel implementation of the algorithm for building the eBWT from the grammar of Díaz and Navarro [8]. Our approach exploits not only the repetitions captured by the grammar but also the runs of equal symbols in the eBWT. This makes the time and space per input symbol we require for the construction decrease as the input becomes more repetitive. Our experiments on real datasets showed that our method is competitive with the state-of-the-art algorithms, and that it can be the most efficient when the input is massive and has high DNA coverage.
Consider a string T[1..n − 1] over alphabet Σ[2..σ], and the sentinel symbol Σ[1] = $, which we append at the end of T. The suffix array (SA) [21] of T is a permutation of [n] that enumerates the suffixes T[i..n] of T in increasing lexicographic order, T[SA[i]..n] < T[SA[i + 1]..n]. The BWT [4] is a permutation of the symbols of T obtained by extracting the symbol that precedes each suffix in SA, that is, BWT[i] = T[SA[i] − 1] (assuming T[0] = T[n] = $). A run-length compressed representation of the BWT [20] adds sublinear-size structures that compute, in logarithmic time, the so-called LF step and its inverse: if BWT[j] corresponds to T[i] and BWT[j′] to T[i − 1] (or to T[n] = $ if i = 1), then LF(j) = j′ and LF^{-1}(j′) = j. Note that LF regards T as a circular string.

Let T = {T_1, T_2, ..., T_m} be a collection of m strings of average size k. We then define the string T[1..n] = T_1 $ T_2 $ ... T_m $. The extended BWT (eBWT) of T [6] regards it as a set of independent circular strings: the BWT of T is slightly modified so that, if eBWT[j] corresponds to T_i[1] inside T, then LF(j) = j′, so that eBWT[j′] corresponds to the sentinel $ at the end of T_i, not of T_{i−1}.

LOUDS [15] is a succinct representation that encodes an ordinal tree T′ with t nodes into a bitmap B[1..2t + 1], by traversing its nodes in level-wise order and writing down their arities in unary. The nodes are identified by the position where their description starts in B. Adding o(t) bits on top of B enables constant-time operations like parent(u) (the parent of node u), child(u, i) (the i-th child of u), psibling(u) (the sibling preceding u), nodemap(u) (the level-wise rank of node u), leafrank(u) (the number of leaves in level order up to leaf u), internalrank(u) (the rank of the internal node u in level order), and internalselect(r) (the identifier of the r-th internal node in level order).

Grammar compression consists of encoding T as a small context-free grammar G that only produces T. Formally, a grammar is a tuple (V, Σ, R, S), where V is the set of nonterminals, Σ is the set of terminals, R is the set of replacement rules, and S ∈ V is the start symbol. The right-hand side of S → C ∈ R is referred to as the compressed form of T. The size of G is usually measured in terms of r = |R|, the number of rules; g, the sum of the lengths of the right-hand sides of R; and c = |C|, the size of the compressed string. The more repetitive T is, the smaller these values are.

LMSg [8] is an iterative algorithm aimed at grammar-compressing string collections.
It is based on the concept of induced suffix sorting of Nong et al. [24]. The following definitions of [24] are relevant for LMSg:

▶ Definition 1. A character T[i] is called L-type if T[i] > T[i + 1], or if T[i] = T[i + 1] and T[i + 1] is also L-type. Symmetrically, T[i] is said to be S-type if T[i] < T[i + 1], or if T[i] = T[i + 1] and T[i + 1] is also S-type. By default, symbol T[n], the one with the sentinel, is S-type.

▶ Definition 2. T[i] is called LMS-type if T[i] is S-type and T[i − 1] is L-type.

▶ Definition 3.
An LMS substring is (i) a substring T[i..j] with both T[i] and T[j] being LMS-type characters, and with no other LMS-type character in the substring, for i ≠ j; or (ii) the sentinel itself.

In every iteration i, LMSg scans the input text T_i (T_1 = T) from right to left to compute the LMS substrings. For each LMS substring T_i[j..j′], the algorithm discards the first symbol. If the remaining phrase F = T_i[j + 1..j′] has length two or more and at least one of its symbols is repeated in T_i, then it records F in a set D_i. If F does not meet the condition, then it discards the phrase and inserts its characters into another set I_i. Additionally, when F is the suffix-prefix concatenation of two consecutive strings of T, LMSg splits it in two halves, F_l = F[1..u] and F_r = F[u + 1..|F|], where F[u] contains the dummy symbol. F_l and F_r are two independent phrases that can be inserted into either D_i or I_i.

After the scan, LMSg sorts the phrases in I_i ∪ D_i in lexicographical order. If F ∈ D_i is a prefix of another phrase F′ ∈ D_i, then the shorter one gets the greater lexicographical rank (see [24]). After sorting, the algorithm creates a new rule X → F for every F ∈ D_i, where X = r′ + b, with r′ the highest nonterminal in V before iteration i and b the lexicographical rank of F in D_i ∪ I_i. Every symbol Y ∈ I_i is a nonterminal that already exists in V, with a rule Y → E, with E ∈ D_{i′} and i′ < i. Hence, LMSg updates its value to Y = r′ + x, where x is the lexicographical rank of Y in D_i ∪ I_i. It also updates the occurrences of Y in the right-hand sides of R to maintain the correctness of the grammar. The last step in the iteration is to replace the partition phrases in T_i with their nonterminal values. This process yields a new text T_{i+1} for the next iteration. The iterations stop when, after scanning T_i, no phrases are inserted into D_i. In such a case, the algorithm creates the rule for the start symbol as S → T_i. A graphical example of the procedure is shown in Figure 1A.

Post-processing the grammar. LMSg collapses the grammar by decreasing every nonterminal X ∈ V to the smallest available symbol X′ ∉ V ∪ Σ. After the collapse, the algorithm recursively creates new rules for the repeated suffixes of size two in the right-hand sides of R. These extra rules are called SuffPair rules. The final grammar for the example of Figure 1A is shown in Figure 1B. For simplicity, the symbols are not collapsed in the example.

The LMSg algorithm ensures the following properties in the grammar to build the eBWT:

▶ Lemma 4.
For two different nonterminals X, Y ∈ V produced in the same iteration of LMSg, if X < Y, then the suffixes of T whose prefixes are compressed as X are lexicographically smaller than the suffixes whose prefixes are compressed as Y.

▶ Lemma 5.
Let S = F[u..|F|] be a suffix of a right-hand side F in R. If |S| > 1 and S appears as a prefix in some suffix of T_i, then we can use S to get the lexicographical rank of that suffix among the other suffixes prefixed with sequences other than that of S.

▶ Definition 6.
The grammar is string independent if the recursive expansion of every T_i[j] spans at most one string T_j ∈ T.

For further detail on these properties, please see [8].
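To make Definitions 1 to 3 concrete, the following Python sketch (our own illustration, not part of the LMSg implementation; function names and the example string are ours) classifies every position of a string, sentinel included, and cuts out its LMS substrings:

```python
def classify(T):
    """Classify every position of T (sentinel included) as L-type or
    S-type (Definition 1) and list the LMS-type positions (Definition 2)."""
    n = len(T)
    stype = [False] * n
    stype[n - 1] = True  # the sentinel position is S-type by default
    for i in range(n - 2, -1, -1):  # right-to-left scan, as in LMSg
        stype[i] = T[i] < T[i + 1] or (T[i] == T[i + 1] and stype[i + 1])
    lms = [i for i in range(1, n) if stype[i] and not stype[i - 1]]
    return stype, lms

def lms_substrings(T):
    """Cut T into its LMS substrings (Definition 3): the segments between
    consecutive LMS-type positions, both endpoints included, plus the
    sentinel itself."""
    _, lms = classify(T)
    subs = [T[i:j + 1] for i, j in zip(lms, lms[1:])]
    subs.append(T[-1])  # the sentinel is an LMS substring on its own
    return subs
```

For instance, `lms_substrings("gtattacc$")` yields `["atta", "acc$", "$"]`; LMSg would then discard the first symbol of each substring before recording the phrases.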
Figure 1 (A) Running example of LMSg. The symbols in gray below T are character types (L-type = L, S-type = S, LMS-type = S*). Dashed vertical lines mark the limits between the strings in T. Every horizontal line on top of T spans one of the phrases generated in iteration one of LMSg. The parse tree of the grammar is depicted on top of T. Light gray nonterminals have frequency one in T_i. Dashed edges indicate symbols whose enclosing phrases were discarded for D_i. The character at the top of every nonterminal denotes its suffix type. (B) Rules after postprocessing the grammar of (A). For clarity, the nonterminals were not collapsed. Dark gray characters are SuffPair nonterminals. The character S below R is the start symbol of G.

The grammar tree. Díaz and Navarro use a slightly modified version of the grammar tree data structure of [5] to encode G. The construction algorithm is as follows; it starts by creating an empty tree P. Then, it scans the parse tree of G in level order. Every time the traversal reaches a new node y, the algorithm obtains its label X. If the extracted symbol is a nonterminal and it is the first time it appears, then the algorithm creates a new internal node u with |F| children in P, where F is the replacement of X in R. The label l of u is the number of internal nodes in P so far plus σ, the number of terminals in G. If, on the other hand, y is not the first parse-tree node labeled with X we visit in the traversal, then the algorithm creates a leaf labeled with l in P and discards the subtree of y. Finally, if X is a terminal, then the algorithm creates a new leaf labeled with X. The shape of P is then stored using LOUDS, and the leaf labels are stored in a compressed array [26]. The internal node labels are not explicitly stored but retrieved on the fly by using the LOUDS operation internalrank. The grammar tree for the grammar of Figure 1B is shown in Figure 2. The figure also shows how to simulate in P a traversal over the parse tree of G by using the LOUDS operations.

Figure 2 (A) The grammar tree of Figure 1B. Numbers on top of the internal nodes are the original nonterminals of the grammar. Dashed arrows simulate a traversal over the parse tree of G to decompress the word ta from T[14..15] (Figure 1). (B) LOUDS encoding for (A). The bitstream stores the shape of the tree. Gray numbers on top are the bit indexes. Gray numbers below the stream are the internal ranks of the nodes. The integer vector below the stream contains the leaf labels. Dashed arrows mark the same decompression path as in (A), but using the LOUDS functions: (1) child(3, 3) = 16, (2) child(16, 2) = 41, (3) label(41) = 15, (4) internalselect(15 − σ) = 37.

The grammar tree discards the original nonterminal values, but they are necessary to build the eBWT. Díaz and Navarro gave an algorithm called GLex that reconstructs those values directly from the shape of P. The procedure yields a set of h triplets, where h is the number of iterations of LMSg. Each iteration i has its triplet (L_i, R_i, f_i). The set L_i contains the labels of P for the phrases in D_i ∪ I_i, the set R_i has the lexicographical ranks of those phrases, and f_i is a function f_i(l) = b that maps a label l ∈ L_i to its rank b ∈ R_i. For a detailed explanation of the method, we refer the reader to [8].

For simplicity, we consider R_{i−1} to be the alphabet of T_i during the construction of the eBWT, although this is not strictly true. Every nonterminal X = r′ + b ∈ V in T_i is the sum of r′, the highest nonterminal before iteration i, and b, the lexicographical rank of the phrase F ∈ D_{i−1} ∪ I_{i−1} to which X is assigned. In other words, GLex only recovers the b part of X. Still, replacing the nonterminals of T_i with their ranks in R_{i−1} makes no difference for inferring the eBWT, as we will see in later sections.
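The grammar tree operations above all reduce to rank and select queries on the LOUDS bitmap. The following naive Python sketch (our own illustration; the toy tree, function names and linear-scan rank/select are ours, whereas a real implementation such as SDSL-lite answers these queries in constant time) shows the encoding and the child/parent formulas:

```python
# LOUDS sketch: traverse the tree level-wise and write each node's arity
# in unary (one 1-bit per child, closed by a 0-bit), after a virtual
# super-root. A node is identified by the 1-based position where its
# description starts in the bitmap.

def louds_encode(children):
    """children: dict mapping a node id to the ordered list of its
    children, with node 0 as the root. Returns the LOUDS bitmap."""
    bits = [1, 0]                      # super-root with a single child
    queue = [0]
    while queue:
        u = queue.pop(0)
        kids = children.get(u, [])
        bits += [1] * len(kids) + [0]  # arity of u in unary
        queue += kids
    return bits

def select(bits, kind, r):
    """1-based position of the r-th bit equal to `kind`."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == kind:
            seen += 1
            if seen == r:
                return pos
    raise ValueError("not enough bits of that kind")

def rank(bits, kind, pos):
    """Number of bits equal to `kind` in bits[1..pos]."""
    return sum(1 for b in bits[:pos] if b == kind)

def child(bits, v, i):
    """i-th child of the node whose description starts at position v."""
    return select(bits, 0, rank(bits, 1, v - 1 + i)) + 1

def parent(bits, v):
    """Parent of the node whose description starts at position v."""
    p = select(bits, 1, rank(bits, 0, v - 1))  # edge bit leading to v
    return select(bits, 0, rank(bits, 0, p)) + 1
```

For a toy tree with root 0, children 1 and 2, and a grandchild 3 under node 1, the bitmap is 101101000; the node identifiers are positions 3, 6, 8 and 9, and child(B, 3, 2) = 8 recovers the second child of the root.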
Our algorithm for computing the eBWT of T is called infBWT. It first produces the eBWT B_h of C = T_h. Then, it iterates from h to 2 to infer the eBWT B_{i−1} of T_{i−1} from the already computed transform B_i. When the iterations finish, infBWT returns B_1 as the eBWT of T. The overview of the procedure is depicted in Algorithm 1. Line 6 is explained later.

Algorithm 1 Overview of infBWT

proc infBWT(P) ▷ returns the eBWT of T
    GLex(P) ▷ produces the h triplets (L_i, R_i, f_i)
    Load triplet h from disk
    Compute the eBWT B_h of C from triplet h and P
    for i = h to 2 do
        Replace f_i with f_i^{inv} in triplet i
        Load triplet i − 1
        B_{i−1} ← nextBWT(B_i, f_i^{inv}, L_{i−1}, R_{i−1}, f_{i−1})
        Discard B_i and triplet i
    end for
    return B_1
end proc

Unlike the regular BWT, the position of each C[j] in the eBWT does not depend on the whole suffix C[j + 1..c], but on the string S = C[j + 1..j + p′] C[j − p..j]. This sequence is a circular permutation of the compressed string T_x ∈ T encoded in the range C[j − p..j + p′]. Computing S from the grammar tree P is simple, as G is string independent (see Definition 6). This feature implies that every T_x ∈ T maps to a specific range C[k..k′], with k ≤ k′. Therefore, we do not have to deal with border cases, for instance, when the compressed suffix or prefix of T_x lies in between two consecutive symbols of C.

For constructing the eBWT of C we require P and (L_h, R_h, f_h), the last triplet produced by GLex. Given the definition of P, we can easily obtain the root child v encoding C[j] as v = nodeselect(j + 1). Once we retrieve v, we obtain C[j] with f_h(label(v)). For accessing the circular string S to the right of C[j] we define the function cright. This procedure receives as input a position j ∈ [1..c] and returns another position j′ ∈ [1..c] such that C[j′] is the circular right context of C[j]. We use cright as the underlying operator for another function, ccomp. This method compares lexicographically two circular permutations located at different positions of C. Similarly, we define a function cleft that returns the circular left context of C[j]. We use it to get the eBWT symbols once we sort the circular permutations. To support these operations, we consider the border cases C[k′ + 1] = C[k] and C[k − 1] = C[k′] for every T_x. These exceptions require us to include a bitmap U[1..c] that marks as U[j] = 1 every root child in P whose recursive expansion is suffixed by a dummy symbol. The functions cleft, cright and ccomp are described in Algorithm 2.

We start the computation of the eBWT of C by creating a table A[1..c] with |R_h| lexicographical buckets. Then, we scan the children of the root of P from left to right, and for every node v, we store its child rank in the leftmost available cell of bucket f_h(label(v)). This process yields a partial sorting of the circular permutations of C; every bucket b contains the strings that start with symbol b. To finish the sorting, we apply a local quicksort in every bucket, using ccomp as the comparison function. Notice these calls to quicksort should be fast, as most of the buckets have few elements and the number of right contexts we have to access in every comparison is small. Finally, we produce B_h by scanning A from left to right and pushing every symbol f_h(label(nodeselect(cleft(A[j]) + 1))) with j ∈ [1..|A|].

Algorithm 2
Functions to simulate circularity over C

Require: A bitmap U[1..|C|] marking the symbols of C expanding to phrases suffixed by $.

proc cright(j) ▷ returns a j′ such that C[j′] is the circular right context of C[j]
    if U[j] then
        j ← j − 1
        while U[j] is false do
            j ← j − 1
        end while
    end if
    return j + 1
end proc

proc cleft(j) ▷ returns a j′ such that C[j′] is the circular left context of C[j]
    if U[j − 1] then
        while U[j] is false do
            j ← j + 1
        end while
        return j
    else
        return j − 1
    end if
end proc

proc ccomp(a, b) ▷ circular lexicographical comparison of C[a] and C[b]
    r_1 ← f_h(label(nodeselect(a + 1)))
    r_2 ← f_h(label(nodeselect(b + 1)))
    while r_1 = r_2 do
        a ← cright(a), b ← cright(b)
        r_1 ← f_h(label(nodeselect(a + 1)))
        r_2 ← f_h(label(nodeselect(b + 1)))
    end while
    return r_1 < r_2
end proc

We define a method called nextBWT for inferring B_{i−1} from B_i (Line 8 of Algorithm 1). This procedure requires us to have a mechanism to map a symbol B_i[j] to its phrase F ∈ D_{i−1} ∪ I_{i−1}. We support this feature by replacing f_i with an inverted function f_i^{inv} that receives a rank b ∈ R_i and returns its associated P label l ∈ L_i (Line 6 of Algorithm 1). Thus, we obtain the grammar tree node v that encodes F with internalselect(l − σ). To spell the sequence of F we proceed as follows; we use P to simulate a pre-order traversal over the subtree of v in the parse tree of G. When we visit a new node v′, we check if its label l′ = label(v′) belongs to L_{i−1}. If so, then we push f_{i−1}(l′) to F and skip the parse subtree under v′. The process stops when we reach v again. We call this process the level-decompression of F, or ldc(v) = F.

If we level-decompress all the phrases encoded in B_i, then we obtain all the symbols of T_{i−1}, although not sorted according to the eBWT's definition. The position of each decompressed symbol F[u] ∈ R_{i−1} in B_{i−1} depends, in most of the cases, only on the suffix F[u + 1..|F|], except when this suffix has length less than two (Lemma 5). In addition, if two symbols F[u] and F′[u′], level-decompressed from different positions B_i[j] and B_i[j′] (respectively), are followed by the same suffix in their respective phrases F and F′, then their relative orders in B_{i−1} only depend on the values of j and j′. This property is formally defined as follows:

▶ Lemma 7.
Let B_i[j] and B_i[j′] be two eBWT symbols at different positions j and j′, with j < j′, and whose level-decompressed phrases are F and F′, respectively. Also let S_j and S_{j′} be suffixes of F and F′ with the same sequence S. The occurrence S_j is lexicographically smaller than S_{j′}, as it is level-decompressed first in B_i.

Proof. As S_j and S_{j′} are equal, their relative orders depend on the lexicographical ranks of the phrases to the (circular) right of F and F′ in the partition of T_{i−1}. As B_i[j] appears before (from left to right) B_i[j′], its right context is lexicographically smaller than the right context of B_i[j′]. Therefore, S_j is also lexicographically smaller than S_{j′}. ◀

If we generalize Lemma 7 to x ≥ 2 occurrences of S, then we can use the following lemma for building B_{i−1}:

▶ Lemma 8.
Let S be a string of length at least two, and with x occurrences in T_{i−1}. Let J = j_1, j_2, ..., j_x be the positions of B_i encoding the phrases where those occurrences appear as suffixes. Assume we scan J from left to right and, for every suffix occurrence of S, we extract its left symbol and push it to a list O. The resulting list will match a consecutive range B_{i−1}[o..o′] of length o′ − o + 1 = x.

Finding the range of O in B_{i−1} requires sorting the distinct suffixes in D_{i−1} ∪ I_{i−1} in lexicographical order. Thus, if S has rank b, then O is the b-th range of B_{i−1}. We refer to these suffixes as the context strings of the symbols in B_{i−1}. The problem, however, is that the suffixes in D_{i−1} of length one are ambiguous, as we cannot assign them a lexicographical order (Lemma 5). We solve this problem by extending the occurrences of the ambiguous context strings with the phrases in D_{i−1} ∪ I_{i−1} that occur to their (circular) right in T_{i−1}. By doing this extension, we induce a partition over the O list of an ambiguous S; the symbols in the first range O_X = O[z..z′] precede the occurrences of the extended phrase SX, the symbols in the second range O_Y = O[z′ + 1..z′′] precede the occurrences of another phrase SY, and so on. In this way, every segment of O can be placed in some contiguous range of B_{i−1}.

We build an FM-index from B_i to compute the string extensions, but without including the SA. The idea is that when we extract the occurrence of an ambiguous S from B_i[j], we use LF^{-1}(j) = j′ to get the position in B_i with the symbol that occurs to the right of B_i[j] in T_i. In this way, we concatenate S to the phrase F obtained from level-decompressing the symbol B_i[j′]. Equivalently, we use LF to find the left symbols of the context strings that are not proper suffixes of their enclosing phrases.

When we sort the distinct context strings, we reference them by using nodes of P. If S has length two and appears as a suffix in different right-hand sides of R, then there is a SuffPair internal node in P for it (nodes y_1 and y_2 of Figure 3). If S appears as a suffix in only one right-hand side, then there is a node (leaf or internal) that we can use as identifier (node y_3 of Figure 3). Finally, if S is not a proper suffix in the right-hand side of the rule, then we use the internal node in P for that rule (node v of Figure 3).

We begin nextBWT by initializing two empty semi-external lists, Q and Q′. Then, we scan B_i from left to right. For every B_i[j], we first perform LF(j) = j′ to find the position in B_i with the symbol that occurs to the left of B_i[j] in T_i. Then, we use the function f_i^{inv} to obtain the internal nodes u and v in P that encode the phrases of B_i[j′] and B_i[j], respectively. Subsequently, we level-decompress the phrase W ∈ D_{i−1} ∪ I_{i−1} of u and push the pair (b, v) to Q, where b ∈ R_{i−1} is the last symbol of W. The next step is to level-decompress the phrase of v. During the process, when we reach a node whose label is in L_{i−1}, we get its associated symbol b ∈ R_{i−1} and its right sibling y. The next action depends on the value of l_y = label(y). If l_y belongs to L_{i−1} and y is not the rightmost child of its parent, then we push the pair (b, y) to Q. If, on the other hand, y is the rightmost child of its parent, but l_y does not belong to L_{i−1}, then we update its value to y = internalselect(l_y − σ) and then push (b, y) into Q. The purpose of the internalselect operation is to get the internal node with the first occurrence of l_y in P. The last option is that l_y belongs to L_{i−1} and y is the rightmost child of its parent. In such a situation, we extend the context of y. We use LF^{-1}(j) = j′′ to get the position with the symbol to the right of B_i[j].
As before, we get the internal node z associated to B_i[j′′] and finally push the triplet (b, y, z) to Q′.

Once we complete the scan of B_i, we group the pairs in Q according to y, without changing their relative orders. In Q′ we do the same, but we sort the elements according to (y, z). Subsequently, we form a sorted set U with the distinct y symbols of Q plus the distinct pairs (y, z) of Q′. To get the relative order of two elements in U, we level-decompress their associated phrases and compare them in lexicographical order. For the pairs (y, z), their phrases are obtained by concatenating the level-decompressions of y and z. Finally, if a given value y ∈ Q has rank b among the other elements of U, then we place the left symbols of its range in Q at the b-th range of B_{i−1}. The same idea applies for the pairs (y, z) of Q′. Algorithm 3 implements nextBWT, and Figure 3 depicts our method for processing B_i[j].

The nextBWT algorithm works provided all the occurrences of a context string S appear as suffixes in the phrases of D_{i−1} ∪ I_{i−1}. Still, this is not always the case, as sometimes they occur in between nonterminals.

▶ Definition 9.
Let F = XYZ be a string in D_{i−1} with an occurrence in T_{i−1}[j..j + 2]. This substring is said to be an implicit occurrence of F if, during the parsing of LMSg, T_{i−1}[j] = X becomes a suffix of the phrase at T_{i−1}[j − p..j], with p ≥ 1, and T_{i−1}[j + 1..j + 2] = YZ is considered another phrase A ∈ D_{i−1}.

An implicit occurrence appears when F[1] is classified as LMS-type during the parsing. We use the following lemma to detect this situation:

▶ Lemma 10.
The node v encoding F in P has two children: the left one has a label in L_{i−1}, and the right one is a SuffPair node whose label is in L_i.

Proof. Assume the new rules for F and A after partitioning T_{i−1} are F → XYZ and A → YZ, respectively. As YZ is a repeated suffix, LMSg has to create a SuffPair nonterminal for it, but it already exists: it is A. Thus, LMSg reuses it and replaces YZ with A in F's rule. ◀

Before running nextBWT, we scan L_i from left to right to find the internal nodes of P that meet Lemma 10. For every node v′ that meets the lemma, we create a pair p = (l_l, l_r),
with the labels in P of its left and right children, respectively. Then, we record p in a hash table H, with v′ as its value. During the execution of nextBWT, when we level-decompress the symbol to the left of B_i[j] (Line 7 of Algorithm 3), we check if the pair formed by the grammar tree label of W[|W|] and the label f_i^{inv}(B_i[j]) has an associated value v′ in H. If that happens, then we insert (W[|W| − 1], v′) into Q.

Algorithm 3 Inferring B_{i−1} from B_i

Require: P

proc nextBWT(B_i, f_i^{inv}, L_{i−1}, R_{i−1}, f_{i−1}) ▷ produces B_{i−1} from B_i
    Q ← Q′ ← ∅
    for j = 1 to |B_i| do
        j′ ← LF(j) ▷ left symbol of B_i[j] in T_i
        v ← internalselect(f_i^{inv}(B_i[j]) − σ)
        u ← internalselect(f_i^{inv}(B_i[j′]) − σ)
        W ← ldc(u)
        push pair (W[|W|], v) to Q
        if label(v) ∉ L_{i−1} then ▷ v decompresses to a phrase in D_{i−1}
            b ← f_{i−1}(label(child(v, 1)))
            y ← child(v, 2)
            while true do
                if nsibling(y) ≠ 0 then ▷ y is not the rightmost child of its parent
                    push pair (b, y) to Q
                    b ← f_{i−1}(label(y))
                    y ← nsibling(y)
                else if label(y) ∉ L_{i−1} then ▷ SuffPair node
                    y ← internalselect(label(y) − σ)
                    push pair (b, y) to Q
                    b ← f_{i−1}(label(child(y, 1)))
                    y ← child(y, 2)
                else
                    j′′ ← LF^{-1}(j) ▷ right symbol of B_i[j] in T_i
                    z ← internalselect(f_i^{inv}(B_i[j′′]) − σ)
                    push triplet (b, y, z) to Q′
                    break
                end if
            end while
        end if
    end for
    U ← distinct y values of Q ∪ distinct (y, z) pairs of Q′
    Sort the strings encoded in U
    Reorder the elements of Q ∪ Q′ according to the ranks of U
    B_{i−1} ← left symbols of Q ∪ Q′
    return FM-index of B_{i−1}
end proc

Run-length decompression.
As we descend over the iterations of infBWT, the size of B_i increases, but its alphabet decreases. This means that in the last steps of the algorithm we end up decompressing a small number of phrases several times. To reduce the time overhead produced by this monotonous decompression pattern, we exploit the runs of equal symbols in B_i. More specifically, if a given range B_i[j..j′] of length x = j′ − j + 1 contains the same symbol X, then we level-decompress the associated phrase once and multiply the result by x instead of performing the same operations x times. By Lemma 7, we already know that the repeated symbols resulting from this run-length decompression will be placed in consecutive ranges of B_{i−1}. Therefore, when we process a given run (x, X) of B_i, we do not push every pair (b, y) x times to Q, but insert (x, b, y) only once. We do the same for the triplets (b, y, z) of Q′. Figures 3B and 3C show an example of how the runs of B_i are processed.

Unique context strings.
Another way of exploiting the repetitions in B_i is with the unique context phrases. Consider a nonterminal X encoded in the grammar tree node v. If a given child y of v has a label in L_{i−1}, then the context phrase S decompressed from y only appears under the subtree of v. This means that if X has x occurrences in B_i, then S has x occurrences in T_{i−1}, all with the same left context symbol. Computing the elements in B_{i−1} with unique context strings is simple; for each label l ∈ L_i, we get the internal node v = internalselect(l − σ) and its number of children a = nchildren(v). If a > 2, then we count the x occurrences of f_i(l) ∈ R_i in B_i by calling rank over the FM-index. Then, for every child y of v whose child rank is in the range [2..a − 1], we push the triplet (x, b, y) to Q, where b ∈ R_{i−1} is the symbol obtained from the left sibling of y. Then, during the scan of B_i, we just skip these y nodes. In infBWT, we carry out this process before inverting f_i (Line 6 of Algorithm 1). In Figure 3A, node y in the subtree of v encodes a unique context string.

Inferring the eBWT in parallel.
Our algorithm for building Q and Q′ can run in parallel because of Lemma 8. For doing so, we divide B^i into p chunks and run the for loop between lines 3-32 of Algorithm 3 independently on every one of them. This process yields p pairs of Q and Q′ lists, which we concatenate as Q = Q_1 ∪ Q_2 ∪ ... ∪ Q_p afterward (we do the same with the Q′ lists). Then we continue with the inference of B^{i−1} serially.

Sorting the context strings.
There is a trade-off between time and space in sorting U (Line 34 of Algorithm 3). On the one hand, decompressing the strings once and maintaining them in plain form to compare them can produce a considerable space overhead. On the other, decompressing them every time we access them might be slow in practice. We opted for something in between; we only access the strings when we compare them, but we amortize the decompression time overhead by doing the sorting in parallel. Our approach is simple; we use a serial counting sort to reorder the strings of U by their first symbols. We logically divide the resulting U into buckets; all the strings beginning with the same symbol are grouped in a bucket. To finish the sorting, we just perform quicksort in parallel in every bucket, decompressing the strings on demand.

We implemented infBWT as a
C++ tool called G2BWT. This software uses the
SDSL-lite library [12] and is available at https://bitbucket.org/DiegoDiazDominguez/lms_grammar/src/bwt_imp2. We compared the performance of G2BWT against the tools eGap (EG) [10], gsufsort-64 (GS64) [19] and BCR_LCP_GSA (BCR) [1]. EG and BCR are algorithms for constructing the eBWT of a string collection in external or semi-external settings, while GS64 is an in-memory algorithm for building the suffix array of a string collection, but
it can also build the eBWT. We also considered the tool bwt-lcp-em [2] for the experiments. Still, by default it builds both the eBWT and the LCP array, and there is no option to turn off the LCP construction, so we discarded it. For BCR, we used the implementation at https://github.com/giovannarosone/BCR_LCP_GSA. All the tools were compiled according to their authors' descriptions. For G2BWT, we used the compiler flags -O3 -msse4.2 -funroll-loops.

Figure 3 (A) Example processing of a symbol B^i[j] by the nextBWT algorithm. The subtree in the parse tree of G encoding the phrase F = ccgcca ∈ D^{i−1} ∪ I^{i−1} of B^i[j] is depicted below. The gray symbols are the nodes in P from which we decode the suffixes of F. Node y encodes a unique proper suffix, and the remaining suffix nodes are SuffPair suffixes. The left context of B^i[j] in T^i is represented in P with the internal node u, and its right context with the internal node z. (B) Pairs inserted to Q and Q′ during the processing of B^i[j]. The leftmost symbols are the frequencies of the suffixes. (C) Processing of B^i[j] in run-length mode. Arrays B^i and F are the last and first columns of the FM-index, respectively. In B^i, there is a run of length 4. Within the run, one symbol appears two times as the right context of the run symbol, and another appears three times as its left context. The nextBWT algorithm level-decompresses the corresponding phrases only once, and uses the run lengths to include the suffix frequencies in the tuples of Q and Q′. This information is represented with dashed lines on the left side of B^i. Dashed arrows from Q and Q′ to B^i indicate from which parts of the FM-index the suffix frequencies were extracted.

We used five read collections produced from different human genomes for building the eBWTs. We concatenated the strings so that dataset 1 contained one collection, dataset 2 contained two collections, and so on.
All the reads were 152 characters long and had an alphabet of six symbols (A,C,G,T,N,$). The input datasets are described in Table 1. During the experiments, we limited the RAM usage of EG to at most three times the input size. For BCR, we turned off the construction of the data structures other than the eBWT, and left the memory parameters at their default values. In the case of GS64, we used the parameter –bwt to indicate that only the eBWT had to be built. The other options were left at their defaults. For G2BWT, we first grammar-compressed the datasets using
LPG [8] and then used the resulting files as input for building the eBWTs. In addition, we let G2BWT use up to 18 threads. The other tools ran serially, as none of them supports multi-threading. All the competitor tools manipulate the input in plain form, while G2BWT processes the input in compressed space. In this regard, BCR, EG, and GS64 have an extra cost for decompressing the text that G2BWT does not have. To simulate that cost, we compressed the datasets using p7-zip and then measured the time for decompressing them. We assessed the performance of G2BWT first without adding that cost to BCR, EG, and GS64, and then adding it. All the experiments were carried out on a machine with Debian 4.9, 736 GB of RAM, and an Intel(R) Xeon(R) Silver processor @ 2.10GHz, with 32 cores.
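The decompression-cost simulation described above can be reproduced with a simple timing harness. The following is a minimal sketch, using Python's lzma module as a stand-in for the p7-zip binary (p7-zip's default codec is LZMA); the function name is ours:

```python
import lzma
import time

def decompression_cost(data: bytes) -> float:
    """Compress `data` once, then measure the per-symbol cost of
    decompressing it, mimicking the p7-zip overhead simulation."""
    blob = lzma.compress(data)
    start = time.perf_counter()
    out = lzma.decompress(blob)
    elapsed = time.perf_counter() - start
    assert out == data  # sanity check: lossless round trip
    return elapsed / len(data)  # seconds per input symbol
```

In the experiments this per-symbol cost is what gets added to the running times of BCR, EG, and GS64.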
Table 1 Input datasets for building the eBWTs. The compressed size is the space of the LPG representation. The percentage of eBWT runs of a dataset is measured as its number of eBWT runs divided by its total number of symbols, and multiplied by 100.

Number of collections | Plain size (GB) | Compressed size (GB) | % eBWT runs
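The run metric defined in the caption of Table 1 can be computed with a single scan of the eBWT. A minimal sketch (the function name is ours):

```python
def ebwt_run_stats(ebwt: str):
    """Count the maximal runs of equal symbols in an eBWT row and
    return (runs, runs / symbols * 100), the percentage of Table 1."""
    runs = sum(1 for j in range(len(ebwt))
               if j == 0 or ebwt[j] != ebwt[j - 1])
    return runs, 100.0 * runs / len(ebwt)
```

For instance, ebwt_run_stats("AAACCGTT") returns (4, 50.0): four runs over eight symbols. The lower this percentage, the more repetitive the dataset.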
The results of our experiments, without considering the decompression costs for BCR, GS64 and EG, are summarized in Figure 4. The fastest method for building the eBWT was GS64, with a mean time of 0.91 µsecs per input symbol. It is followed by BCR, G2BWT and EG, with mean times of 0.94, 1.32 and 2.61 µsecs per input symbol, respectively (Figure 4A). Regarding the working space, the most efficient method was BCR, with an average space of 0.17 bytes of RAM per input symbol. G2BWT is the second most efficient, with an average of 0.78 bytes. EG and GS64 are much more expensive, using 3.07 and 10.54 bytes on average, respectively (Figure 4B). Adding the decompression overhead to BCR, GS64 and EG increases the running times by 0.02 µsecs per input symbol in all the cases. This penalty is negligible because p7-zip is fast at decompressing the text.

Although G2BWT is not the most efficient method on average, it is the only one that becomes more efficient as the size of the collection increases. While the space and time functions of BCR, EG and GS64 seem to be linear with respect to the input size, with a non-negative slope in most of the cases, in G2BWT these functions resemble a decreasing logarithmic function (see Figures 5A and 5B). This behavior occurs because G2BWT processes several occurrences of a given phrase as a single unit, unlike the other methods. Thus, the cost of building the eBWT depends more on the number of distinct patterns in the input than on its size. As genomes are repetitive, appending new read collections to a dataset increases its size considerably, but not its number of distinct patterns. As a consequence, the per-symbol processing efficiency increases.

The performance improvement of G2BWT is also observed when we assess the trade-off between time and space (Figure 4C). Although BCR is the one with the best trade-off, the instances of G2BWT move toward the bottom-left corner of the graph as we concatenate more read collections.
In other words, the more massive and repetitive the input is, the less time and space we spend on average per input symbol to build the eBWT. This is an important remark, considering the input collections are not as repetitive as other types of texts. In most of the datasets, the number of eBWT runs is relatively high (see Table 1).

We estimated how many read collections we have to append to achieve a performance equal to or better than that of BCR. For that purpose, we built a linear regression for BCR and a logarithmic regression for G2BWT. Subsequently, we checked at which value of x the two models intersect (Figure 5). For time, we obtained a linear function y = a + b·x
Figure 4
Performance of the tools for building the eBWTs. These results do not include the decompression overhead for BCR, GS64, and EG.

for BCR, while for G2BWT we obtained a logarithmic function of the form y = c − d·ln x. The value of x where these two functions intersect is around 111 (dashed vertical line of Figure 5A). Thus, we expect BCR and G2BWT to have a similar performance when the size of the input read dataset is about 111 GB. For datasets above that value, we expect G2BWT to be faster. For the space, the fitted function for BCR was again linear, with a negative slope, and the one for G2BWT was again of the form y = c − d·ln x. In this case, the functions do not intersect. This is expected, as BCR is implemented almost entirely on disk. Therefore, its RAM consumption stays low and stable as the input increases. On the other hand, the RAM usage of G2BWT increases at a slow rate, but it still depends on the input size.
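The crossing point of a linear and a logarithmic model can be found numerically, e.g., by bisection on their difference. A minimal sketch with hypothetical coefficients (the coefficient values below are for illustration only, not the fitted ones):

```python
import math

def intersect(f, g, lo, hi, iters=100):
    """Find x with f(x) = g(x) by bisection on h(x) = f(x) - g(x);
    assumes h changes sign exactly once in [lo, hi]."""
    h = lambda x: f(x) - g(x)
    assert h(lo) * h(hi) < 0, "no sign change bracketed"
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h(lo) * h(mid) <= 0:
            hi = mid  # root lies in the left half
        else:
            lo = mid  # root lies in the right half
    return (lo + hi) / 2

# Hypothetical coefficients, for illustration only.
bcr_time = lambda x: 0.90 + 0.0004 * x           # linear model (BCR)
g2b_time = lambda x: 2.50 - 0.35 * math.log(x)   # logarithmic model (G2BWT)
crossing = intersect(bcr_time, g2b_time, 1, 1000)
```

Since the linear model is increasing and the logarithmic one is decreasing in its slope, the difference is monotone and the bracketed root is unique.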
Figure 5
Regressions for the performance of BCR and G2BWT in Figure 4. (A) Model for the time. The dashed vertical line indicates where the two functions intersect. (B) Model for the space.
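Both regression models of Figure 5 are linear in their parameters, so they can be fitted with ordinary least squares. A minimal sketch (the function name is ours):

```python
import math

def fit_least_squares(xs, ys, transform=lambda x: x):
    """Ordinary least squares for y = a + b * transform(x).
    transform = identity gives the linear model (BCR);
    transform = math.log gives the logarithmic model (G2BWT)."""
    ts = [transform(x) for x in xs]
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    b = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
         / sum((t - mt) ** 2 for t in ts))
    return my - b * mt, b  # (intercept a, slope b)
```

Passing transform=math.log simply fits a straight line in the variable ln x, which is why the same closed-form solution serves both models.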
We introduced a method for building the eBWT that works in compressed space, and whose performance improves as the text becomes more massive and repetitive. Our experimental results showed that algorithms that work on top of grammars can be competitive in practice, and even more efficient as the input grows. Our research is now focused on improving the time of infBWT and extending its use to build other data structures, such as the LCP array.
References
[1] M. Bauer, A. Cox, and G. Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483:134–148, 2013.
[2] P. Bonizzoni, G. Della Vedova, Y. Pirola, M. Previtali, and R. Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 2020.
[3] C. Boucher, T. Gagie, A. Kuhnle, B. Langmead, G. Manzini, and T. Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14:article 13, 2019.
[4] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[5] F. Claude and G. Navarro. Improved grammar-based compressed indexes. In Proc. 19th SPIRE, pages 180–192, 2012.
[6] A. Cox, M. Bauer, T. Jakobi, and G. Rosone. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics, 28:1415–1419, 2012.
[7] A. Cox, T. Jakobi, G. Rosone, and O. Schulz-Trieglaff. Comparing DNA sequence collections by direct comparison of compressed text indexes. In Proc. 12th WABI, pages 214–224, 2012.
[8] D. Díaz-Domínguez and G. Navarro. A grammar compressor for collections of reads with applications to the construction of the BWT. In Proc. 31st DCC, 2021.
[9] D. Dolle, Z. Liu, M. Cotten, J. Simpson, Z. Iqbal, R. Durbin, S. McCarthy, and T. Keane. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Research, 27:300–309, 2017.
[10] L. Egidi, F. Louza, G. Manzini, and G. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14:article 6, 2019.
[11] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005.
[12] S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th SEA, pages 326–337, 2014.
[13] R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th SODA, pages 841–850, 2003.
[14] V. Guerrini and G. Rosone. Lightweight metagenomic classification via eBWT. In Proc. 6th AlCoB, pages 112–124, 2019.
[15] G. Jacobson. Space-efficient static trees and graphs. In Proc. 30th FOCS, pages 549–554, 1989.
[16] A. Kaye and W. Wasserman. The genome atlas: Navigating a new era of reference genomes. Trends in Genetics, 2021.
[17] B. Langmead, C. Trapnell, M. Pop, and S. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:article R25, 2009.
[18] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows–Wheeler Transform. Bioinformatics, 25:1754–1760, 2009.
[19] F. Louza, G. P. Telles, S. Gog, N. Prezza, and G. Rosone. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms for Molecular Biology, 15:1–5, 2020.
[20] V. Mäkinen, G. Navarro, J. Sirén, and N. Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17:281–308, 2010.
[21] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22:935–948, 1993.
[22] S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387:298–312, 2007.
[23] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39:article 2, 2007.
[24] G. Nong, S. Zhang, and W. H. Chan. Linear suffix array construction by almost pure induced-sorting. In Proc. 19th DCC, pages 193–202, 2009.
[25] N. Prezza, N. Pisanti, M. Sciortino, and G. Rosone. SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology, 14:article 3, 2019.
[26] E. Schwartz and B. Kallick. Generating a canonical prefix encoding. Communications of the ACM, 7:166–169, 1964.
[27] Z. Stephens, S. Lee, F. Faghri, R. Campbell, C. Zhai, M. Efron, R. Iyer, M. Schatz, S. Sinha, and G. Robinson. Big data: astronomical or genomical? PLoS Biology, 13(7):e1002195, 2015.
Once we infer B^{i−1}, we produce an RLFM-index from it. This task is not difficult, as the resulting B^{i−1} is actually halfway to being run-length compressed. The only drawback of this representation is that manipulating B^{i−1} can be slow. The RLFM-index usually represents the BWT using a wavelet tree [13]. In our case, this feature implies that accessing a symbol in B^i and performing rank has a slowdown factor of O(log r). This can be too slow for our purposes. In practice, we use a bit-compressed array to represent B^{i−1} instead of a wavelet tree. We also include an integer vector K of the same size as B^{i−1}. In every K[j] we store the rank of symbol B^{i−1}[j] up to position j. Thus, when we need to perform LF over B^{i−1}[j], we replace the rank part in the equation with K[j]. Notice it is not necessary to fully load K into main memory, as its access pattern is linear. We load it in chunks as we scan B^{i−1} during iteration i − 1.

B Improving the algorithm even further
The most expensive aspect of our algorithm is the decompression of phrases: first when we scan B^i, and then when we sort the suffixes in U. Every time we require a string, we extract it on demand directly from the grammar and then discard its sequence to avoid using extra space. This process is expensive in practice, as accessing the grammar several times makes extensive use of rank and select operations. The run-length decompression of the symbols of B^i, along with the parallel sorting of U, tries to deal with this overhead, but we can do better.

Consider, for instance, three eBWT runs X, Y and X placed consecutively in some range of B^i. In the for loop of nextBWT (lines 3-32 of Algorithm 3), both runs of X are decompressed independently, as we are unaware that they are close to each other. We can solve this problem by maintaining a small hash table that records the information of the last visited symbols of B^i. Thus, when we reach a new eBWT run, we first check if its symbol is in the hash table. If so, then we extract the information of its suffixes from the hash table and push it into Q and Q′. If not, then we level-decompress the run symbol from scratch and store its suffix information in the hash table to avoid extra work in the future. We can limit the size of the hash table so it always fits in one of the L1-L3 caches, and when we exceed this limit, we simply reset the hash table.

Notice that the copies of the suffixes of the two runs of X will be placed consecutively in B^{i−1}. We can exploit that fact by not adding the suffix information of the second run of X to Q and Q′. Instead, we increase the frequency of the recently added tuples of X by the length of the second run. In this way, we not only save computing time, but also working space.

Another way of improving the performance is during the sorting of U. When we sort the distinct buckets in parallel using quicksort