Sampling the suffix array with minimizers
Szymon Grabowski and Marcin Raniszewski
Lodz University of Technology, Institute of Applied Computer Science,
Al. Politechniki 11, 90-924 Łódź, Poland
{sgrabow|mranisz}@kis.p.lodz.pl

Abstract.
Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches compared to the sparse suffix array, for long enough patterns. A drawback of their approach is the requirement that sought patterns need to contain at least one character from the chosen subalphabet. In this work we propose an alternative suffix sampling approach with only a minimum pattern length as a requirement, which seems more convenient in practice. Experiments show that our algorithm achieves competitive time-space tradeoffs on most standard benchmark data.
Full-text indexes built over a text of length n can roughly be divided into two categories: those requiring at least n log n bits and the more compact ones. Classical representatives of the first group are the suffix tree and the suffix array. Succinct solutions, often employing the Burrows-Wheeler transform and other ingenious mechanisms (compressed rank/select data structures, wavelet trees, etc.), are an object of vivid interest in theoretical computer science [19], but their practical performance does not quite deliver; in particular, the locate query is significantly slower than using e.g. the suffix array [21,9,11].

A very simple, yet rather practical alternative to both compressed indexes and the standard suffix array is the sparse suffix array (SpaSA) [15]. This data structure stores only the suffixes at regular positions, namely those being a multiple of q (q > 1). The search must then comprise up to q subsearches, in which pattern prefixes of length 0 to q - 1 are cut off and the occurrences of the remaining pattern suffixes are verified in the text. For example, if P [1 . . .
6] is tomcat and q = 4, we need to search for tomcat, omcat, mcat and cat, and 3 of these 4 searches will be followed by verification. Obviously, the pattern length must be at least q, and this approach generally works better for longer patterns.

The sampled suffix array (SamSA) by Claude et al. [4] is an ingenious alternative to SpaSA. They choose a subset of the alphabet and build a sorted array over only those suffixes which start with a symbol from the chosen subalphabet. The search starts with finding the first (leftmost) sampled symbol of the pattern, let us say at position j, and then the pattern suffix P[j...m] is sought in the sampled suffix array with standard means. After that, each occurrence of the pattern suffix must be verified in the text against the preceding j - 1 pattern symbols.

Our purpose is to combine the benefits of the sparse suffix array (searching any patterns of length at least the sampling parameter q) and the sampled suffix array (one binary search). To this end, we need the following property: for each substring s of T, |s| = q, there exists its substring s', |s'| <= q, such that among the sampled suffixes there exists at least one which starts with s'. Moreover, s' is obtained from s deterministically, or in other words: for any two substrings of T, s1 and s2, if s1 = s2, then s1' = s2'.

This property is satisfied if a minimizer of s is taken as s'. The idea of minimizers was proposed by Roberts et al. in 2004 [22] and seemingly first overlooked in the bioinformatics (or string matching) community, only to be revived in recent years (cf., e.g., [18,16,3,24]). The minimizer for a sequence s of length r is the lexicographically smallest of all its (r - p + 1) p-grams (or p-mers, in the term commonly used in computational biology); usually it is assumed that p is much smaller than r. For a simple example, note that two DNA sequencing reads with a large overlap are likely to share the same minimizer, so they can be clustered together.
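To make the definition concrete, here is a minimal sketch of minimizer selection (the function name minimizer_pos and the naive scan, costing O(|s| * p) character comparisons, are ours, not from the paper; ties are broken in favor of the leftmost smallest p-gram, matching the convention used later in the construction):

```cpp
#include <cassert>
#include <string>

// Leftmost position (0-based) of the lexicographically smallest p-gram of s.
// Naive scan: O(|s| * p) character comparisons in the worst case.
size_t minimizer_pos(const std::string& s, size_t p) {
    assert(p > 0 && s.size() >= p);
    size_t best = 0;
    for (size_t i = 1; i + p <= s.size(); ++i)
        if (s.compare(i, p, s, best, p) < 0)  // strictly smaller => new minimizer
            best = i;                         // ties keep the leftmost candidate
    return best;
}
```

For s = tomcat and p = 2 the p-grams are to, om, mc, ca, at, so the minimizer is at, starting at (0-based) position 4.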
That is, the smallest p-mer may be the identifier of the bucket into which the read is then dispatched.

Coming back to our algorithm: in the construction phase, we pass a sliding window of length q over T and calculate the lexicographically smallest substring of length p in each window (i.e., its minimizer). Ties are resolved in favor of the leftmost of the smallest substrings. The positions of minimizers are start positions of the sampled suffixes, which are then lexicographically sorted, like for a standard suffix array. The values of q and p, p <= q, are construction-time parameters.

In the actual construction, we build a standard suffix array and in an extra pass over the sorted suffix indexes copy the sampled ones into a new array. This requires an extra bit array of size n for marking the sampled suffixes and in total may take O(n) time and O(n) words of space. In theory, we can use one of two randomized algorithms by I et al. [14], which sort n' = o(n) arbitrary suffixes of a text of length n either in O(n) time using O(n' log n') words of space (Monte Carlo algorithm), or in O(n log n') time using O(n') words of space (Las Vegas algorithm).

There is a small caveat: the minimizer at the new sampled position may be equal to the previous one, if only the window has been shifted beyond the position of its previous occurrence. An example illustrates this. We set q = 5, p = 1 and the text is T = Once upon a time. In the first window (
Once) the minimizer will be the blank space, and it does not change until the window " upon" (including it), but the next window ("upon ") also has a blank space as its minimizer, yet it is a new string, in a different position. Both blank spaces are thus prefixes of the suffixes to sample.

The search is simple: in the prefix P[1...q] of the pattern its minimizer is first found, at some position 1 <= j <= q - p + 1, and then we binary search for the pattern suffix P[j...m], verifying each tentative match against the truncated (j - 1)-symbol pattern prefix. Note that any window P[i...i+q-1] of the pattern, 1 <= i <= m - q + 1, could be chosen to find its minimizer and continue the search over the sampled suffix array, but no such window can result in a narrower range of suffixes to verify than the one obtained from the pattern prefix. This is because for any non-empty string s with occ_s occurrences in text T, we have occ_s >= occ_xs, where xs is the concatenation of a non-empty string x and the string s.

We call the described algorithm the sampled suffix array with minimizers (SamSAMi). There are two free parameters in SamSAMi, the window length q and the minimizer length p, p <= q. Naturally, the case of p = q is trivial (all suffixes sampled, i.e., the standard suffix array obtained). For a settled p, choosing a larger q has a major benefit: the expected number of selected suffixes diminishes, which reduces the space for the structure. On the other hand, it has two disadvantages: q is also the minimum pattern length, which excludes searches for shorter patterns, and for a given pattern length m >= q the average length of the sought suffix P[j...m] decreases, which implies more occurrence verifications.

For a settled q, the optimal choice of the minimizer length p is not easy; too small a value (e.g., 1) may result in frequent changes of the minimizer, especially for a small alphabet, but on the other hand too large a value has the same effect, since a minimizer can remain unchanged over at most q - p + 1 successive windows.
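The construction and search just described can be sketched end to end as follows. This is a simplified illustration, not the paper's implementation: the function names are ours, the construction sorts the sampled suffixes directly instead of filtering a full suffix array, and 0-based indexing is used.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Leftmost position of the lexicographically smallest p-gram of s.
static size_t minpos(const std::string& s, size_t p) {
    size_t best = 0;
    for (size_t i = 1; i + p <= s.size(); ++i)
        if (s.compare(i, p, s, best, p) < 0) best = i;
    return best;
}

// Construction: slide a q-long window over T, sample each window's minimizer
// position (the position sequence is non-decreasing, so duplicates are
// adjacent), then sort the sampled suffixes lexicographically.
std::vector<size_t> build_samsami(const std::string& T, size_t q, size_t p) {
    std::vector<size_t> sa;
    for (size_t i = 0; i + q <= T.size(); ++i) {
        size_t s = i + minpos(T.substr(i, q), p);
        if (sa.empty() || sa.back() != s) sa.push_back(s);
    }
    std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
        return T.compare(a, T.size() - a, T, b, T.size() - b) < 0;
    });
    return sa;
}

// Search: find the minimizer of P[0..q-1] at position j, binary search the
// pattern suffix P[j..m-1] among the sampled suffixes, then verify the j
// truncated pattern symbols preceding each tentative match.
std::vector<size_t> search_samsami(const std::string& T, const std::vector<size_t>& sa,
                                   const std::string& P, size_t q, size_t p) {
    assert(P.size() >= q);                 // minimum pattern length requirement
    size_t j = minpos(P.substr(0, q), p);
    const std::string suf = P.substr(j);
    auto it = std::lower_bound(sa.begin(), sa.end(), suf,
        [&](size_t a, const std::string& s) { return T.compare(a, s.size(), s) < 0; });
    std::vector<size_t> occ;
    for (; it != sa.end() && T.compare(*it, suf.size(), suf) == 0; ++it)
        if (*it >= j && T.compare(*it - j, j, P, 0, j) == 0)
            occ.push_back(*it - j);        // occurrence of the whole pattern
    std::sort(occ.begin(), occ.end());
    return occ;
}
```

On T = "Once upon a time" with q = 5, p = 1, the sampled positions are the three blank spaces (cf. the example above), and searching for "upon a" reports the single occurrence at (0-based) position 5.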
Yet, the pattern suffix to be sought has in the worst case exactly p symbols, which may be a suggestion that p should not be very small.

For some texts and large values of q, the number of verifications on the pattern prefix symbols tends to be large. Worse, each such verification requires a lookup to the text, with a likely cache miss. We propose a simple idea reducing the references to the text. To this end, we add an extra 4 bits to each SamSAMi offset. They store the distance to the previous sampled minimizer in the text. In other words, the list of distances corresponds to the differences between successive SamSAMi offsets in text order. For the first sampled minimizer in the text, and any case where the difference exceeds 15 (i.e., could not fit in 4 bits), we use the value 0. To give an example, if the sampled text positions are 3, 10, 12, 15, 20, then the list of differences is 0, 7, 2, 3, 5. In our application the extra 4 bits are kept in the 4 most significant bits of the offset, which restricts the 32-bit offset usage to texts up to 256 MB.

In the search phase, we start with finding the minimizer for P[1...q], at some position 1 <= l <= q - p + 1, and for each corresponding suffix from the index we read the distance to the previous minimizer in the text. If its position is aligned in front of the pattern, or the read 4 bits hold the value 0, we cannot draw any conclusion and follow with a standard verification. If however the previous minimizer falls into the area of the (aligned) pattern, in some cases we can conclude that the previous l - 1 pattern symbols cannot all match, and the verification is settled without accessing the text. Let us give an example: P = ctgccact, q = 5, p = 2. The minimizer in the q-long prefix of P is cc, starting at position P[4]. Assume that P is aligned with a match in T. If we shift the text window left by 1 symbol and consider its minimizer, it may be either ct (corresponding to P[1...2]) or ?c, where ? is an unknown symbol aligned just before the pattern and c is aligned with P[1], if ?c happens to be lexicographically smaller than ct.
If, however, the distance written on the 4 bits associated with the suffix ccact... of T is 1 or 2, we know that we have a mismatch on the pattern prefix and the verification is over (without looking up the text), since neither gc nor tg can be the previous minimizer. Finally, if the read value is 0, or 3 or more, we cannot make use of this information and invoke a standard verification.

In [12] we showed how to augment the standard suffix array with a hash table, to start the binary search from a much narrower interval. The start and end positions in the suffix array for each range of suffixes having a common prefix of length k were inserted into the hash table, where the key for which the hash function was calculated was the prefix string. The same function was applied to the pattern's prefix, and after a HT lookup the binary search was continued with a reduced number of steps. The mechanism requires m >= k. To estimate the space needed by the extra table, the reader is advised to look at Table 1, presenting the number of distinct q-grams in five 200 MB datasets from the popular Pizza & Chili text corpus. For example, for the text english the number of distinct 8-grams is 20,782,043, which is about 10% of the text length. This needs to be multiplied by 16 in our implementation (open addressing with linear probing, 50% load factor and two 4-byte integers per entry), which results in about 1.6n bytes of overhead.

Table 1.
The number of distinct q-grams (1...8) in the datasets (columns: q, dna, english, proteins, sources, xml). Each dataset is of length 209,715,200 bytes.

We can adapt this idea to SamSAMi. Again, the hashed keys will be k-long prefixes, yet now each of the sampled suffixes starts with some minimizer (or its prefix). We can thus expect a smaller overhead. Its exact value for a particular dataset depends on three parameters, k, q and p. Note however that now the pattern length m must be at least max(q - p + k, q).

All SA-like indexes refer to the text, so to reduce the overall space we can compress it. It is possible to apply a standard solution, like Huffman or Hu-Tucker [13] coding (where the idea of the latter is to preserve the lexicographical order of the code and thus enable direct string comparisons between the compressed pattern and the compressed text), but in SamSAMi it is more convenient to compress the text with the aid of minimizers. More precisely, we partition T[1...n] into phrases: T[1...j_1], T[j_1+1...j_2], ..., T[j_{n'-1}+1...n], n' <= n, where each T[j_i + 1] location is a start position of a new minimizer, considering all q-long text windows moved from the beginning to the end of the text, for the chosen parameters q and p. Note that n'/n is the compression ratio (between 0 and 1) of the suffix array sampling. The resulting sequence of phrases T'[1...n'] is then compressed with a byte code [2]. The dictionary of phrases D has to be stored too. We note that q shouldn't be too large in this variant, otherwise the phrases will tend to have a single occurrence and the dictionary representation will dominate the space use.

In this variant we assume that m >= q - p + 1. Searching for the pattern proceeds as follows. First the minimizer in P[1...q] is found, at some position 1 <= j_1 <= q - p + 1. Then the minimizer in P[j_1+1...j_1+q] is found, at some position j_1 + 1 <= j_2 <= j_1 + q - p + 1. This means that the pattern comprises the phrase P[j_1...j_2 - 1], which is translated to its codeword from D. If P[j_2...m] comprises k extra phrases, k >= 1, then all of them are also translated to their codewords from D. The resulting concatenation of codewords for k + 1 phrases, spanning P[j_1...j_{k+1} - 1] in the pattern, is the artificial pattern P' to be binary searched in the suffix array with n' sampled suffixes. Still, all the suffixes in the range starting with the encoding of P' have to be verified, both with the pattern prefix (of length j_1 - 1) and the pattern suffix (of length m - j_{k+1} + 1). Each candidate occurrence is verified by decoding its preceding phrase in the text and performing a comparison on the prefix, and decoding its following phrase in the text with an analogous comparison. We note that the same text encoding can be used for online pattern search (cf. [10]).

We have implemented three variants of the SamSAMi index: the basic one (denoted as SamSAMi on the plots), the one with reduced verifications (SamSAMi2) and the basic one augmented with a hash table (SamSAMi-hash). We compared them against the sparse suffix array (SpaSA), in our implementation, and the sampled suffix array (SamSA) [4], using the code provided by its authors. Compression of the text (Sect. 2.5) has not been implemented.

All experiments were run on a computer with an Intel i7-4930K 3.4 GHz CPU, equipped with 64 GB of DDR3 RAM and running Ubuntu 14.04 LTS 64-bit. All codes were written in C++ and compiled with g++ 4.8.2 with the -O3 option.

We start with finding the fraction of sampled suffixes for multiple (q, p) parameter pairs and the five 50 MB Pizza & Chili datasets. Table 2 presents the results. Pattern searches were run for m in {10, 20, 50, 100}, and for each dataset and pattern length 500,000 patterns randomly extracted from the text were used. Figs 1-4 present average search times with respect to the varying parameters. For SpaSA we changed its parameter k from 1 (which corresponds to the plain suffix array) to 8. For SamSAMi we varied q, setting the most appropriate p (up to 3 or 4) to obtain the smallest index, according to the statistics from Table 2. Obviously, q was limited by the pattern length: up to 10 for m = 10, up to 16 for m = 20, and up to 40 for m = 50.

We note that SamSAMi is rather competitive against the sparse suffix array, with two exceptions: short patterns (m = 10) and the XML dataset (for m = 10 and m = 20). In most cases, SamSAMi is also competitive against the sampled suffix array, especially when aggressive suffix sampling is applied. (For an honest

Table 2.
The percentage of suffixes that are sampled using the idea of minimizers, with the parameters q and p (columns: p, dna, english, proteins, sources, xml).

[Fig. 1 plots: time (us/query) for dna50, english50, proteins50, sources50, xml50, m = 10; series: SamSA, SamSAMi, SamSAMi2, SpaSA]
Fig. 1.
Pattern search time (count query). All times are averages over 500K random patterns of length 10. The patterns were extracted from the respective texts. Times are given in microseconds. The index space is a multiple of the text size, including the text.

[Fig. 2 plots: time (us/query) for dna50, english50, proteins50, sources50, xml50, m = 20; series: SamSA, SamSAMi, SamSAMi2, SamSAMi-hash, SpaSA]
Fig. 2.
Pattern search time (count query). All times are averages over 500K random patterns of length 20. The patterns were extracted from the respective texts. Times are given in microseconds. The index space is a multiple of the text size, including the text.

[Fig. 3 plots: time (us/query) for dna50, english50, proteins50, sources50, xml50, m = 50; series: SamSA, SamSAMi, SamSAMi2, SamSAMi-hash, SpaSA]
Fig. 3.
Pattern search time (count query). All times are averages over 500K random patterns of length 50. The patterns were extracted from the respective texts. Times are given in microseconds. The index space is a multiple of the text size, including the text.

[Fig. 4 plots: time (us/query) for dna50, english50, proteins50, sources50, xml50, m = 100; series: SamSA, SamSAMi, SamSAMi2, SamSAMi-hash, SpaSA]
Fig. 4.
Pattern search time (count query). All times are averages over 500K random patterns of length 100. The patterns were extracted from the respective texts. Times are given in microseconds. The index space is a multiple of the text size, including the text.

comparison one should also notice that our implementation uses 32-bit suffix indexes, while the Claude et al. scheme was tested with ceil(log n) bits per index, which is 26 bits for the used datasets.)

Unfortunately, the variant with reduced verifications (SamSAMi2) is not significantly faster than the original one; only in rare cases, with a large value of q, can the search time be approximately halved. SamSAMi-hash, on the other hand, can be an attractive alternative, similarly as SA-hash used as a replacement of the plain SA [12].

We presented a simple suffix sampling scheme making it possible to search for patterns effectively. The resulting data structure, called a sampled suffix array with minimizers (SamSAMi), achieves interesting time-space tradeoffs; for example, on the english50 dataset the search for patterns of length 50 is still about 10% faster than with a plain suffix array when only 5.3% of the suffixes are retained.

Apart from extra experiments, several aspects of our ideas require further research. We mentioned that a theoretical solution for building our sampled suffix array in small space can be applied, but it is an interesting question if we can make use of our parsing properties to obtain O(n) time and O(n') space in the worst case. Such complexities are possible for the suffix array on n' words, as shown by Ferragina and Fischer [8], and their idea can easily be used for the sampled SA by Claude et al. [4], as noted in the cited work.

How to find minimizers efficiently, both in a static sequence (i.e., a pattern prefix) and in a sliding window, is also of some interest. Naive implementations result in O(pq) and O(npq) times, respectively, but with a heap the latter can be reduced to O(np log q).
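As an aside, the classical monotone-queue (sliding-window minimum) technique, not discussed in the paper, also avoids the heap's log factor: every p-gram is pushed to and popped from a deque at most once, so the total cost is O(np). A sketch (the function name is ours):

```cpp
#include <deque>
#include <string>
#include <vector>

// Sampled minimizer positions over all q-long windows of T, via a monotone
// deque of candidate p-gram start positions kept in non-decreasing p-gram
// order. Each position enters and leaves the deque once: O(n*p) total work.
std::vector<size_t> minimizer_positions(const std::string& T, size_t q, size_t p) {
    std::vector<size_t> out;
    std::deque<size_t> cand;
    auto pless = [&](size_t a, size_t b) { return T.compare(a, p, T, b, p) < 0; };
    for (size_t i = 0; i + p <= T.size(); ++i) {
        // Strict comparison keeps the leftmost of equal p-grams at the front.
        while (!cand.empty() && pless(i, cand.back())) cand.pop_back();
        cand.push_back(i);
        if (i + p >= q) {                    // a full window ends at i + p - 1
            size_t w = i + p - q;            // its start position
            while (cand.front() < w) cand.pop_front();
            if (out.empty() || out.back() != cand.front())
                out.push_back(cand.front()); // new sampled position
        }
    }
    return out;
}
```

For T = Once upon a time with q = 5, p = 1 this returns the (0-based) positions of the three blank spaces, matching the sampling example given earlier.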
One solution to get rid of the factor q can be to use the Rabin-Karp rolling hash [5, Sect. 32.2] over the substrings of length p and find the minimum hash value rather than the lexicographically smallest substring. Also, a heap may be replaced with a trie storing the p-grams. Assuming constant-time parent-child navigation over the trie (i.e., also a small enough alphabet), we update the trie for one shift of the window in O(p) time (as one p-gram is removed, one p-gram is added, and the minimizer is the leftmost string in the trie), which results in O(np) overall time. For finding the minimizer in the pattern prefix we may use a rather theoretical option involving linear-time suffix sorting. To this end, we first remap the pattern prefix alphabet to (at most) {0, 1, ..., q - 1}, in O(q log q) time, using a balanced BST. Next we sort the pattern prefix suffixes (using, e.g., the algorithm from [20]) and finally scan over the sorted list of suffixes until the first suffix of length at least p is found. The total time is O(q log q).

Acknowledgements

We thank Kimmo Fredriksson for helpful comments and Francisco Claude for sharing with us the sampled suffix array sources. The work was supported by the Polish Ministry of Science and Higher Education under the project DEC-2013/09/B/ST6/03117 (both authors).
References
1. S. Alstrup, G. S. Brodal, and T. Rauhe. Pattern matching in dynamic texts. In Proceedings of the 11th Annual Symposium on Discrete Algorithms (SODA), pages 819-828. Society for Industrial and Applied Mathematics, 2000.
2. N. Brisaboa, A. Fariña, G. Navarro, and J. Paramá. Lightweight natural language text compression. Information Retrieval, 10:1-33, 2007.
3. R. Chikhi, A. Limasset, S. Jackman, J. Simpson, and P. Medvedev. On the representation of de Bruijn graphs. arXiv preprint arXiv:1401.5383, 2014.
4. F. Claude, G. Navarro, H. Peltola, L. Salmela, and J. Tarhio. String matching with alphabet sampling. Journal of Discrete Algorithms (JDA), 11:37-50, 2012.
5. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
6. P. Crescenzi, A. D. Lungo, R. Grossi, E. Lodi, L. Pagli, and G. Rossi. Text sparsification via local maxima. In Proceedings of the 20th Conference on the Foundations of Software Technology and Theoretical Computer Science (FSTTCS), volume 1974 of LNCS, pages 290-301. Springer, 2000.
7. P. Crescenzi, A. D. Lungo, R. Grossi, E. Lodi, L. Pagli, and G. Rossi. Text sparsification via local maxima. Theoretical Computer Science, 304(1-3):341-364, 2003.
8. P. Ferragina and J. Fischer. Suffix arrays on words. In CPM, volume 4580 of LNCS, pages 328-339. Springer-Verlag, 2007.
9. P. Ferragina, R. González, G. Navarro, and R. Venturini. Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics (JEA), 13:article 12, 2009. 30 pages.
10. K. Fredriksson and S. Grabowski. A general compression algorithm that supports fast searching. Information Processing Letters, 100(6):226-232, 2006.
11. S. Gog and M. Petri. Optimized succinct data structures for massive data. Software-Practice and Experience, 2013. DOI: 10.1002/spe.2198.
12. S. Grabowski and M. Raniszewski. Two simple full-text indexes based on the suffix array. CoRR, abs/1405.5919, 2014.
13. T. C. Hu and A. C. Tucker. Optimal computer search trees and variable-length alphabetical codes. SIAM Journal on Applied Mathematics, 21(4):514-532, 1971.
14. T. I, J. Kärkkäinen, and D. Kempa. Faster sparse suffix sorting. In STACS, volume 25 of LIPIcs, pages 386-396. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
15. J. Kärkkäinen and E. Ukkonen. Sparse suffix trees. In COCOON, volume 1090 of LNCS, pages 219-230, 1996.
16. Y. Li, P. Kamousi, F. Han, S. Yang, X. Yan, and S. Suri. Memory efficient minimum substring partitioning. In Proceedings of the 39th International Conference on Very Large Data Bases, pages 169-180. VLDB Endowment, 2013.
17. K. Mehlhorn, R. Sundar, and C. Uhrig. Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica, 17(2):183-198, 1997.
18. N. S. Movahedi, E. Forouzmand, and H. Chitsaz. De novo co-assembly of bacterial genomes from multiple single cells. In BIBM, pages 1-5, 2012.
19. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.
20. G. Nong, S. Zhang, and W. H. Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471-1484, 2011.
21. S. J. Puglisi, W. F. Smyth, and A. Turpin. Inverted files versus suffix arrays for locating patterns in primary memory. In SPIRE, volume 4209 of LNCS, pages 122-133, 2006.
22. M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, and J. A. Yorke. Reducing storage requirements for biological sequence comparison. Bioinformatics, 20(18):3363-3369, 2004.
23. S. C. Sahinalp and U. Vishkin. Symmetry breaking for suffix tree construction. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC), pages 300-309. ACM, 1994.
24. D. E. Wood and S. L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3):R46, 2014.