Theor. Comput. Sci. | 2021

Computing the multi-string BWT and LCP array in external memory


Abstract


Indexing very large collections of strings, such as those produced by widespread next-generation sequencing technologies, relies heavily on multi-string generalizations of the Burrows-Wheeler Transform (BWT). The large memory requirements of in-memory approaches have stimulated recent developments in external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental in computing the suffix-prefix overlaps among strings, an essential step in many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP array, in which partial BWTs are merged with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method that simultaneously builds the BWT and the LCP array of a collection of m strings of different lengths. On a set of strings of constant length k, the algorithm has O(mkl) time and I/O volume and uses O(k + m) main memory, where l is the maximum value in the LCP array.
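
To make the two objects named in the abstract concrete, here is a minimal Python sketch that builds the multi-string BWT and the LCP array of a tiny collection by explicitly sorting all suffixes in memory. It is not the external memory algorithm of the paper; it is a naive reference construction that only illustrates what is being computed. The function name `multi_string_bwt_lcp` is hypothetical, and the sentinel convention is an assumption: each string gets its own end marker, markers are ordered by string index, sort before every letter, and are all printed as `$`.

```python
def multi_string_bwt_lcp(strings):
    """Naive in-memory construction by explicit suffix sorting (illustration only)."""
    suffixes = []
    for i, s in enumerate(strings):
        # Assumed convention: letters are encoded as (1, ch) and the sentinel of
        # string i as (0, i), so sentinels are pairwise distinct and sort first.
        keys = [(1, ch) for ch in s] + [(0, i)]
        for j in range(len(keys)):
            # Symbol preceding the suffix; the whole string is preceded by its sentinel.
            preceding = s[j - 1] if j > 0 else "$"
            suffixes.append((tuple(keys[j:]), preceding))

    # Lexicographic order of all suffixes of all strings.
    suffixes.sort(key=lambda item: item[0])

    # BWT: preceding symbols read in suffix order.
    bwt = "".join(preceding for _, preceding in suffixes)

    # LCP[i]: length of the common prefix of suffix i and suffix i - 1; LCP[0] = 0.
    lcp = [0]
    for (prev, _), (cur, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(prev), len(cur)) and prev[n] == cur[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


bwt, lcp = multi_string_bwt_lcp(["ab", "b"])
print(bwt)       # bb$a$
print(lcp)       # [0, 0, 0, 0, 1]
print(max(lcp))  # 1, the quantity called l in the complexity bound
```

This sketch materializes every suffix, so it is only suitable for toy inputs; avoiding that cost is what external memory methods such as the one described above are for.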

Volume 862
Pages 42-58
DOI 10.1016/j.tcs.2020.11.041
Language English
Journal Theor. Comput. Sci.
