Self-Index based on LZ77 (thesis)
UNIVERSITY OF CHILE
FACULTY OF PHYSICS AND MATHEMATICS
DEPARTMENT OF COMPUTER SCIENCE

SELF-INDEX BASED ON LZ77

SUBMITTED TO THE UNIVERSITY OF CHILE IN FULFILLMENT OF THE THESIS REQUIREMENT TO OBTAIN THE DEGREE OF MSC. IN COMPUTER SCIENCE

SEBASTIAN KREFT

ADVISOR: GONZALO NAVARRO
COMMITTEE: DIEGO ARROYUELO, JÉRÉMY BARBAY, NIEVES BRISABOA

This work was partially funded by Conicyt's Master Scholarship and by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB).

SANTIAGO - CHILE, AUGUST 2010

Abstract
Domains like bioinformatics, version control systems, collaborative editing systems (wikis), and others are producing huge text collections that are very repetitive; that is, there are few differences between the elements of the collection. This makes the compressibility of the collection extremely high. For example, a collection with all the versions of a Wikipedia article can be compressed to 0.1% of its original space using the Lempel-Ziv 1977 (LZ77) compression scheme.

Many of these repetitive collections comprise huge amounts of text data. For that reason, we require a method to store them efficiently while providing the ability to operate on them. The most common operations are the extraction of random portions of the collection and the search for all the occurrences of a given pattern inside the whole collection.

A self-index is a data structure that stores a text in compressed form and efficiently finds the occurrences of a pattern. Moreover, self-indexes can extract any substring of the collection, hence they are able to replace the original text. One of the main goals when using these indexes is to store them within main memory. This is very important, as the disk may be a million times slower than main memory.

Most current self-indexes are based on a compression scheme that predicts the next symbol based on the previous k symbols. However, this scheme is not well suited for repetitive texts, as it does not capture long-range repetitions. The LZ77 compression scheme does capture such repetitions, but it does not support random access to the text.

In this thesis we present a scheme for random text extraction from text compressed with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called LZ-End, that efficiently extracts text using space close to that of LZ77. LZ77 extracts around 1 million characters per second, while LZ-End extracts over 2 million.

The main contribution of this thesis is the first self-index based on LZ77/LZ-End and oriented to repetitive texts, which outperforms the state of the art (the RLCSA self-index) in many aspects. The compression of our indexes is better than that of the RLCSA, being two times better for DNA and for Wikipedia articles.
Our index is built using just 60% of the space required by the RLCSA and within 35% of the time. Searching for short patterns is faster than on the RLCSA, and for longer patterns the space/time trade-off is in favor of our indexes.

Finally, we present a corpus of repetitive texts coming from several application domains. We aim at providing a standard set of texts for research and experimentation, hence this corpus is publicly available.

Chapter 1: Introduction

In recent times we have seen a rise in the amount of digital information. This may be attributable to the drop of data acquisition and storage costs. Most of this information is text, that is, symbol sequences representing natural language, music, source code, time series, biological sequences like DNA and proteins, and others.

Although the examples presented above seem very different, there is an operation that arises in most applications handling those types of sequences. This operation is called text search, and consists in finding all positions in the text where a given pattern appears. It serves as a basis for building more complex and meaningful operations, like finding the most common words or finding approximate patterns.

Text search can be solved by two different approaches. The first scans the text sequentially, looking for matches of the pattern. Classical examples of this type of search are the Knuth-Morris-Pratt [KMP77] and Boyer-Moore [BM77] algorithms. The second approach queries an index of the text, a data structure built beforehand that allows us to find the occurrences of a given pattern without scanning the whole text.

To index the text we need enough space to store the index and, most importantly, efficient access to it. Nowadays, storage is not a difficult problem; efficient access, however, is.
In the last years the speed of hard drives has not experienced significant improvements. Hard-drive access times are around 10 ms = 10^7 ns, while main memory (RAM) access times are around 10 ns; in other words, accessing secondary storage is 1 million times slower than accessing main memory. This problem persists despite the appearance of solid state drives (SSD), which have access times around 0.1 ms = 10^5 ns, being 10 thousand times slower than main memory. For this reason, indexes using space proportional to the compressed text have been proposed, aiming at storing them in main memory and handling the data directly in compressed form, rather than decompressing it before use [ZdMNBY00, NM07]. There are some indexes that, within that compressed space, are able to replace the original text; these are called self-indexes and are obviously preferable, as one can discard the original text.

A particular kind of text not yet fully benefited by current self-indexes is repetitive text. Such texts arise from domains that handle huge collections of very similar entries or documents. For example, in a DNA collection of human genomes of different individuals, the similarity between any two DNA sequences would be close to 99.9% [B+]. Collaborative editing systems, like wikis, also generate very repetitive collections, because each revision is very similar to the previous one. The main problem is that existing self-indexes do not sufficiently exploit these repetitions, being orders of magnitude larger than the space achievable with a compression scheme that does exploit the repetitions, like LZ77 [ZL77]. LZ77 parses the text into phrases so that each phrase, except its last letter, appears previously in the text (these previous occurrences are called sources). It compresses by essentially replacing each phrase with a backward pointer.
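As a concrete illustration of this parsing, here is a naive quadratic-time sketch (not the thesis's construction algorithm); phrases are represented as hypothetical (source, length, explicit_char) triples, chosen only for illustration, and a source is allowed to overlap the phrase it produces:

```python
def lz77_parse(text):
    """Greedy LZ77 parsing: each phrase is the longest prefix of the
    remaining suffix that also starts at some earlier position, plus one
    explicit trailing character."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate source starting positions
            l = 0
            # the source copy may run into the phrase itself (self-reference)
            while i + l < n - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decompress(phrases):
    """Rebuild the text: copy `length` characters from `src`, then append
    the explicit character that closes the phrase."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])  # valid even for overlapping sources
        out.append(ch)
    return ''.join(out)
```

On a highly repetitive input such as 'ab' repeated fifty times, the parse collapses to three phrases, which is the effect on repetitive collections that the thesis exploits.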
A recent work aiming at adapting current self-indexes to handle large DNA databases of the same species [SVMN08] found that LZ77 compression was still much superior at capturing this repetitiveness, yet it was inadequate as a format for compressed storage because of its inability to retrieve individual sequences from the collection. Another work [CN09, CFMPN10] shows that grammar-based compression allows extraction of substrings while capturing such repetitions, yet LZ77 compression is superior to grammar compression [Ryt03, CLL+].

The main contributions of this thesis are the following:

Chapter 3: We create a public corpus of highly repetitive texts. The corpus is composed of texts coming from different real domains, like biology, source code repositories, and document repositories, as well as artificial texts having interesting combinatorial properties. This corpus is available at http://pizzachili.dcc.uchile.cl/repcorpus.html.

Chapter 4: The worst-case time to extract a substring of length m from an LZ77 parsing is O(mH), where H is the maximum number of times a character is transitively copied in the parsing. We present an alternative parsing, called LZ-End, that performs very close to LZ77 in terms of compression but permits faster text extraction, in O(m + H) worst-case time. This work was published in [KN10].

Chapter 5: We introduce a new self-index oriented to repetitive texts and based on the LZ77, LZ-End, and similar parsings. Let n' be the number of phrases of the parsing (for highly repetitive texts, n' will be small). This index uses in theory 2n' log n + n' log n' + n' log D + O(n' log σ) + o(n) bits of space, where σ is the size of the alphabet and D is upper-bounded by the maximum number of sources covering each other. It finds the occ occurrences of a pattern of length m in time O(m² H + m log n' + occ · D log n'). We present several practical variants that achieve better results, both in time and space, than the Run-Length Compressed Suffix Array (RLCSA) [SVMN08] and the Grammar-based Self-index [CN09, CFMPN10], the state-of-the-art self-indexes oriented to repetitive texts.
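To make the parameter H above concrete: assuming an LZ77-style parsing represented as hypothetical (source_position, copy_length, explicit_char) triples (an encoding chosen here only for illustration), the number of times each character is transitively copied can be computed in one pass, and H is the maximum over all positions. A minimal sketch:

```python
def copy_depths(phrases):
    """Copy depth of every text position under an LZ77-style parsing:
    an explicit character has depth 0, and a copied character has the
    depth of the character it was copied from, plus 1."""
    depth = []
    for src, length, _ in phrases:
        for k in range(length):
            depth.append(depth[src + k] + 1)  # transitively copied
        depth.append(0)  # the explicit character ending the phrase
    return depth

# Parsing of 'abababab$' into 'a', 'b', and a phrase copying 'ababab'
# from position 0 that ends with the explicit '$'.
phrases = [(0, 0, 'a'), (0, 0, 'b'), (0, 6, '$')]
H = max(copy_depths(phrases))  # extraction cost grows with this value
```

Here the last characters of the long phrase are copies of copies of copies, so H = 3 even though the parse has only three phrases.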
Chapter 2 describes basic concepts and related work relevant to this thesis.
Chapter 3 presents a corpus of repetitive texts.
Chapter 4 explains the Lempel-Ziv (LZ) parsing and some of its properties. Italso introduces a new LZ variant called LZ-End, able to extract an arbitrarysubstring in constant time per extracted symbol in some cases.
Chapter 5 presents a new self-index based on LZ77-like parsings. It covers thetheoretical proposal and the considerations we made when implementing theindex.
Chapter 6 shows the experimental results of our proposed index, comparing it with the state-of-the-art self-indexes for repetitive texts.
Chapter 7 presents our conclusions and gives some lines of research that can be further investigated.

Chapter 2: Basic Concepts
In this chapter we introduce the basic concepts and notation used throughout this thesis. Then we present the data structures used to build our index. Finally, we present two self-indexes oriented to repetitive texts. All logarithms in this thesis are in base 2, and we assume that 0 log 0 = 0.
Definition 2.1. A string T is a sequence of characters drawn from an alphabet Σ. The alphabet is an ordered and finite set of size |Σ| = σ. The i-th character of a string is represented as T[i]. The symbol ε represents the empty string, of length 0.

Definition 2.2.
Given a string T and positions i and j, the substring of T starting at i and ending at j is defined as T[i, j] = T[i] T[i+1] ... T[j]. If i > j, then T[i, j] = ε.

Definition 2.3.
Let T be a string of length n. The prefixes of T are the strings T[1, j], for all 1 ≤ j ≤ n, and its suffixes are the strings T[i, n], for all 1 ≤ i ≤ n + 1.

Definition 2.4.
Let T1, T2 be strings of length n1 and n2, respectively. We define the concatenation of these strings as T1 T2 = T1[1] ... T1[n1] T2[1] ... T2[n2].

Definition 2.5.
Given a string T of length n, the reverse of T is T^rev = T[n] T[n−1] ... T[2] T[1].

Definition 2.6.
The lexicographic order (<) between strings is defined as follows. Let a, b be characters in Σ and X, Y be strings over Σ. Then:

ε < X, for all X ≠ ε
aX < bY if a < b ∨ (a = b ∧ X < Y)

Definition 2.7.
Given a string T and a pattern P (a string of length m), both over an alphabet Σ, the occurrence positions of P in T are defined as O = { |X| : ∃ X, Y such that T = XPY }.

Definition 2.8.
Given a string T and a pattern P, the following search problems are of interest:

• exists(P, T) returns true iff P occurs in T, i.e., returns true iff O ≠ ∅.
• count(P, T) counts the number of occurrences of P in T, i.e., returns occ = |O|.
• locate(P, T) finds the occurrences of P in T, i.e., returns the set O in some order.
• extract(T, l, r) extracts the text substring T[l, r].

Remark 2.9.
Note that exists and count can be answered after performing a locate query.
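As a baseline, the four operations can be written as naive scans over a plain-text T; a self-index answers the same queries without storing T explicitly. In this sketch, positions are 0-based for locate (for brevity), while extract follows the 1-based T[l, r] convention above:

```python
def locate(P, T):
    """All starting positions (0-based) where pattern P occurs in T."""
    return [i for i in range(len(T) - len(P) + 1) if T[i:i + len(P)] == P]

def count(P, T):
    """Number of occurrences of P in T."""
    return len(locate(P, T))

def exists(P, T):
    """True iff P occurs in T."""
    return count(P, T) > 0

def extract(T, l, r):
    """The substring T[l, r], 1-based and inclusive."""
    return T[l - 1:r]
```

Note that, as the remark states, exists and count here are answered by performing a full locate; dedicated indexes can answer them faster.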
Definition 2.10.
Let T be a string of length n. The zero-th order empirical entropy is defined as

H0(T) = − Σ_{c ∈ Σ} (n_c / n) log(n_c / n)

where n_c is the number of times the character c appears in T; that is, n_c / n is the empirical probability of appearance of character c.

It is worth noticing that the zero-th order entropy is invariant under permutations of the text characters. The value n H0(T) is the least number of bits needed to represent T using a compressor that gives each character a fixed encoding.

Definition 2.11.
Let T be a string of length n. The k-th order empirical entropy [Man01] is defined as

Hk(T) = Σ_{S ∈ Σ^k} (|T_S| / n) H0(T_S)

where T_S is the sequence composed of all characters preceded by the string S in T.

The value n Hk(T) is the least number of bits needed to represent T using a compressor that encodes each character taking into account the k preceding characters in T. This value assumes the first k characters are encoded for free, thus it gives a relevant lower bound only when n ≫ k. Hk is a decreasing function of k, that is,

0 ≤ Hk(T) ≤ Hk−1(T) ≤ ... ≤ H1(T) ≤ H0(T) ≤ log σ.

The following lemma yields the ground to show that the empirical entropy Hk is not a good lower-bound measure for the compressibility of repetitive texts.

Lemma 2.12.
Let T be a string of length n. For any k ≤ n it holds Hk(TT) ≥ Hk(T).

Proof. New relevant contexts may have arisen in the concatenation TT, so we denote by C(T, k) the contexts of length k present in T, and by C(TT, k) those of TT. We have C(T, k) ⊆ C(TT, k), and the number of new contexts in TT is at most k. For each S ∈ C(T, k), we have (TT)_S = T_S A_S T_S, for some A_S such that |A_S| ≤ k. Then,

Hk(TT) = (1 / |TT|) Σ_{S ∈ C(TT,k)} |(TT)_S| H0((TT)_S)
       ≥ (1 / 2|T|) Σ_{S ∈ C(T,k)} |T_S A_S T_S| H0(T_S A_S T_S)
       ≥ (1 / 2|T|) Σ_{S ∈ C(T,k)} |T_S T_S| H0(T_S T_S)
       = (1 / 2|T|) Σ_{S ∈ C(T,k)} 2 |T_S| H0(T_S)
       = Hk(T).

In the first step we used C(T, k) ⊆ C(TT, k); in the second we used |T'| H0(T') ≤ |T' A| H0(T' A), for T' = T_S T_S and A = A_S (since |T_S A_S T_S| H0(T_S A_S T_S) = |T_S T_S A_S| H0(T_S T_S A_S), as H0 is invariant under permutations); and in the third we used H0(T_S T_S) = H0(T_S). The second property holds because

|T' A| H0(T' A) = Σ_{c ∈ Σ} (n'_c + n^A_c) log((n' + n^A) / (n'_c + n^A_c))
               ≥ Σ_{c ∈ Σ} n'_c log((n' + n^A) / (n'_c + n^A_c))
               ≥ Σ_{c ∈ Σ} n'_c log(n' / n'_c)
               = |T'| H0(T'),

where n^X_c is the number of occurrences of character c in string X, and n^X = |X|. The last step is justified by Gibbs' inequality [Ham86].

It follows that |TT| Hk(TT) ≥ 2 |T| Hk(T); that is, to encode TT this model uses at least twice the space used to encode T. An LZ77 encoding, instead, would need just one more phrase, as seen later.

Most data structures need to represent symbols and numbers. Classic data structures use a fixed amount of space to store them, for example 1 byte for characters and 4 bytes for integers. Instead, compressed data structures aim to use the minimum possible space, thus they represent symbols using variable-length prefix-free codes, or using a fixed amount of b bits where b is as small as possible. Table 2.1 shows different encodings for the integers 1, ..., 9, which we describe next.

Unary Codes
This representation is the simplest and serves as a basis for other coders. It represents a positive integer n as n − 1 one bits followed by a final zero bit, thus using exactly n bits.

Gamma Codes
It represents a positive integer n by concatenating the length of its binary representation, in unary, with the binary representation itself, omitting the most significant bit. The space used is 2⌊log n⌋ + 1 bits: ⌊log n⌋ + 1 for the unary length and ⌊log n⌋ for the binary representation.

Delta Codes
This is an extension of γ-codes that works better on larger numbers. It represents the length of the binary representation of n using γ-codes, and then n in binary without its most significant bit, thus using (2⌊log(⌊log n⌋ + 1)⌋ + 1) + ⌊log n⌋ bits.

Symbol  Unary Code   γ-Code    δ-Code     Binary (b=4)  Vbyte (b=2)
1       0            0         0          0001          001
2       10           100       1000       0010          010
3       110          101       1001       0011          011
4       1110         11000     10100      0100          001100
5       11110        11001     10101      0101          001101
6       111110       11010     10110      0110          001110
7       1111110      11011     10111      0111          001111
8       11111110     1110000   11000000   1000          010100
9       111111110    1110001   11000001   1001          010101

Table 2.1: Example of different coders

Vbyte Coding [WZ99]
It splits the ⌈log(n + 1)⌉ bits needed to represent n into blocks of b bits and stores each block in a chunk of b + 1 bits. The highest bit is 0 in the chunk holding the most significant bits of n, and 1 in the rest of the chunks. For clarity we write the chunks from most to least significant, just like the binary representation of n. For example, if n = 25 = 11001 and b = 3, then we need two chunks and the representation is 0011 1001. Compared to the ⌈log(n + 1)⌉ bits needed to represent n in binary, this code loses one bit per b bits of n, plus possibly an almost empty final chunk. Even when the best choice of b is used, the total space achieved is still worse than δ-encoding's performance. In exchange, Vbyte codes are very fast to decode.

In many cases we need to store a set of numbers using the least possible space, yet providing fast random access to each element. Variable-length codes complicate this task, as they require storing, in addition, pointers to sampled positions of the encoded sequence. A simple solution that shows good performance in practice is the so-called
Directly Addressable Codes (DAC) [BLN09], a variant of Vbytes [WZ99]. They start with a sequence C = C1, ..., Cn of n integers and compute the Vbyte encoding of each number. The least significant chunks are stored contiguously in an array A1, and the highest bits of those chunks are stored in a bitmap B1. The remaining chunks are organized in the same way in arrays Ai and bitmaps Bi, storing contiguously the i-th chunks of the numbers that have them. Note that arrays Ai store contiguously the bits (i−1)·b + 1, ..., i·b of each number, and bitmaps Bi store whether a number has further chunks or not; hence the name Reordered Vbytes.

Figure 2.1 shows an example of the resulting structure. The first element is represented with two chunks; thus A1[0] = C1,1, A2[0] = C1,2, B1[0] = 1 and B2[0] = 0.

Figure 2.1: Example of Directly Addressable Codes structure

To access the element at position i = i1, we check whether B1[i1] is set. If it is not set, this is the last chunk and we already have the value C[i] = A1[i1]; otherwise we have to fetch the following chunks. In that case, we recompute the position as i2 = rank1(B1, i1), where rank1(B1, i1) is the number of ones up to position i1 in bitmap B1 (see Section 2.5 for further details). If B2[i2] is not set, we are done with C[i] = A1[i1] + A2[i2] · 2^b; otherwise we set i3 = rank1(B2, i2) and continue in the following levels. Accessing a random element takes O(log(M)/b) worst-case time, where M = max Ci. However, the access time is lower for elements with shorter codewords, which are usually the most frequent ones. We will use the implementation of Susana Ladra (available by personal request) in this thesis.

Let B be a binary sequence (a bitmap) over Σ = {0, 1}, of length n, and assume it has m ones. We are interested in solving the following operations:
• rank_b(B, i): how many b's occur up to position i (included).
• select_b(B, i): the position of the i-th b bit.

Figure 2.2: Example of rank and select

Variant   Size                          Rank                         Select
Clark     n + o(n)                      O(1)                         O(1)
RRR       nH0(B) + o(n)                 O(1)                         O(1)
esp       nH0(B) + o(n)                 O(1)                         O(1)
recrank   m log(n/m) + m + o(n)         O(log(n/m))                  O(log(n/m))
vcode     m log(n/log n) + o(n)         O(log n)                     O(log n)
sdarray   m log(n/m) + 2m + o(m)        O(log(n/m) + log m / log n)  O(log m / log n)

Table 2.2: Complexities for binary rank and select

Example 2.13.
Figure 2.2 shows an example of the operations rank and select. We show the values of both rank1(B, 20) = 11 and rank0(B, 20) = 9. Note that these two values add up to 20, since the former returns the number of ones up to position 20, and the latter the number of zeroes. Also, access simply returns the bit stored at a given position; in our case, at position 20 there is a 1. Finally, we show the value of select1(B, 11) = 20, which was expected since access(B, 20) = 1. The value of select0(B, 9) is 19.

Several solutions have been proposed to address this problem. The first solution able to solve both kinds of queries in constant time uses n + O(n log log n / log n) bits of space [Cla96]. Raman, Raman and Rao's solution (RRR) [RRR02] achieves nH0(B) + O(n log log n / log n) bits and answers the queries in constant time. Okanohara and Sadakane [OS07] proposed several alternatives tailored to the case of small m (sparse bitmaps): esp, recrank, vcode, and sdarray. Table 2.2 shows the time and space complexities of these solutions. Note that the reported spaces include the representation of the bitmap.

The extra o(n) space of the theoretical solutions [Cla96] is large in practice. González et al. [GGMN05] proposed a solution with good results in practice and small space overhead (up to 5%). This implementation is very simple, yet its practical performance is better than that of classical solutions. They store the plain bitmap in an array B and keep a table Rs where they store rank1(B, i · s), for s = 32k, where k is a parameter for the frequency of the sampling of the bit vector. They use a function called popcount that counts the number of set bits in a word (4 bytes). This operation can be solved bit by bit, but it is easy to improve it, using either bit parallelism or precomputed tables, thus requiring just a few operations. They solve the operations as follows (rank0 and select0 are obvious variations):

• rank1(B, i): They start at the last entry of Rs that precedes i (Rs[⌊i/s⌋]), and then sequentially scan the array B, popcounting consecutive words, until reaching the desired position. The popcounting of the last word is done by first setting all bits after position i to zero, which is done in constant time using a mask. Thus the time is O(k).

• select1(B, i): They first binary search the Rs table for the last position p where Rs[p] ≤ i.
Then they scan B sequentially using popcount, looking for the word containing the desired select position. Finally they find the desired position inside the word by scanning it bit by bit. Thus the time is O(k + log(n/k)).

We will use the implementation of Rodrigo González (available at http://code.google.com/p/libcds) in this thesis.

When the bitmap is very sparse (i.e., the number of ones in the bitmap is very low) one practical solution is to δ-encode the distances between consecutive ones. Additionally we need to store absolute sample values select(B, i·s) for a sampling step s, plus pointers to the corresponding positions in the δ-encoded sequence. We solve the operations as follows:

• select(B, i) is solved within O(s) time by going to the last sampled position preceding i and decoding the δ-encoded sequence from there.

• rank(B, i) is solved in time O(s + log(m/s)). First, we binary search the samples looking for the last sampled position ℓ such that select(B, ℓ·s) ≤ i. Starting from that position we sequentially decode the bitmap and stop as soon as select(B, p) ≥ i.

• access(B, i) is solved in time O(s + log(m/s)) in a way similar to rank.

The space needed by the structure is W + m/s(⌊log n⌋ + 1 + ⌊log W⌋ + 1) bits, where W is the number of bits needed to represent all the δ-codes. In the worst case W = 2m⌊log(⌊log(n/m)⌋ + 1)⌋ + m⌊log(n/m)⌋ + m = m log(n/m) + O(m log log(n/m)).

This structure allows a space-time trade-off related to s and also has the property that several operations cost O(1) after solving others. For example, select(B, p) and select(B, p + 1) cost O(1) after solving p ← rank(B, i).

2.6 Wavelet Trees

A wavelet tree [GGV03] is an elegant data structure that stores a sequence S of n symbols from an alphabet Σ of size σ.
This structure supports some basic queries and is easily extensible to support others.

We split the alphabet into two halves L and R, so that the elements of L are lexicographically smaller than those of R. Then, we create a bitmap B of size n, setting B[i] = 0 if the symbol at position i belongs to L and B[i] = 1 otherwise. This bitmap is stored at the root of the tree. Afterward, we extract from S all symbols belonging to L, generating sequence S_L, and all symbols belonging to R, generating sequence S_R (these sequences are not stored). Finally, we recursively generate the left subtree on S_L and the right subtree on S_R. We continue until we get a sequence over a one-letter alphabet. Figure 2.3 shows the wavelet tree for the example text alabar a la alabarda. Only the bitmaps (black color) are stored in the tree. The labels of the tree (gray color) show the subsets L and R, and the strings over the bitmaps (gray color) show the conceptual subsequences S_L and S_R.

The resulting tree has σ leaves, height ⌈log σ⌉, and n bits per level. Thus the space occupancy is n log σ bits, plus o(n log σ) (more precisely, O(n log σ log log n / log n)) additional bits to support rank and select queries on the bitmaps.

In the following we explain how this structure supports the operations access, rank and select on S. The last two operations are just a generalization to larger alphabets of those defined in Section 2.5.

Figure 2.3: Example of a wavelet tree for the text alabar a la alabarda

• Access:
To retrieve the symbol S[i] we look at B[i] at the root. If it is a 0 we go to the left subtree, otherwise to the right subtree. The new position is i ← rank_0(B, i) if we go to the left and i ← rank_1(B, i) if we go to the right. This procedure continues recursively until we reach a leaf. The bits read in the path from the root to the leaf represent the symbol sought.

• Rank:
To count how many c's occur up to position i we go to the left if c is in L and to the right otherwise. The new position is i ← rank_0(B, i) if we go to the left and i ← rank_1(B, i) if we go to the right, where B is the bitmap of the root. When we reach a leaf the answer is i.

• Select:
To find the i-th symbol c we first go to the leaf corresponding to c and then go upwards to the root. Let B be the bitmap of the parent. If the current node is a left child then the position at the parent is i ← select_0(B, i), otherwise it is i ← select_1(B, i). When we reach the root the answer is the current value of i.

The running time of these operations is O(log σ), since we use bitmaps supporting constant-time rank, select and access.

Example 2.14.
Figure 2.4 shows an example of how we retrieve the 11th symbol of sequence S (access(S, 11) = a). First we access the bitmap of the root and see that at position 11 there is a 0. Hence we descend to the left. Then, using rank_0(B, 11) = 8, we count how many zeroes occur up to position 11. This value is our new position in the next level. We continue the process until we reach a leaf; the symbol stored at that leaf is the symbol sought, in our case an 'a'.

Figure 2.4: Example of access in a wavelet tree

Example 2.15.
Figure 2.5 shows step by step how we compute rank_l(S, 11) = 2. Since symbol 'l' is mapped to a 1, we descend from the root to the right child. Using rank_1(B, 11) = 3 we count the number of ones up to that position. This is our new position in the next level. We continue the process until we reach a leaf. The value sought is the last value of rank, in our case 2.
Example 2.16.
Figure 2.6 shows an example of how to select the second 'b' in the sequence S (select_b(S, 2) = 16). We start at the leaf corresponding to 'b'. Since that symbol was last mapped to a 1, we go to the parent and compute our new position as select_1(B, 2) = 12. At that level, 'b' was mapped to a 0, so we go to the parent and the new position is select_0(B, 12) = 16, which is the value sought.
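The three examples above can be reproduced with a small pointer-based sketch of a wavelet tree. The following Python is our own illustration (the class and method names are ours): it uses plain O(n) scans instead of the constant-time rank/select bitmaps described above, and assigns the smaller half of the alphabet, ⌈σ/2⌉ symbols, to the left subtree, which matches the splits shown in the figures.

```python
# Pointer-based wavelet tree sketch over "alabar a la alabarda".
class WaveletTree:
    def __init__(self, seq, alphabet=None):
        if alphabet is None:
            alphabet = sorted(set(seq))
        self.alphabet = alphabet
        if len(alphabet) == 1:          # leaf: one-letter alphabet, no bitmap stored
            self.bits = None
            return
        mid = (len(alphabet) + 1) // 2  # L gets the smaller half of the alphabet
        self.left_set = set(alphabet[:mid])
        self.bits = [0 if c in self.left_set else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.left_set], alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_set], alphabet[mid:])

    def _rank_bit(self, b, i):          # number of b's in bits[1..i] (O(n) scan)
        return sum(1 for x in self.bits[:i] if x == b)

    def _select_bit(self, b, i):        # position of the i-th b (1-based)
        count = 0
        for pos, x in enumerate(self.bits, 1):
            count += (x == b)
            if count == i:
                return pos
        raise ValueError("not enough occurrences")

    def access(self, i):                # S[i], 1-based
        node = self
        while node.bits is not None:
            b = node.bits[i - 1]
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return node.alphabet[0]

    def rank(self, c, i):               # occurrences of c in S[1..i]
        node = self
        while node.bits is not None:
            b = 0 if c in node.left_set else 1
            i = node._rank_bit(b, i)
            node = node.left if b == 0 else node.right
        return i

    def select(self, c, i):             # position of the i-th occurrence of c
        path = []
        node = self
        while node.bits is not None:    # walk down to the leaf of c...
            b = 0 if c in node.left_set else 1
            path.append((node, b))
            node = node.left if b == 0 else node.right
        for node, b in reversed(path):  # ...then go back up using select
            i = node._select_bit(b, i)
        return i

wt = WaveletTree("alabar a la alabarda")
```

With this toy tree, access(11) = 'a', rank('l', 11) = 2 and select('b', 2) = 16, matching Examples 2.14, 2.15 and 2.16.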
A direct application of wavelet trees is to answer range search queries [MN07]. This method is very similar to the idea of Chazelle [Cha88].
Definition 2.17.
Given a subset R (|R| = t) of the discrete grid [1, n] × [1, σ], a range query returns the points p ∈ R belonging to a range [x_1, x_2] × [y_1, y_2].

Figure 2.5: Example of rank in a wavelet tree
Figure 2.6: Example of select in a wavelet tree
Figure 2.7: Example of a 2-dimensional range query

An extension of the wavelet tree supports range queries using n + t log n + o(n + t log n) bits, counting the number of points within the range in time O(log n) and reporting each occurrence in time O(log n). We will use a modified version of the implementation of Gonzalo Navarro.

We explain here a simplified version for the case in which there exists exactly one point for each value of x. We order the points of R by their x coordinate and create the sequence S[1, n], such that for each (x, y) ∈ R, S[x] = y. Then we build the wavelet tree of S.

Example 2.18.
Figure 2.7 shows a grid with exactly one y value for each value of x. The figure shows in yellow the range [17, 19] × [9, …]. (Check the LZ77-index source code, http://pizzachili.dcc.uchile.cl/indexes/LZ77-index, for the updated version.)

Projecting
A range in S represents a range along the x coordinate, and the splits made by the wavelet tree define ranges along the y coordinate. Every time we descend to a child of a node we need to know where the range represented in that child is. The operation of determining the range defined by a child, given the range of the parent, is called projecting. Using rank we project a range downwards. Given a node with bitmap B, the left projection of [x, x′] is [1 + rank_0(B, x − 1), rank_0(B, x′)] and the right projection is [1 + rank_1(B, x − 1), rank_1(B, x′)]. A range [y, y′] along the y coordinate is projected to the left as [y, ⌊(y + y′)/2⌋] and to the right as [⌊(y + y′)/2⌋ + 1, y′].

Counting
We start from the root with the one-dimensional ranges [x, x′] = [x_1, x_2] and [y, y′] = [1, σ] and project them into both subtrees. We do this recursively until:

1. [x, x′] = ∅;
2. [y, y′] ∩ [y_1, y_2] = ∅; or
3. [y, y′] ⊆ [y_1, y_2], in which case we add x′ − x + 1 to the total.

As the interval [y_1, y_2] is covered by O(log n) maximal wavelet tree nodes, the total time to count the occurrences is O(log n).

Example 2.19.
Figure 2.8 shows the wavelet tree that represents the grid of Figure 2.7, and how to count the occurrences in the range [17, 19] × [9, …]: the range [17, 19] in the x coordinate is projected downwards, and the nodes below the blue line are those whose y range is contained in the range [9, …].

Locating
To locate the actual points we start from each node at which we were counting. If we want to know the x coordinate we go up using select, and if we want to know the y coordinate we go down using rank. This operation takes O(log n) time for each point located.

Figure 2.8: Example of counting the occurrences in a 2-dimensional range query using a wavelet tree

2.7 Permutations

A permutation is a bijection π : [1, n] → [1, n], and we are interested in computing efficiently both π(i) and π⁻¹(i) for any 1 ≤ i ≤ n. The permutation can be represented in a plain array using n log n bits, by storing P = [π(1), ..., π(n)]. This answers π(i) in constant time. Solving π⁻¹(i) can be done by sequentially scanning P for the position j where π(j) = i. A more efficient solution [MRRR03] is based on the cycles of a permutation. A cycle is a sequence i, π(i), π²(i), ..., π^k(i) such that π^{k+1}(i) = i. Every i belongs to exactly one cycle. Then, to compute π⁻¹(i) we repeatedly apply π over i, finding the element e of the cycle such that π(e) = i. These solutions do not require any extra space to compute π⁻¹(i), but they take O(n) time in the worst case. Representing the sequence π[1, n] with a wavelet tree, one can answer both queries using O(log n) time and n log n + o(n log n) bits of space. A faster solution [MRRR03] is based on the cycles of the permutation: by introducing shortcuts in the cycles, it uses (1 + ε) n log n + O(n) bits and solves π(i) in constant time and π⁻¹(i) in O(1/ε) time, for any ε > 0.
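The shortcut idea can be sketched as follows. This is our own simplified Python illustration (0-based, with function names of our choosing), which marks one element out of t in each cycle with a pointer t steps back; it is not Munro et al.'s exact compressed encoding, which stores the marks and shortcuts succinctly.

```python
def build_shortcuts(pi, t):
    """Mark every t-th element of each cycle of pi with a pointer t steps back."""
    n = len(pi)
    seen = [False] * n
    shortcut = {}
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:          # collect the cycle containing `start`
            seen[j] = True
            cycle.append(j)
            j = pi[j]
        if len(cycle) > t:          # short cycles are fast enough without shortcuts
            for k in range(0, len(cycle), t):
                shortcut[cycle[k]] = cycle[k - t]   # t steps back (wraps around)
    return shortcut

def inv(pi, shortcut, i, t):
    """Compute pi^{-1}(i) in O(t) steps using the backward shortcuts."""
    j = i
    for _ in range(t):              # walk forward; a mark appears within t steps
        if pi[j] == i:
            return j
        if j in shortcut:
            j = shortcut[j]         # jump t steps back in the cycle
            break
        j = pi[j]
    while pi[j] != i:               # finish walking forward to the predecessor of i
        j = pi[j]
    return j
```

For example, for π(i) = (i + 3) mod 10 the inverse is (i − 3) mod 10, and the sketch recovers it while never walking more than O(t) steps.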
We will use the implementation of Munro et al.'s shortcut technique by Diego Arroyuelo, available at http://code.google.com/p/libcds.

2.8 Tree Representations

A classical representation of a general tree of n nodes requires O(nw) bits of space, where w ≥ log n is the bit length of a machine pointer. Typically only operations such as moving to the first child, to the next sibling, or to the i-th child are supported in constant time. By further increasing the constant, some other simple operations are easily supported, such as moving to the parent, knowing the subtree size, or the depth of a node. However, the Ω(n log n)-bit space complexity is excessive in terms of information theory. The number of different general trees of n nodes is C_n ≈ 4ⁿ/n^{3/2}, hence log C_n = 2n − Θ(log n) bits are sufficient to distinguish any one of them. There are several succinct tree representations that use 2n + o(n) bits of space and answer most queries in constant time (see the review by Arroyuelo et al. [ACNS10] for a detailed exposition); here we explain the DFUDS [BDM+05] representation, as this is the one that meets our requirements.
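The 2n − Θ(log n) bound can be checked with Stirling's approximation (a standard derivation, not taken from the thesis), since C_n is the n-th Catalan number:

```latex
C_n = \frac{1}{n+1}\binom{2n}{n} \sim \frac{4^n}{\sqrt{\pi}\, n^{3/2}}
\quad\Longrightarrow\quad
\log C_n = 2n - \tfrac{3}{2}\log n - O(1) = 2n - \Theta(\log n).
```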
Definition 2.20.
A sequence S drawn from the alphabet Σ = {0, 1} is said to be balanced if: (1) there are as many 0s as 1s, and (2) at any position i the number of ones to the left is greater than or equal to the number of zeroes (i.e., rank_1(S, i) ≥ rank_0(S, i)). Usually a balanced sequence is referred to as balanced parentheses, identifying 1 with '(' and 0 with ')', as the nesting of parentheses satisfies the above definition.

The operations defined over a balanced sequence are: (1) findclose(S, i) (findopen(S, i)), which finds the position of the closing (opening) parenthesis matching the opening (closing) parenthesis at position i, and (2) enclose(S, i), which gives the position of the tightest parenthesis pair enclosing position i.

Definition 2.21 ([BDM+05]). The depth-first unary degree sequence (DFUDS) is generated by a depth-first traversal of the tree, appending at each node the degree of the node in unary. Additionally, a leading 1 is prepended to the sequence to make it balanced and to allow the concatenation of several such encodings into a forest.

The DFUDS sequence represents the topology of the tree using 2n bits. Tree nodes are identified in the DFUDS sequence according to their rank in the order given by the depth-first traversal (more precisely, the i-th node is identified by position select(i) in the DFUDS encoding). Figure 2.9 shows the DFUDS bit sequence for the example tree. The red 1 in the sequence is the leading 1 added to make the sequence balanced. The green node is represented by the 10th 1 in the sequence, as it is the 10th node visited during a depth-first traversal. The blue sequence of five 1s and one 0 is the degree of the blue node.

Figure 2.9: Example of the DFUDS representation

To solve the common operations over trees, two data structures are built over the DFUDS sequence: a bitmap data structure supporting rank and select (Section 2.5) and a data structure solving the operations findclose, findopen and enclose [Jac89, MR01, Nav09].
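As an illustration, the DFUDS encoding and a naive findclose can be sketched as follows. This is our own Python, written with parenthesis characters rather than bits; the thesis's structures answer findclose in constant time with o(n) extra bits, while this scan is O(n).

```python
def dfuds(tree):
    """DFUDS of a tree given as nested lists: each node is the list of its children."""
    out = ["("]                            # leading '(' makes the sequence balanced
    stack = [tree]
    while stack:                           # iterative depth-first (preorder) traversal
        node = stack.pop()
        out.append("(" * len(node) + ")")  # degree of the node in unary
        stack.extend(reversed(node))       # push children so they pop left to right
    return "".join(out)

def findclose(s, i):
    """Position of the ')' matching the '(' at position i (0-based, O(n) scan)."""
    depth = 0
    for j in range(i, len(s)):
        depth += 1 if s[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")

# A root with three children, the first of which has two leaf children (6 nodes):
example = dfuds([[[], []], [], []])
```

For this 6-node tree the sequence is "(((()(()))))", i.e., 2n = 12 symbols, and findclose of the leading parenthesis is the last position.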
These structures allow one to compute the most common operations in constant time using o(n) additional bits of space. Additionally, if we use labeled trees we need to store the labels of the edges in an array chars, using n log σ additional bits, where σ is the size of the labels' alphabet. The label of the edge pointing to the i-th child of node x is at chars[rank_1(dfuds, x) + i]. The operations we are interested in for this thesis are:

• degree(x): number of children of node x.
• isLeaf(x): whether node x is a leaf.
• child(x, i): i-th child of node x.
• labeledChild(x, c): child of node x labeled by symbol c.
• leftmostLeaf(x): leftmost leaf of the subtree rooted at node x.
• rightmostLeaf(x): rightmost leaf of the subtree rooted at node x.
• leafRank(x): number of leaves to the left of node x.
• preorder(x): preorder position of node x.

All these operations can be solved theoretically in constant time; however, in practice labeledChild is solved by binary searching the labels of the children, because this is much easier to implement and fast enough in practice. To solve leftmostLeaf, rightmostLeaf and leafRank we need to solve leaf rank and select queries: the former returns the number of leaves in the bitmap up to position i, and the latter returns the position p of the i-th leaf in the bitmap. Solving these queries requires an additional data structure that uses o(n) bits; it uses the same ideas as the structures solving rank and select for binary alphabets. We will use a modified version of the implementation of Diego Arroyuelo, available at http://code.google.com/p/libcds, adding support for leaf-related operations.

2.9 Tries

A trie or digital tree is a data structure that stores a set of strings. It can find the elements of the set prefixed by a pattern in time proportional to the pattern length.

Definition 2.22.
A trie for a set S of distinct strings is a tree where each node represents a distinct prefix of the set. The root node represents the empty prefix ε. A node v representing prefix Y is a child of node u representing prefix X iff Y = Xc for some character c, which labels the edge between u and v.

We assume that all strings are terminated by a special symbol $, not present in the alphabet. We do this in order to ensure that no string S_i is a prefix of another string S_j. This property guarantees that the trie has exactly |S| leaves. Figure 2.10 shows an example of a trie.

Figure 2.10: Example of a trie for the set S = {'alabar', 'a', 'la', 'alabarda'}.

A trie for the set S = {S_1, ..., S_n} is easily built in O(|S_1| + ... + |S_n|) time by successive insertions (assuming we can descend to any child in constant time). A pattern P is searched for in the trie starting from the root and following the edges labeled with the characters of P. This takes O(|P|) total time.

A compact trie is an alternative representation that reduces the space of the trie by collapsing unary paths into a single node and labeling the edge with the concatenation of all the labels. A PATRICIA tree [Mor68], an alternative that uses even less space, stores just the first character of the label string and its length. This variant is used when the strings S_i are available separately, as not all the information is stored in the edges. In this variant, after the search we need to check whether the prefix found actually matches the pattern. To do so, we extract the text corresponding to any string with the prefix found and compare it with the pattern. If they are equal then all the leaves are occurrences (i.e., strings prefixed with the pattern), otherwise none is an occurrence. Figure 2.11 shows an example of this kind of trie.
Figure 2.11: Example of a PATRICIA trie for the set S = {'alabar', 'a', 'la', 'alabarda'}. The values in parentheses are, respectively, the first character of the label and the length of the label.

2.10 Suffix Trees

Definition 2.23. A suffix trie is a trie composed of all the suffixes T[i, n] of a given text T[1, n]. The leaves of the trie store the positions where the suffixes start.

Definition 2.24 ([Wei73, McC76]). A suffix tree is a PATRICIA tree built over all the suffixes T[i, n] of a given text T[1, n]. The leaves of the tree indicate the text positions where the corresponding suffixes start.

Figure 2.12 shows the suffix tree for the text 'alabar a la alabarda$'.

Figure 2.12: The suffix tree for the text 'alabar a la alabarda$'
The suffix tree can be built in O(n) time using O(n log n) bits of space [McC76, Ukk95].

A suffix tree is able to find all the occ occurrences of a pattern P of length m in time O(m + occ), i.e., to solve the locate query described in Section 2.2. After descending through the tree according to the characters of the pattern, we can be in three different cases: (i) we reach a point where there is no edge labeled with the current character of P, which means that the pattern does not occur in T; (ii) we finish reading P at an internal node (or in the middle of an edge), in which case the suffixes of the corresponding subtree are either all occurrences or none, so we only need to check whether one of those suffixes matches the pattern P; (iii) we end up at a leaf without consuming all of the pattern, in which case at most one occurrence is found after checking the suffix against the pattern. As a subtree with occ leaves has O(occ) nodes, the total time for reporting the occurrences is as stated above.

The suffix tree can also solve the queries count and exists in O(m) time. The process is similar to that of locate. First we descend the tree according to the pattern. Then, we check whether one of the suffixes of the subtree is a match. If it is, the answer to count is the number of leaves of the subtree (for which we need to store in each internal node the number of leaves descending from it), otherwise it is zero.

2.11 Suffix Arrays

Definition 2.25 ([MM93, GBYS92]). A suffix array A[1, n] is a permutation of the integer interval [1, n] such that T[A[i], n] < T[A[i+1], n] for all 1 ≤ i < n. In other words, it is a permutation of the suffixes of the text such that the suffixes are lexicographically sorted.

Figure 2.13 shows the suffix array for the text 'alabar a la alabarda$'. The character $ is the smallest one in lexicographical order.
The zone highlighted in gray represents those suffixes starting with 'a'.

Figure 2.13: The suffix array for the text 'alabar a la alabarda$'

Note that the suffix array could be computed by collecting the values at the leaves of the suffix tree. However, several methods exist that compute the suffix array directly in O(n) or O(n log n) time, using significantly less space. For a complete survey see [PST07].

The suffix array can solve locate queries in O(m log n + occ) time, and count and exists queries in O(m log n) time. First, we search for the interval A[sp₁, ep₁] of the suffixes starting with P[1]. This can be done via two binary searches on A. The first binary search determines the starting position sp₁ of the suffixes lexicographically larger than or equal to P[1], and the second determines the ending position ep₁ of the suffixes that start with P[1]. Then, we consider P[2], narrowing the interval to A[sp₂, ep₂], which holds all suffixes starting with P[1, 2]. We continue in this way until the pattern is fully consumed or the current interval becomes empty. Note that this algorithm searches for the pattern from left to right. For each character of the pattern we do two binary searches taking at most O(log n) time, hence the total time is O(m log n). Then locate reports all occurrences in O(occ) time, and the answer to count is epₘ − spₘ + 1. We can also directly search for the interval A[sp, ep] where the suffixes start with the whole pattern P, using just two binary searches on A, which find the first and last positions where the suffixes start with P. Each comparison between the pattern and a suffix takes at most O(m) time, hence the total running time is also O(m log n). Yet, this is faster in practice than the previous method and is what we use in this thesis.

2.12 Backward Search

Backward search is an alternative method for finding the interval [sp, ep] corresponding to a pattern P in the suffix array.
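As a concrete illustration, the double binary search described above can be sketched as follows. This is a toy sketch: the suffix array is built here by explicit sorting rather than by the linear-time constructions cited above, and positions are 0-indexed.

```python
# Toy suffix-array construction and pattern search, illustrating the
# O(m log n) double binary search (0-indexed positions).

def suffix_array(t):
    # Sort suffix start positions lexicographically by the suffixes they denote.
    return sorted(range(len(t)), key=lambda i: t[i:])

def sa_locate(t, sa, p):
    # First binary search: first suffix lexicographically >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    sp = lo
    # Second binary search: first suffix whose length-|p| prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    ep = lo
    return sorted(sa[sp:ep])  # occurrence positions; count = ep - sp

t = 'alabar a la alabarda$'
sa = suffix_array(t)
print(sa_locate(t, sa, 'la'))  # [1, 9, 13]
```

The half-open interval [sp, ep) of rows corresponds to A[sp, ep] in the 1-indexed notation of the text.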
It searches for the pattern from right to left, and is based on the Burrows-Wheeler transform.

Definition 2.26 ([BW94]). Given a text T terminated with a special character T[n] = $ smaller than all others, and its suffix array A[1, n], the Burrows-Wheeler transform (BWT) of T is defined as T^bwt[i] = T[A[i] − 1], except when A[i] = 1, in which case T^bwt[i] = T[n].

In other words, the transformation is conceptually built first by generating all the cyclic shifts of the text, then sorting them lexicographically, and finally taking the last character of each shift. In practice it can be built in linear time by building the suffix array first. We can think of the sorted list of cyclic shifts as a conceptual matrix M[1, n][1, n]. Figure 2.14 shows an example of how the BWT is computed for the text 'alabar a la alabarda$'. This transformation has the advantage of being easily compressed by local compressors [Man01]. It can be reversed as follows.

Definition 2.27.
The LF-mapping LF(i) maps a position i in the last column of M (L = T^bwt) to the position of its occurrence in the first column of M (F).

Lemma 2.28 ([FM05]). It holds LF(i) = C[c] + rank_c(T^bwt, i), where c = T^bwt[i] and C[c] is the number of symbols smaller than c in T.

Lemma 2.29 ([BW94]). The LF-mapping allows one to reverse the Burrows-Wheeler transform.

Figure 2.14: The BWT of the text 'alabar a la alabarda$'
Proof. We know that T[n] = $, and since $ is the smallest symbol, F[1] = $ = T[n] and thus L[1] = T^bwt[1] = T[n − 1]. Using the LF-mapping we compute i = LF(1); knowing that T[n − 1] is at F[i], we have T[n − 2] = L[i], as L[i] always precedes F[i] in T. In general, it holds T[n − k] = T^bwt[LF^{k−1}(1)].

Given the close relation between the suffix array and the BWT, it is natural to expect that a search algorithm can work on top of the BWT. Such an algorithm is called backward search (BWS); at each stage it computes the interval [sp_i, ep_i] of the suffix array in which the suffixes start with P[i, m], starting from i = m and ending with i = 1. Narrowing the interval A[sp, ep] with a new character c is called a BWS(sp, ep, c) step, and it is done very similarly to the LF-mapping (Lemma 2.28). BWS searches for a pattern from right to left, opposite to the search on suffix arrays, which proceeds from left to right.

Figure 2.15 shows the backward search algorithm. Lines 5-7 correspond to the BWS step.

BWS(P)
 1  i ← len(P)
 2  sp ← 1
 3  ep ← n
 4  while sp ≤ ep and i ≥ 1 do
 5      c ← P[i]
 6      sp ← C[c] + rank_c(T^bwt, sp − 1) + 1
 7      ep ← C[c] + rank_c(T^bwt, ep)
 8      i ← i − 1
 9  if sp > ep then return ∅
10  return (sp, ep)

Figure 2.15: Backward Search algorithm (BWS)

2.13 Lempel-Ziv Parsings and Repetitions

Lempel and Ziv proposed in the seventies a new family of compression methods [LZ76, ZL77, ZL78]. The basic idea is to replace a repeated portion of the text with a pointer to some previous occurrence of that portion. To find the repetitions, they keep a dictionary representing all the portions that can be copied. Many variants of these algorithms exist [SS82, Wel84, Wil91], which differ in the way they parse the text or in the encoding they use.

The LZ77 [ZL77] parsing is a dictionary-based compression scheme in which the dictionary used is the set of substrings of the preceding text. This definition allows it to achieve one of the best compression ratios for repetitive texts.
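The BWS steps of Figure 2.15 can be sketched directly on top of the BWT. This toy version uses 0-indexed, half-open intervals [sp, ep) and a naive O(n) rank; a real implementation would use a sublinear-time rank structure (e.g., a wavelet tree):

```python
def bwt(t):
    # t must end with the unique smallest terminator '$'.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return ''.join(t[i - 1] for i in sa)

def backward_search(L, C, p):
    # Half-open interval [sp, ep) of suffix-array rows whose suffixes start with p.
    sp, ep = 0, len(L)
    for c in reversed(p):                 # consume the pattern right to left
        if c not in C:
            return None
        rank = lambda i: L[:i].count(c)   # rank_c(L, i): occurrences of c in L[0:i]
        sp = C[c] + rank(sp)              # the BWS step (cf. lines 5-7 of Fig. 2.15)
        ep = C[c] + rank(ep)
        if sp >= ep:
            return None
    return sp, ep

t = 'alabar a la alabarda$'
L = bwt(t)
C = {c: sum(x < c for x in t) for c in set(t)}  # symbols smaller than c in T
sp, ep = backward_search(L, C, 'la')
print(ep - sp)  # 3
```

The answer to count is ep − sp, matching the three occurrences of 'la' in the running example.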
Definition 2.30 ([ZL77]). The LZ77 parsing of text T[1, n] is a sequence Z[1, n′] of phrases such that T = Z[1] Z[2] … Z[n′], built as follows. Assume we have already processed T[1, i − 1], producing the sequence Z[1, p − 1]. Then, we find the longest prefix T[i, i′ − 1] of T[i, n] which occurs in T[1, i − 1], set Z[p] = T[i, i′] and continue with i = i′ + 1. The occurrence in T[1, i − 1] of the prefix T[i, i′ − 1] is called the source of the phrase Z[p]. (The original definition allows the source of T[i, i′ − 1] to extend beyond position i − 1, but we ignore this feature in this thesis.)

Note that each phrase is composed of the content of a source, which can be the empty string ε, plus a trailing character. Note also that all phrases of the parsing are different, except possibly the last one. To avoid that case, a special character $ is appended at the end, T[n] = $.

Typically a phrase is represented as a triple Z[p] = (start, len, c), where start is the start position of the source, len is the length of the source, and c is the trailing character.

Example 2.31. Let T = 'alabar a la alabarda$'; the LZ77 parsing is as follows (␣ denotes a space):

a | l | ab | ar | ␣ | a␣ | la␣ | alabard | a$

In this example the seventh phrase copies two characters starting at position 2 and has a trailing character '␣'.

One of the greatest advantages of this algorithm is its simple and fast decompression scheme, as opposed to the construction algorithm, which is more complicated. Decompression runs in linear time by copying the source content referenced by each phrase and then the trailing character. However, random text extraction is not as easy.

The LZ78 [ZL78] compression scheme is also dictionary-based. Its dictionary is the set of all phrases previously produced. Because of this definition of the dictionary, the construction process is much simpler than that of LZ77.
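The parsing of Definition 2.30, with sources confined to T[1, i − 1] as assumed in this thesis, can be sketched with a naive quadratic scan (real constructions use suffix-tree or suffix-array based methods):

```python
def lz77_parse(t):
    # Returns triples (start, len, c), 0-indexed; (start, len) describe the source.
    i, phrases = 0, []
    while i < len(t):
        best_len, best_start = 0, 0
        for j in range(i):  # try every source start within T[0, i-1]
            l = 0
            while i + l < len(t) and j + l < i and t[j + l] == t[i + l]:
                l += 1
            l = min(l, len(t) - i - 1)  # leave room for the trailing character
            if l > best_len:
                best_len, best_start = l, j
        phrases.append((best_start, best_len, t[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decompress(phrases):
    # Linear-time decompression: copy each source, then the trailing character.
    t = ''
    for start, length, c in phrases:
        t += t[start:start + length] + c
    return t

t = 'alabar a la alabarda$'
z = lz77_parse(t)
print(len(z))  # 9
```

On the running example this produces the nine phrases of Example 2.31; the seventh is (1, 2, '␣') in 0-indexed form, i.e., two characters copied from position 2 in 1-indexed terms.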
Definition 2.32 ([ZL78]). The LZ78 parsing of text T[1, n] is a sequence Z[1, n′] of phrases such that T = Z[1] Z[2] … Z[n′], built as follows. Assume we have already processed T[1, i − 1], producing the sequence Z[1, p − 1]. Then, we find the longest phrase Z[j], for some j ≤ p − 1, that is a prefix of T[i, n], set Z[p] = Z[j] T[i + |Z[j]|] and continue with i = i + |Z[j]| + 1.

Typically a phrase is represented as Z[p] = (j, c), where j is the phrase number of the source and c is the trailing character.

Example 2.33. Let T = 'alabar a la alabarda$'; the LZ78 parsing is as follows (␣ denotes a space):

a | l | ab | ar | ␣ | a␣ | la | ␣a | lab | ard | a$

In this example the ninth phrase copies the two characters of the seventh phrase ('la') and has a trailing character 'b'.

With respect to compression, both LZ77 and LZ78 converge to the entropy of stationary ergodic sources [LZ76, ZL78]. They also converge to the empirical entropy (Section 2.3), as detailed next.

Definition 2.34 ([KM99]). A parsing algorithm is said to be coarsely optimal if its compression ratio ρ(T) differs from the k-th order empirical entropy H_k(T) by a quantity that depends only on the length of the text and goes to zero as the length increases. That is, ∀k ∃f_k, lim_{n→∞} f_k(n) = 0, such that for every text T, ρ(T) ≤ H_k(T) + f_k(|T|).

Theorem 2.35 ([KM99, PWZ92]). The LZ77 and LZ78 parsings are coarsely optimal.
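The LZ78 parsing of Definition 2.32 can be sketched with a dictionary of previously produced phrases (a toy version; production implementations store the dictionary as a trie):

```python
def lz78_parse(t):
    # Returns pairs (j, c): j is the source phrase number (0 = empty phrase).
    phrases, d, i = [], {}, 0
    while i < len(t):
        j, l = 0, 0
        # Extend while t[i : i+l+1] is a known phrase and a trailing char remains.
        while i + l + 1 < len(t) and t[i:i + l + 1] in d:
            j = d[t[i:i + l + 1]]
            l += 1
        phrases.append((j, t[i + l]))
        d[t[i:i + l + 1]] = len(phrases)  # the new phrase gets number p
        i += l + 1
    return phrases

def lz78_decompress(phrases):
    strs = ['']  # phrase 0 is the empty string
    for j, c in phrases:
        strs.append(strs[j] + c)
    return ''.join(strs[1:])

t = 'alabar a la alabarda$'
z = lz78_parse(t)
print(len(z))  # 11
```

On the running example this produces the eleven phrases of Example 2.33; the ninth phrase is (7, 'b'), copying the seventh phrase 'la'.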
As explained in Section 2.3, however, converging to H_k(T) is not good enough for repetitive texts. Repetitive texts originate in applications where many similar versions of one base text are generated (e.g., collections of DNA sequences), or where successive versions are generated, each one similar to the preceding one (e.g., a wiki). Statistical compressors are not able to capture this characteristic, because they predict a symbol based only on a short previous context, and such statistics do not change when the text is replicated many times (see Section 2.3 for the relation between H_k(T) and H_k(TT)). Compressors based on repetitions, such as Lempel-Ziv parsings or grammar-based ones, do exploit this repetitiveness.
2.14 Self-Indexes

Definition 2.36. A self-index [NM07] is an index that uses space proportional to that of the compressed text and solves the queries locate and extract. As this kind of index can reproduce any text substring, it replaces the original text. Additionally, some indexes provide more efficient ways of computing exists and count queries.

There are several general-purpose self-indexes; however, most of them do not achieve high compression on repetitive texts, as they are only able to compress up to the k-th order empirical entropy (Section 2.3). Most are based on the BWT or the suffix array (see [NM07] for a complete survey). In recent years some self-indexes oriented to repetitive texts have been proposed. We cover these now.

The Run-Length Compressed Suffix Array (RLCSA) [SVMN08] is based on the Compressed Suffix Array of Sadakane [Sad03]. This is built around the so-called Ψ function.
Definition 2.37 ([GV05]). Let A[1, n] be the suffix array of a text T. Then Ψ(i) is defined as Ψ(i) = A⁻¹[(A[i] mod n) + 1].

The Ψ function is the inverse of the LF-mapping. Ψ maps suffix T[A[i], n] to suffix T[A[i] + 1, n], allowing one to scan the text from left to right. A run in the Ψ array is a maximal interval [a, b] such that ∀i ∈ [a, b − 1], Ψ(i + 1) = Ψ(i) + 1.

In the RLCSA, one run-length encodes the differences Ψ[i] − Ψ[i − 1] and stores absolute samples of the array Ψ. This structure is very fast for count and exists queries. Its major drawback is the sampling it requires for locate and extract queries, as it takes (n log n)/s extra bits to achieve locating time O(s), and time O(s + r − l) for extract(l, r), where s is the sampling step.

The number of runs may be much smaller than nH_k(T) (for example runs(T) = runs(TT), whereas |TT| H_k(TT) ≥ |T| H_k(T), as shown in Section 2.3). However, the difference between the number of runs and the number of phrases of an LZ77 parsing [ZL77] may be a multiplicative factor as high as Θ(√n) (Veli Mäkinen, personal communication). For these reasons, the RLCSA seems to be an intermediate solution between LZ77-based and empirical-entropy-based indexes.
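A sketch of Definition 2.37 and of the run structure the RLCSA exploits (0-indexed, so Ψ(i) = A⁻¹[(A[i] + 1) mod n]):

```python
def psi_array(t):
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # toy suffix array
    inv = [0] * n
    for r, i in enumerate(sa):
        inv[i] = r                               # inv = A^{-1}
    return sa, [inv[(sa[r] + 1) % n] for r in range(n)]

def count_runs(psi):
    # A run is a maximal interval where Psi increases by exactly 1.
    return 1 + sum(1 for i in range(1, len(psi)) if psi[i] != psi[i - 1] + 1)

t = 'alabar a la alabarda$'
sa, psi = psi_array(t)
# Following Psi from the row of suffix T[0, n) spells the text left to right.
r, out = sa.index(0), []
for _ in range(len(t)):
    out.append(t[sa[r]])
    r = psi[r]
print(''.join(out) == t)  # True
```

Run-length encoding the Ψ differences is profitable exactly when `count_runs(psi)` is much smaller than n, which is what happens on repetitive texts.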
In this section we present two indexes [KU96b, KU96a] by Kärkkäinen and Ukkonen. Although these are not self-indexes, they set the ground for several self-indexes proposed later.

• First, they choose some indexing positions of the text. These can be evenly spaced points [KU96b] or the points defined by a Lempel-Ziv parsing [KU96a].
• The suffixes starting at those points are indexed in a suffix trie, and the reversed prefixes in another trie.
• The index in principle only allows one to find occurrences crossing an indexing point.
• To find a pattern P of length m, they partition it in all m + 1 possible ways into a prefix and a suffix.
In most cases the parsing was restricted to LZ78 (Section 2.14.3), since it simplifies the index, and in others to text grammars (SLPs, Section 2.14.4). In the following two subsections we list the results obtained in those cases. This thesis can also be thought of as an heir of this fundamental scheme: in this work, compact data structures supporting the LZ77 parsing are developed for the first time, and they show better performance on repetitive texts.
In this section we present the space and running times of two indexes based on LZ78. Although they offer decent upper bounds and competitive performance on typical texts, experiments [SVMN08] have demonstrated that LZ78 is too weak to profit from highly repetitive texts. There are other such self-indexes [FM05], but as far as we know they have not been implemented.
Navarro's and Arroyuelo et al.'s LZ-Index
Navarro's LZ-Index [Nav04] is the first self-index based on the LZ78 parsing using O(nH_k(T)) bits of space (it is also the first implemented in practice). It uses 4n′ log n′ (1 + o(1)) bits and takes O(m log σ + (m + occ) log n′) time to locate the occ occurrences of a pattern of length m, where σ is the size of the alphabet and n′ is the number of phrases of the parsing.

Arroyuelo et al. later improved the time and space of the index, achieving (2 + ε) n′ log n′ (1 + o(1)) bits and O(m + (m + occ) log n′) locate time [ANS10], or (3 + ε) n′ log n′ (1 + o(1)) bits and O((m + occ) log n′) locate time [AN07].

Russo and Oliveira's ILZI
Russo and Oliveira present a self-index based on the so-called maximal parsing, called ILZI [RO08].

Definition 2.38 ([RO08]). Given a suffix trie T (of a set of strings), the T-maximal parsing of a string T is the sequence of nodes v₁, …, v_f such that T = v₁ ⋯ v_f and, for every j, v_j is the longest prefix of v_j ⋯ v_f that is a node of T.

First, they compute the LZ78 parsing of T^rev, and then generate a suffix tree T over the set of the reversed phrases. Next they build the maximal parsing of T using T. This parsing improves on the compression of LZ78, as shown by the following lemma.

Lemma 2.39 ([RO08]). If the number of phrases of the LZ78 parsing of T is n′, then the T-maximal parsing of T has at most n′ phrases.

Their index uses at most 5n′ log n′ (1 + o(1)) bits and takes O((m + occ) log n′) time to locate the occ occurrences of a pattern of length m (here n′ is the number of blocks of the maximal parsing).

Claude and Navarro [CN09] proposed a self-index based on straight-line programs (SLPs). SLPs are grammars in which the rules are either X_i → α ∈ Σ or X_i → X_l X_r, for l, r < i. The LZ78 [ZL78] parsing may produce an output exponentially larger than the smallest SLP. The LZ77 [ZL77] parsing, in turn, outperforms the smallest SLP [CLL+05]: from an LZ77 parse of ℓ phrases one can build an SLP with O(ℓ log ℓ) rules and height O(log ℓ). Again, SLPs are intermediate between LZ77 and other methods.

The index [CN09] uses n′ log n + O(n′ log n′) bits of space, where n′ is the number of rules of the grammar. It solves extract(l, r) in O((r − l + h) log n′) time and locate in O((m(m + h) + h · occ) log n′) time, where h is the height of the derivation tree of the grammar and m is the length of the pattern.

Claude et al. [CFMPN10] evaluated a practical implementation using the grammar produced by Re-Pair [LM00].
The results are competitive with the RLCSA only for extremely repetitive texts and short patterns.

Chapter 3: A Repetitive Corpus Testbed

In this chapter we present a corpus of repetitive texts. These texts are categorized according to the source they come from into the following: Artificial Texts, Pseudo-Real Texts and Real Texts. The main goal of this collection is to serve as a standard testbed for benchmarking algorithms oriented to repetitive texts. The corpus can be downloaded from http://pizzachili.dcc.uchile.cl/repcorpus.html.

3.1 Artificial Texts

This subset is composed of highly repetitive texts that do not come from any real-life source, but are artificially generated through some mathematical definition and have interesting combinatorial properties.

Fibonacci Sequence (Fₙ)

This sequence is defined by the recurrence
F₁ = b,  F₂ = a,  Fₙ = Fₙ₋₁ Fₙ₋₂.   (3.1)
The length of the string Fₙ is the Fibonacci number fₙ, and the sequence is a Sturmian word [Lot02], which means it has i + 1 different substrings (factors) of length i.

Thue-Morse Sequence (Tₙ)

This sequence [AS99] is defined by the recurrence
T₀ = 0,  Tₙ = Tₙ₋₁ T̄ₙ₋₁,   (3.2)
where T̄ is the bitwise negation of T (i.e., all 0s get converted to 1s and all 1s to 0s). Because of the construction scheme of this sequence, there are many substrings of the form XX, for various strings X. However, there are no overlapping squares, i.e., substrings of the form 0X0X0 or 1X1X1. Furthermore, this sequence is strongly cube-free, i.e., there are no substrings of the form XXx, where x is the first character of the string X. Another interesting property of this string is that it is recurrent. That is, given any finite substring w of length n, there is some length n_w (often much longer than n) such that w is contained in every substring of length n_w. The length of these strings is |Tₙ| = 2ⁿ.

Run-Rich Strings (Rₙ)

A measure of string complexity, related to the regularities of the text and strongly related to the LZ77 parsing [KK99], is the number of runs.
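The two recurrences can be sketched as follows (assuming the conventional base cases F₁ = b, F₂ = a and T₀ = 0; other equivalent conventions exist):

```python
def fibonacci_word(n):
    # F1 = 'b', F2 = 'a', Fn = F(n-1) + F(n-2)  (assumed base cases)
    a, b = 'b', 'a'            # a = F(k), b = F(k+1), starting at k = 1
    for _ in range(n - 1):
        a, b = b, b + a        # shift: F(k), F(k+1) -> F(k+1), F(k+2)
    return a

def thue_morse(n):
    # T0 = '0', Tn = T(n-1) + complement(T(n-1)); |Tn| = 2^n
    t = '0'
    for _ in range(n):
        t += t.translate(str.maketrans('01', '10'))
    return t

print(fibonacci_word(6))  # abaababa
print(thue_morse(3))      # 01101001
```

The sketch reflects the stated properties: each Fₙ₋₁ is a prefix of Fₙ, and the Thue-Morse string contains no cubes such as 000 or 111.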
Definition 3.1. A period of a string T[1, n] is a positive integer p such that ∀ 1 ≤ i ≤ n − p, T[i] = T[i + p]. A string is said to be periodic if its minimum period p satisfies p ≤ n/2.

Definition 3.2 ([Mai89]). The substring T[i, j] is a run in a string T iff T[i, j] is periodic, with minimum period p, and it is not extendable to the right (j = n or T[j + 1] ≠ T[j − p + 1]) nor to the left (i = 1 or T[i − 1] ≠ T[i + p − 1]).

The maximum number of runs in a string of length n is known to be higher than 0.944n [MKI+08] and lower than 1.029n [CIT08]. Franek et al. [FSS03] show a constructive and simple way to obtain strings with many runs; the n-th of those strings is denoted Rₙ. The ratio of the number of runs of their strings to their length approaches 3/(1 + √5) = 0.927….

3.2 Pseudo-Real Texts

Here we present a set of texts that were generated by artificially adding repetitiveness to real texts; thus we call them pseudo-real texts.

To generate the texts, we took a prefix of 1MiB of each text of the Pizza&Chili Corpus, we mutated it, and we concatenated all the resulting texts in the order they were generated. Each mutation takes a random character position and changes it to a random character different from the original one.

We used two different schemes for the mutations. The first one (Scheme 1) always generates different mutations of the first text. The second (Scheme 2) mutates the last text generated. The second scheme resembles the changes obtained through time in a software project or the versions of a document, while the first scheme produces changes analogous to the ones found in a collection of related DNA sequences. The mutation rate, i.e., the percentage of mutated characters, was set to 0.001%, 0.01% and 0.1%.

• Sources: This file is formed by C/Java source code obtained by concatenating all the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.
• Pitches: This file is a sequence of MIDI pitch values (bytes in 0-127, plus a few extra special values) obtained from a myriad of MIDI files freely available on the Internet.
• Proteins: This file is a sequence of newline-separated protein sequences obtained from the Swissprot database.
• DNA: This file is a sequence of newline-separated gene DNA sequences obtained from files of the Gutenberg Project.
• English: This file is the concatenation of English text files selected from the etext02 to etext05 collections of the Gutenberg Project.
• XML: This file is an XML file that provides bibliographic information on major computer science journals and proceedings; it was obtained from http://dblp.uni-trier.de.
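The generation process above can be sketched as follows (function and parameter names are hypothetical; the actual corpus was generated from 1MiB Pizza&Chili prefixes with the rates listed above):

```python
import random

def mutate(text, rate, rng):
    # Change floor(len * rate) random positions to a different random character.
    chars, alphabet = list(text), sorted(set(text))
    for _ in range(int(len(chars) * rate)):
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice([c for c in alphabet if c != chars[pos]])
    return ''.join(chars)

def pseudo_real(base, copies, rate, scheme, seed=0):
    # Scheme 1 always mutates the base text; Scheme 2 mutates the last copy.
    rng = random.Random(seed)
    texts = [base]
    for _ in range(copies - 1):
        source = base if scheme == 1 else texts[-1]
        texts.append(mutate(source, rate, rng))
    return ''.join(texts)
```

Under Scheme 2 the expected distance to the base text grows with each copy, mimicking versioned documents, while Scheme 1 keeps every copy equally close to the base, mimicking related DNA sequences.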
(The Pizza&Chili Corpus is available at http://pizzachili.dcc.uchile.cl.)

3.3 Real Texts

This subset is composed of texts coming from real repetitive sources. These sources are DNA, Wikipedia articles, source code, and documents. For the case of DNA we concatenated the texts in random order. For the others, we concatenated the texts according to the date they were created, from oldest to newest.
Our DNA texts come from different sources.

• The Saccharomyces Genome Resequencing Project provides two text collections: para, which contains 36 sequences of Saccharomyces Paradoxus, and cere, which contains 37 sequences of Saccharomyces Cerevisiae.
• From the National Center for Biotechnology Information (NCBI) we collected several DNA sequences of the same bacterial species. The species we collected are Escherichia Coli (23), Salmonella Enterica (15), Staphylococcus Aureus (14), Streptococcus Pyogenes (13), Streptococcus Pneumoniae (11) and Clostridium Botulinum (10). We write in parentheses the total number of sequences of each collection. We chose these species as they were the only ones with 10 or more different sequences.
• A collection composed of 78,041 sequences of Haemophilus Influenzae, also coming from the NCBI (ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz).

Remark 3.3. Although there are four bases {A, C, G, T}, DNA sequences may have alphabets of size up to 16 = 2⁴, because some characters denote an unknown choice among the four bases. The most common such character is N, which denotes a totally unknown symbol.

We downloaded all versions of three Wikipedia articles: Albert Einstein, Alan Turing and Nobel Prize. We downloaded them in English (denoted en) and German (denoted de). We chose these languages as they are among the most widely used on the Internet and their alphabets may be represented using standard 1-byte encodings. The versions for all documents are up to January 12, 2010, except for the English article on Albert Einstein, which was downloaded only up to November 10, 2006 because of the massive number of versions it has.
We collected all 5.x versions of the Coreutils package and removed all binary files, making a total of 9 versions. We also collected all 1.0.x and 1.1.x versions of the Linux Kernel, making a total of 36 versions.

We took all the PDF files of CIA World Leaders from January 2003 to December 2009, and converted them to text (using the software pdftotext).

3.4 Statistics

To understand the characteristics of the texts present in the Repetitive Corpus, we provide below some statistics about them. The statistics presented are the following:

• Alphabet Size:
We give the alphabet size and the inverse probability of matching (IPM), which is the inverse of the probability that two characters chosen at random match. IPM is a measure of the effective alphabet size: on a uniformly distributed text, it is precisely the alphabet size.

• Compression Ratio:
Since we are dealing with compressed indexes, it is useful to have an idea of the compressibility of the texts using general-purpose compressors. The following compressors are used: gzip gives an idea of compressibility via dictionaries (an LZ77 parsing with limited window size); bzip2 gives an idea of block-sorting compressibility (using the BWT, Section 2.12); ppmdi gives an idea of partial-match-based compression (related to the k-th order entropy, Section 2.3); p7zip gives an idea of LZ77 compression with a virtually unlimited window; and Re-Pair [LM00] gives an idea of grammar-based compression. All compressors were run with the highest compression options.

ftp://mirrors.kernel.org/gnu/coreutils
ftp://ftp.kernel.org/pub/linux/kernel

• Empirical Entropy:
Here we give the empirical entropy H_k of the text, with k ranging from 0 to 8, measured as a compression ratio. We also show, in parentheses, the complexity function of T [Lot02] (the number of contexts), which counts how many different substrings of a given length T has. This is exactly the C(T, k) of Lemma 2.12. This measure has the following properties:
C(T, 1) = σ
C(T, n + m) ≤ C(T, n) · C(T, m)
The lower this measure, the more repetitive the text is. For example, if C(T, n) = 1 ∀n, then T = cc⋯c for some character c. When C(T, n) = n + 1 the sequence is said to be Sturmian (the Fibonacci sequence is an example of a Sturmian string).
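Both statistics can be computed directly (a sketch; IPM here is 1/Σ_c p_c², which equals the alphabet size exactly on uniformly distributed text):

```python
from collections import Counter

def ipm(text):
    # Inverse of the probability that two randomly chosen characters match.
    n = len(text)
    return 1.0 / sum((m / n) ** 2 for m in Counter(text).values())

def contexts(text, k):
    # C(T, k): number of distinct substrings of length k.
    return len({text[i:i + k] for i in range(len(text) - k + 1)})

print(ipm('abcd'))            # 4.0: on a uniform text, IPM = alphabet size
print(contexts('alabar', 1))  # 4, i.e., C(T,1) = sigma = |{a, l, b, r}|
```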
Remark 3.4. The compression ratios are given as the percentage of the compressed file size over the uncompressed file size, assuming the original file uses one byte per character. This means that 25% compression can be achieved over a DNA sequence with alphabet {A, C, G, T} by simply using 2 bits per symbol. As seen from the real-life examples given, these four symbols are usually predominant, so it is not hard to get very close to 25% on general DNA sequences as well.

Tables 3.1-3.3 give the statistics of the artificial texts. (ppmdi: http://pizzachili.dcc.uchile.cl/utils/ppmdi.tar.gz)

Table 3.1: Alphabet statistics for Artificial Collection (columns: File, Size, Σ, IPM)
Table 3.2: Compression statistics for Artificial Collection (columns: File, p7zip, bzip2, gzip, ppmdi, Re-Pair)
Table 3.3: Empirical entropy statistics for Artificial Collection (columns: File, H₀-H₈)
Tables 3.4-3.9 give the statistics of pseudo-real texts.
Table 3.4: Alphabet statistics for Pseudo-Real Collection (Scheme 1) (columns: File, Size, Σ, IPM)
Table 3.5: Alphabet statistics for Pseudo-Real Collection (Scheme 2) (columns: File, Size, Σ, IPM)
Table 3.6: Compression statistics for Pseudo-Real Collection (Scheme 1) (columns: File, p7zip, bzip2, gzip, ppmdi, Re-Pair)
Table 3.7: Compression statistics for Pseudo-Real Collection (Scheme 2) (columns: File, p7zip, bzip2, gzip, ppmdi, Re-Pair)
File (rate)       H0      H1      H2      H3      H4      H5      H6      H7      H8
Xml 0.001%        65.25%  38.63%  21.00%  12.50%  8.13%   6.00%   5.25%   4.75%   4.13%
                  (1) (89) (3325) (20560) (56120) (98084) (134897) (168846) (200451)
Xml 0.01%         65.25%  38.63%  21.00%  12.50%  8.13%   6.00%   5.25%   4.75%   4.13%
                  (1) (89) (4135) (30975) (79379) (131811) (177924) (220923) (261651)
Xml 0.1%          65.25%  38.75%  21.25%  12.75%  8.25%   6.13%   5.38%   4.88%   4.25%
                  (1) (89) (5251) (67479) (196554) (326296) (440199) (550570) (661284)
DNA 0.001%        25.00%  24.25%  24.13%  24.00%  24.00%  23.75%  23.50%  22.88%  21.25%
                  (1) (5) (18) (67) (260) (1029) (4102) (16349) (62437)
DNA 0.01%         25.00%  24.25%  24.13%  24.00%  24.00%  23.75%  23.50%  22.88%  21.25%
                  (1) (5) (18) (67) (260) (1029) (4102) (16368) (63204)
DNA 0.1%          25.00%  24.25%  24.13%  24.00%  24.00%  23.75%  23.50%  22.88%  21.38%
                  (1) (5) (19) (70) (264) (1034) (4109) (16399) (65168)
English 0.001%    57.25%  45.13%  34.75%  25.88%  19.88%  15.88%  12.50%  9.63%   7.25%
                  (1) (106) (2659) (18352) (63299) (145194) (256838) (379514) (501400)
English 0.01%     57.25%  45.13%  34.75%  25.88%  19.88%  15.88%  12.50%  9.63%   7.25%
                  (1) (106) (3243) (24063) (82896) (180401) (305292) (439387) (572056)
English 0.1%      57.25%  45.25%  34.88%  26.13%  20.13%  16.00%  12.50%  9.75%   7.25%
                  (1) (106) (4491) (46116) (190765) (439130) (715127) (983435) (1237512)
Pitches 0.001%    66.13%  61.00%  53.50%  37.13%  16.38%  6.25%   2.88%   1.38%   0.75%
                  (1) (73) (3549) (73664) (376958) (642406) (767028) (833456) (871970)
Pitches 0.01%     66.13%  61.00%  53.50%  37.25%  16.38%  6.25%   2.88%   1.38%   0.75%
                  (1) (73) (3581) (76900) (399435) (684445) (821533) (898126) (946219)
Pitches 0.1%      66.13%  61.13%  53.63%  37.38%  16.63%  6.38%   2.88%   1.50%   0.88%
                  (1) (73) (3733) (95838) (598394) (1096014) (1363610) (1543086) (1687166)
Proteins 0.001%   52.25%  52.13%  51.63%  47.50%  25.13%  4.63%   0.75%   0.25%   0.25%
                  (1) (21) (422) (8045) (128975) (463357) (572530) (589356) (595906)
Proteins 0.01%    52.25%  52.13%  51.63%  47.50%  25.13%  4.63%   0.75%   0.25%   0.25%
                  (1) (21) (422) (8045) (131064) (494845) (626269) (654067) (670075)
Proteins 0.1%     52.25%  52.13%  51.63%  47.50%  25.50%  4.88%   0.88%   0.38%   0.38%
                  (1) (21) (425) (8076) (143879) (768510) (1150595) (1293347) (1403589)
Sources 0.001%    68.75%  46.88%  30.00%  19.63%  14.38%  11.00%  8.38%   6.88%   5.75%
                  (1) (98) (4557) (29667) (75316) (130527) (194105) (259413) (320468)
Sources 0.01%     68.75%  46.88%  30.00%  19.63%  14.38%  11.00%  8.50%   6.88%   5.75%
                  (1) (98) (5621) (42303) (102977) (170525) (244755) (320237) (391260)
Sources 0.1%      68.75%  47.00%  30.25%  19.88%  14.63%  11.13%  8.50%   7.00%   5.88%
                  (1) (98) (7359) (104679) (299799) (498046) (687941) (872189) (1049051)

Table 3.8: Empirical entropy statistics for Pseudo-Real Collection (Scheme 1). The number of contexts C(T, k) is given in parentheses below each row.
File (rate)       H0      H1      H2      H3      H4      H5      H6      H7      H8
Xml 0.001%        65.25%  38.63%  21.13%  12.63%  8.13%   6.00%   5.25%   4.75%   4.13%
                  (1) (89) (3325) (20560) (56120) (98084) (134897) (168846) (200451)
Xml 0.01%         65.25%  39.38%  22.00%  13.25%  8.63%   6.50%   5.63%   5.13%   4.50%
                  (1) (89) (4135) (31042) (79630) (132163) (178388) (221499) (262329)
Xml 0.1%          65.25%  44.00%  28.75%  18.50%  12.25%  9.25%   8.00%   7.13%   6.25%
                  (1) (89) (5255) (72227) (226418) (378994) (513539) (645141) (777226)
DNA 0.001%        25.00%  24.25%  24.13%  24.00%  24.00%  23.75%  23.50%  22.88%  21.25%
                  (1) (5) (18) (67) (260) (1029) (4102) (16349) (62436)
DNA 0.01%         25.00%  24.25%  24.13%  24.13%  24.00%  23.88%  23.50%  23.00%  21.38%
                  (1) (5) (18) (67) (260) (1029) (4102) (16369) (63242)
DNA 0.1%          25.00%  24.50%  24.38%  24.25%  24.25%  24.13%  23.88%  23.50%  22.38%
                  (1) (5) (19) (70) (264) (1034) (4109) (16400) (65387)
English 0.001%    57.25%  45.13%  34.75%  26.00%  20.00%  15.88%  12.50%  9.63%   7.13%
                  (1) (106) (2659) (18353) (63300) (145195) (256838) (379514) (501400)
English 0.01%     57.25%  45.50%  35.38%  26.50%  20.25%  15.88%  12.38%  9.50%   7.13%
                  (1) (106) (3243) (24079) (83037) (180592) (305458) (439539) (572186)
English 0.1%      57.38%  47.75%  39.50%  31.13%  23.00%  16.63%  12.13%  8.88%   6.38%
                  (1) (106) (4482) (47357) (202366) (466838) (749065) (1015587) (1265447)
Pitches 0.001%    66.13%  61.13%  53.63%  37.25%  16.38%  6.25%   2.88%   1.38%   0.75%
                  (1) (73) (3549) (73664) (376958) (642406) (767028) (833456) (871970)
Pitches 0.01%     66.13%  61.13%  53.88%  37.50%  16.50%  6.38%   2.88%   1.38%   0.88%
                  (1) (73) (3581) (76917) (399546) (684518) (821589) (898152) (946228)
Pitches 0.1%      66.13%  62.00%  55.88%  40.25%  17.38%  6.50%   3.13%   1.88%   1.38%
                  (1) (73) (3742) (96359) (606175) (1103560) (1367417) (1545154) (1688526)
Proteins 0.001%   52.25%  52.13%  51.63%  47.50%  25.25%  4.63%   0.75%   0.25%   0.25%
                  (1) (21) (422) (8045) (128975) (463357) (572529) (589356) (595906)
Proteins 0.01%    52.25%  52.13%  51.63%  47.63%  25.75%  5.00%   0.88%   0.50%   0.38%
                  (1) (21) (422) (8045) (131079) (494846) (626306) (654107) (670114)
Proteins 0.1%     52.25%  52.13%  51.75%  48.75%  30.13%  7.63%   2.13%   1.50%   1.38%
                  (1) (21) (426) (8072) (143924) (771311) (1154106) (1297080) (1407901)
Sources 0.001%    68.75%  47.00%  30.00%  19.75%  14.38%  11.00%  8.50%   6.88%   5.75%
                  (1) (98) (4557) (29667) (75316) (130527) (194105) (259413) (320468)
Sources 0.01%     68.75%  47.50%  30.75%  20.13%  14.63%  11.13%  8.63%   7.00%   5.88%
                  (1) (98) (5615) (42337) (103082) (170646) (244874) (320346) (391369)
Sources 0.1%      68.75%  51.25%  36.63%  24.38%  16.75%  12.13%  9.13%   7.25%   6.00%
                  (1) (98) (7372) (108997) (319310) (525914) (718657) (904022) (1080824)

Table 3.9: Empirical entropy statistics for Pseudo-Real Collection (Scheme 2). The number of contexts C(T, k) is given in parentheses below each row.
Tables 3.10-3.12 give the statistics of real texts.
File                       Size     |Σ|   IPM
Cere                       440MiB     5   4.301
Para                       410MiB     5   4.096
Clostridium Botulinum       34MiB     4   3.356
Escherichia Coli           108MiB    15   4.000
Salmonella Enterica         66MiB     9   3.993
Staphylococcus Aureus       38MiB     5   3.579
Streptococcus Pneumoniae    23MiB     8   3.836
Streptococcus Pyogenes      24MiB    10   3.800
Influenza                  148MiB    15   3.845
Coreutils                  196MiB   236   19.553
Kernel                     247MiB   160   23.078
Einstein (en)              446MiB   139   19.501
Einstein (de)               89MiB   117   19.264
Nobel (en)                  85MiB   126   20.070
Nobel (de)                  31MiB   118   17.786
Turing (en)                7.7MiB   103   21.096
Turing (de)                 85MiB   100   19.719
World Leaders               45MiB    89   3.855
Table 3.10: Alphabet statistics for Real Collection
File p7zip bzip2 gzip ppmdi Re-Pair
Cere                       1.14%   2.50%  26.36%  24.09%   1.86%
Para                       1.46%  26.34%  27.07%  24.88%   2.80%
Clostridium Botulinum      8.53%  25.88%  26.47%  24.12%  20.00%
Escherichia Coli           4.72%  26.85%  28.70%  25.93%   9.63%
Salmonella Enterica        5.61%  27.27%  28.79%  25.76%  12.42%
Staphylococcus Aureus      2.89%  26.32%  28.95%  25.00%   5.26%
Streptococcus Pneumoniae   4.78%  26.52%  27.39%  24.78%   9.57%
Streptococcus Pyogenes     5.00%  26.25%  27.08%  25.00%   9.58%
Influenza                  1.35%   6.62%   7.43%   3.78%   3.31%
coreutils                  1.94%  16.33%  24.49%  12.76%   2.55%
kernel                     0.81%  21.86%  27.13%  18.62%   1.13%
einstein.en                0.07%   5.38%  35.20%   1.61%   0.10%
einstein.de                0.11%   4.38%  31.46%   1.35%   0.16%
nobel.en                   0.13%   2.94%  18.82%   1.76%   0.20%
nobel.de                   0.18%   3.55%  27.74%   1.68%   0.30%
turing.en                  1.09%  36.36%  285.71%  15.58%  1.71%
turing.de                  0.03%   0.18%   0.10%   0.11%   0.05%
world leaders              1.29%   7.11%  17.78%   3.56%   1.78%
Table 3.11: Compression statistics for Real Collection
File                      H0      H1      H2      H3      H4      H5      H6      H7      H8
Cere                      27.38%  22.63%  22.63%  22.50%  22.50%  22.50%  22.50%  22.38%  22.25%
                          (1) (5) (25) (125) (610) (2515) (8697) (28080) (88624)
Para                      26.50%  23.50%  23.38%  23.38%  23.38%  23.38%  23.25%  23.25%  23.13%
                          (1) (5) (25) (125) (625) (3125) (14725) (51542) (139149)
Clostridium Botulinum     23.25%  23.00%  22.88%  22.75%  22.75%  22.75%  22.63%  22.50%  22.25%
                          (1) (4) (16) (64) (256) (1024) (4096) (16383) (65118)
Escherichia Coli          25.00%  24.75%  24.50%  24.38%  24.25%  24.25%  24.13%  24.13%  23.88%
                          (1) (15) (145) (779) (2715) (7436) (15641) (32561) (85363)
Salmonella Enterica       25.00%  24.75%  24.50%  24.38%  24.25%  24.13%  24.13%  24.00%  23.75%
                          (1) (9) (35) (97) (299) (1077) (4159) (16457) (65618)
Staphylococcus Aureus     23.88%  23.75%  23.75%  23.63%  23.63%  23.63%  23.50%  23.25%  22.75%
                          (1) (5) (18) (67) (260) (1029) (4102) (16391) (65282)
Streptococcus Pneumoniae  24.63%  24.38%  24.38%  24.25%  24.13%  24.13%  24.00%  23.75%  23.13%
                          (1) (8) (31) (133) (574) (2183) (6928) (21093) (71592)
Streptococcus Pyogenes    24.50%  24.38%  24.25%  24.13%  24.13%  24.13%  24.00%  23.88%  23.25%
                          (1) (10) (50) (174) (456) (1291) (4418) (16758) (65919)
Influenza                 24.63%  24.13%  24.13%  24.00%  23.88%  23.50%  22.00%  18.63%  13.25%
                          (1) (15) (125) (583) (2329) (7978) (21316) (44748) (101559)
coreutils                 68.38%  51.25%  35.88%  23.88%  17.00%  12.88%  10.13%   8.00%   6.50%
                          (1) (236) (18500) (169716) (606527) (1335553) (2258650) (3258896) (4247313)
kernel                    67.25%  50.50%  36.63%  25.75%  19.25%  15.13%  12.13%   9.63%   7.75%
                          (1) (160) (7122) (90396) (351918) (773818) (1305616) (1912604) (2553008)
einstein.en               62.00%  46.38%  33.38%  21.13%  13.25%   9.00%   6.50%   4.75%   3.50%
                          (1) (139) (4546) (28685) (77333) (142559) (211506) (276343) (335151)
einstein.de               63.00%  44.88%  32.63%  20.88%  13.25%   9.00%   6.13%   4.38%   3.13%
                          (1) (117) (3278) (16765) (39010) (64884) (89914) (112043) (130473)
nobel.en                  62.63%  44.63%  30.50%  18.25%  11.50%   8.13%   6.00%   4.50%   3.38%
                          (1) (126) (3566) (18079) (42334) (69855) (95644) (119260) (140401)
nobel.de                  61.13%  43.25%  31.13%  19.63%  12.50%   8.63%   6.00%   4.13%   3.00%
                          (1) (118) (2726) (12959) (30756) (49695) (66108) (80467) (92184)
turing.en                 63.25%  45.75%  32.00%  19.13%  11.50%   7.63%   5.38%   3.88%   2.88%
                          (1) (103) (2794) (14091) (33498) (55489) (75611) (93402) (108636)
turing.de                 62.38%  43.25%  29.25%  16.75%   9.50%   6.00%   3.88%   2.63%   2.00%
                          (1) (100) (1806) (7268) (15407) (23070) (29038) (33714) (37335)
world leaders             43.38%  24.38%  17.25%  11.63%   7.63%   5.13%   4.00%   3.50%   3.13%
                          (1) (89) (2526) (23924) (106573) (246566) (374668) (468701) (547040)
Table 3.12: Empirical entropy statistics for Real Collection. For each file, the first row gives H_k (k = 0..8) and the second row the number of contexts.

3.5 Discussion
It can be seen in the tables presented above that only p7zip and Re-Pair capture the repetitiveness of the texts, achieving a compression ratio at least one order of magnitude better than bzip2, gzip or ppmdi. It can also be noted in Tables 3.6 and 3.7 that p7zip is more robust at capturing the repetitiveness than Re-Pair: with a mutation ratio of 0.1%, p7zip compresses 5 times better than Re-Pair. Table 3.11 also shows that Re-Pair fails to capture some repetitions, as for all DNA texts except para and cere the compression of p7zip is two times better than that of Re-Pair. Tables 3.6 and 3.8 also show that the compression ratio of bzip2, gzip and ppmdi does not change significantly when increasing the repetitiveness of the text (decreasing the mutation ratio). However, Tables 3.7 and 3.9 show that when decreasing the mutation ratio from 0.1% to 0.01% the gain in compression is greater than 10%, but when decreasing the mutation to 0.001% the compression ratio does not improve as much. It can also be seen that the compression ratios of bzip2 and gzip are close to the empirical entropies H_k, whereas, curiously, the ppmdi compression ratios are not well predicted by any H_k. Notice that, since artificial texts are extremely compressible, small constant overheads (usually irrelevant) may produce significant differences in the size of the compressed file.

Chapter 4

LZ-End: A New Lempel-Ziv Parsing

In this chapter we explain some properties of the LZ77 parsing (see Section 2.13) and present a variant that has the advantage of faster text extraction. The results presented in this chapter were published in [KN10].
An interesting property of the LZ77 parsing is that it captures the repetitions of thetext. Text repetitions, as well as single-character edits on a text, alter the numberof phrases of the parsing very little. This explains why LZ77 is so strong on highlyrepetitive collections.
Lemma 4.1.
Given the texts T, T′ and the characters a, b, the following statements hold:

H_LZ(TT)   = H_LZ(T) + 1      (4.1)
H_LZ(TT$)  ≤ H_LZ(T$) + 1     (4.2)
H_LZ(TT′)  ≤ H_LZ(TaT′) + 1   (4.3)
H_LZ(TaT′) ≤ H_LZ(TT′) + 1    (4.4)
H_LZ(TaT′) ≤ H_LZ(TbT′) + 1   (4.5)

where H_LZ(T) is the number of phrases of the LZ77 parsing of T.

Proof.
Assume the last phrase of the LZ77 parsing of T$ is $ and that H_LZ(T$) = n′. That means the first n′ − 1 phrases parse T. Now, if we have the text TT$, the first n′ − 1 phrases parse the first copy of T and the last phrase is T$, hence the inequality of Equation (4.2) holds. Now, assume the last phrase of the parsing is A$ for some A ≠ ε. Then the n′-th phrase of the parsing of TT$ is AB for some B such that 1 ≤ |B| < |T|, thus this phrase does not completely cover the second copy of T. An additional phrase covers the remaining portion of the text, thus equality holds for Equation (4.2). The proof of Equation (4.1) is similar to the second part of Equation (4.2).

Now consider Equation (4.3). Let Z[p] = XY be the last phrase covering T, where X is a suffix of T and Y is a prefix of T′. When adding the new character in the middle, in the worst case the phrase gets converted to Xa (this is the phrase that may increase the total number of phrases). Then the following phrase will cover at least the prefix Y, and each successive phrase will cover at least the next phrase of the original parsing. Hence, the number of phrases is at most one more than the original number of phrases. The proofs for Equations (4.4) and (4.5) are similar to the one above.

The LZ78 parsing [ZL78] described in Section 2.14.3 is not that powerful. On T = a^n it produces n′ = Θ(√n) phrases, and it adds another Θ(√n) phrases on TT. LZ77, instead, produces n′ = log(n) + O(1) phrases on T and just one more phrase on TT.

4.2 LZ-End

In this section we introduce a new LZ-like parsing. Its main characteristic is a faster random text extraction, while its compression is close to that of LZ77.
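As a concrete illustration of the contrast mentioned above, a naive LZ78 parser (an illustrative sketch, not the thesis implementation) exhibits the Θ(√n) phrase count on T = a^n:

```python
def lz78_phrases(text):
    """Greedy LZ78 parse: each phrase is the longest previously generated
    phrase that prefixes the remaining input, plus one new character."""
    seen = {""}                 # phrases produced so far (empty phrase included)
    phrases = []
    i, n = 0, len(text)
    while i < n:
        j = i
        while j < n and text[i:j + 1] in seen:
            j += 1
        phrase = text[i:min(j + 1, n)]   # longest seen phrase + one char
        seen.add(phrase)
        phrases.append(phrase)
        i += len(phrase)
    return phrases

# On a^n the phrases are a, aa, aaa, ...: since 1+2+...+100 = 5050,
# the text a^5050 is parsed into exactly 100 phrases, about sqrt(2n).
print(len(lz78_phrases('a' * 5050)))   # 100
```

The quadratic sum of phrase lengths is what forces the √n behaviour; on repetitive texts this is exactly the weakness that LZ77 and LZ-End avoid.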
Definition 4.2.
The LZ-End parsing of text T[1, n] is a sequence Z[1, n′] of phrases such that T = Z[1] Z[2] . . . Z[n′], built as follows. Assume we have already processed T[1, i − 1], producing the sequence Z[1, p − 1]. Then we find the longest prefix T[i, i′ − 1] of T[i, n] that is a suffix of Z[1] . . . Z[q] for some q < p, set Z[p] = T[i, i′], and continue with i = i′ + 1.

Example 4.3.
Let T = ‘alabar a la alabarda$’; the LZ-End parsing is ‘a’, ‘l’, ‘ab’, ‘ar’, ‘ ’, ‘a ’, ‘la’, ‘ a’, ‘labard’, ‘a$’. In this example, when generating the seventh phrase we cannot copy two characters as in Example 2.31, because ‘la’ does not end at a previous end of phrase. However, ‘l’ does end at an end of phrase, hence we generate the phrase ‘la’. Notice that the number of phrases increased from 9 to 10 with respect to the original LZ77 scheme.

The LZ-End parsing is similar to the one proposed by Fiala and Green [FG89]: theirs restricts where the sources start, while ours restricts where the sources end. This is the key feature that will allow us to extract arbitrary phrases in constant time per extracted symbol, as shown in Section 4.2.2.
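Definition 4.2 translates directly into a parser. The sketch below is a naive cubic-time illustration (not the efficient construction of Section 4.4); on the text of Example 4.3 it reproduces the ten phrases:

```python
def lz_end_parse(text):
    """Naive LZ-End parser (Definition 4.2): each phrase is the longest
    prefix of the remaining text that is a suffix of Z[1]..Z[q] for some
    previous phrase q, plus one explicit trailing character."""
    n = len(text)
    ends = []               # prefix lengths at which previous phrases end
    phrases = []
    i = 0
    while i < n:
        best = 0
        for l in range(1, n - i):        # leave room for the explicit char
            # valid source: an occurrence ending at a previous phrase end
            if any(e >= l and text[e - l:e] == text[i:i + l] for e in ends):
                best = max(best, l)
        phrases.append(text[i:i + best + 1])
        i += best + 1
        ends.append(i)
    return phrases

print(lz_end_parse('alabar a la alabarda$'))
# ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la', ' a', 'labard', 'a$']
```

Note how the seventh phrase comes out as ‘la’: the match ‘la’ exists in the text but does not end at a phrase boundary, so only ‘l’ is a valid source.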
The output of an LZ77 compressor is, essentially, the sequence of triplets z(p) = (j, ℓ, c), such that the source of Z[p] = T[i, i′] is T[j, j + ℓ − 1], where ℓ = i′ − i and c = T[i′]. This format allows fast decompression of T, but not decompressing an individual phrase Z[p] in reasonable time (one must basically decompress the whole text).

The LZ-End parsing, although it potentially generates more phrases than LZ77, permits a shorter encoding of each, of the form z(p) = (q, ℓ, c), such that the source of Z[p] = T[i, i′] is a suffix of Z[1] . . . Z[q], and the rest is as above. This representation is shorter because it stores a phrase identifier rather than a text position. We introduce a more sophisticated encoding that will, in addition, allow us to extract individual phrases in constant time per extracted symbol:

• char[1, n′] (using n′⌈log σ⌉ bits) encodes the trailing characters (c above).
• source[1, n′] (using n′⌈log n′⌉ bits) encodes the phrase identifier where the source ends (q above).
• B[1, n] (using n′ log(n/n′) + O(n′ + n log log n / log n) bits in compressed form [RRR02], see Section 2.5) marks the ending positions of the phrases in T.

Thus we have z(p) = (q, ℓ, c) = (source[p], select(B, p + 1) − select(B, p) − 1, char[p]). We can also know in constant time that the source of phrase p ends at select(B, source[p]) and that it starts ℓ positions before. Finally, we can know that the text position i belongs to phrase Z[rank(B, i − 1) + 1].
Extract(start, len)
 1  if len > 0 then
 2      end ← start + len − 1
 3      p ← rank(B, end)
 4      if B[end] = 1 then
 5          Extract(start, len − 1)
 6          output char[p]
 7      else
 8          pos ← select(B, p) + 1
 9          if start < pos then
10              Extract(start, pos − start)
11              len ← end − pos + 1
12              start ← pos
13          Extract(select(B, source[p + 1]) − select(B, p + 1) + start + 1, len)

Figure 4.1: LZ-End extraction algorithm for T[start, start + len − 1].

The algorithm to extract an arbitrary substring in LZ-End is given in Figure 4.1. The extraction works from right to left. First we compute the last phrase p intersecting the substring. If the last character is stored explicitly, i.e., it is an end of phrase (see line 4), we output char[p] and recursively extract the remaining substring (line 5). Otherwise we split the substring into two parts. The right one is the intersection of the rightmost phrase covering the substring and the substring itself, and is extracted recursively by going to the source of that phrase (line 13). The left part is also extracted recursively (line 10). While the algorithm works for extracting any substring, we can prove it takes constant time per extracted symbol when the substring ends at a phrase.

Theorem 4.4.
Function Extract outputs a text substring T[start, end] ending at a phrase in time O(end − start + 1).

Proof. If T[start, end] ends at a phrase, then B[end] = 1. We proceed by induction on len = end − start + 1. The case len ≤ 1 is trivial. For len > 1, we output T[end] at line 6 after a recursive call on the same phrase and length len − 1. This time we go to line 8. The current phrase (now p + 1) starts at pos. If start < pos, we carry out a recursive call at line 10 to handle the segment T[start, pos − 1], which ends where phrase p ends; induction shows that this takes time O(pos − start + 1). Now the segment T[max(start, pos), end] is contained in Z[p + 1] and it finishes one symbol before the phrase ends. Thus a copy of it finishes where Z[source[p + 1]] ends, so induction applies also to the recursive call at line 13, which extracts the remaining string from the source instead of from Z[p + 1], also in constant time per extracted symbol.

We have shown that the algorithm extracts in optimal time any substring that ends at the end of a phrase. Extracting an arbitrary substring, instead, may be more expensive than an end-of-phrase aligned one.

Definition 4.5.
Let T = Z[1] Z[2] . . . Z[n′] be an LZ-parsing of T[1, n]. Then the height of the parsing is defined as H = max_{1≤i≤n} C[i], where C is defined as follows. Let Z[i] = T[a, b] be a phrase whose source is T[c, d]; then

C[k] = C[(k − a) + c] + 1, for all a ≤ k < b,
C[b] = 1.

Array C counts how many times a character was transitively copied from its original source. This is also the extraction cost of that character. Hence, the value H is the worst-case bound for extracting a single character in the LZ parse.

Lemma 4.6.
In an LZ-End parsing it holds that H is at most the length of the longest phrase, i.e., H ≤ max_{1≤p≤n′} |Z[p]|.

Proof. We will prove by induction that C[i] ≤ C[i + 1] + 1 for all 1 ≤ i < n. From this inequality the lemma follows: for all positions i_p where a phrase p ends, it holds by definition that C[i_p] = 1; thus, for all positions i in the phrase p, we have C[i] ≤ C[i_p] + (i_p − i) ≤ |Z[p]|.

The first phrase of any LZ-End parsing is T[1], and the second is either T[2] or T[2]T[3]. In the first case we have C[1]C[2] = 1, 1; in the latter, C[1]C[2]C[3] = 1, 2, 1. Assume now that the inequality holds up to the position i_p where the phrase Z[p] ends. Let i_{p+1} be the position where the phrase Z[p+1] = T[a, b] ends (so a = i_p + 1 and b = i_{p+1}), and let T[c, d] be its source. For all i_p + 1 ≤ i < i_{p+1}, C[i] = C[(i − a) + c] + 1, and since d ≤ i_p, the inequality holds by the inductive hypothesis for i_p + 1 ≤ i ≤ i_{p+1} − 2. By definition of the LZ-End parsing, the source of a phrase ends at a previous end of phrase, hence C[i_{p+1} − 1] = C[d] + 1 = 2 ≤ C[i_{p+1}] + 1. For position i_{p+1} (end of phrase) the inequality trivially holds, as C there has by definition the least possible value.

The above lemma does not hold for LZ77. Moreover, the LZ-End parsing yields a better extraction complexity.
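Both the extraction procedure of Figure 4.1 and the C array of Definition 4.5 can be exercised on the parse of Example 4.3. In the sketch below the source identifiers were derived by hand from that example (they are an assumption of this illustration), and rank/select are naive linear scans instead of the compressed bitmap of [RRR02]:

```python
text = 'alabar a la alabarda$'
phrases = ['a', 'l', 'ab', 'ar', ' ', 'a ', 'la', ' a', 'labard', 'a$']
char = [p[-1] for p in phrases]            # explicit trailing characters
source = [0, 0, 1, 1, 0, 1, 2, 5, 4, 1]    # source phrase ids (0 = empty source)
B = [0] * (len(text) + 1)                  # 1-based bitmap of phrase ends
e = 0
for p in phrases:
    e += len(p)
    B[e] = 1

def rank(i):                               # number of 1s in B[1..i]
    return sum(B[1:i + 1])

def select(j):                             # position of the j-th 1; select(0) = 0
    c = 0
    for k in range(1, len(B)):
        c += B[k]
        if c == j:
            return k
    return 0

def extract(start, length):                # Figure 4.1, 1-based positions
    if length <= 0:
        return ''
    end = start + length - 1
    p = rank(end)
    if B[end]:                             # last character stored explicitly
        return extract(start, length - 1) + char[p - 1]
    left = ''
    pos = select(p) + 1                    # start of phrase p + 1
    if start < pos:
        left = extract(start, pos - start)
        length, start = end - pos + 1, pos
    # jump to the source of phrase p + 1 (line 13 of Figure 4.1)
    return left + extract(select(source[p]) - select(p + 1) + start + 1, length)

assert extract(1, len(text)) == text

# Copy depths C[1..n] (Definition 4.5) and the height H of the parsing
C = [0] * (len(text) + 1)
b = 0
for p, ph in enumerate(phrases):
    a, b = b + 1, b + len(ph)              # phrase Z[p+1] = T[a, b]
    C[b] = 1                               # explicit trailing character
    d = select(source[p])                  # the source of Z[p+1] ends here
    for k in range(a, b):                  # copied part has length |Z[p+1]| - 1
        C[k] = C[k - a + d - (len(ph) - 1) + 1] + 1
H = max(C[1:])
print(H, max(len(p) for p in phrases))     # H = 3 <= longest phrase = 6
```

On this example H = 3 while the longest phrase (‘labard’) has length 6, in line with Lemma 4.6.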
Lemma 4.7.
Extracting a substring of length ℓ from an LZ-End parsing takes time O(ℓ + H).

Proof. Theorem 4.4 already shows that the cost to extract a substring ending at a phrase boundary is constant per extracted symbol. The only piece of the code in Figure 4.1 that does not amortize in this sense is line 13, where we recursively unroll the last phrase, removing the last character each time, until hitting the end of the substring to extract. By definition of H, this line cannot be executed more than H times. So the total time is O(ℓ + H).

Remark 4.8.
On a text coming from an ergodic Markov source of entropy h, the expected value of the longest phrase is O(log n / h). However, as we are dealing with highly repetitive texts, this expected bound does not apply.

Remark 4.9.
Algorithm Extract (Figure 4.1) also works on the LZ77 parsing, but in this case the best theoretical bound we can prove for extracting a substring of length ℓ is O(ℓ · H). However, the results in Section 4.5.3 suggest that on average it may be much better.

4.3 Compression Performance

We now study the compression performance of LZ-End, first with respect to the empirical k-th order entropy and then on repetitive texts. We prove that LZ-End is coarsely optimal. The main tool is the following lemma.
Lemma 4.10.
All the phrases generated by an LZ-End parse are different.

Proof. Assume by contradiction that Z[p] = Z[p′] for some p < p′. When Z[p′] was generated, we could have taken Z[p] as the source, yielding the phrase Z[p′]c, longer than Z[p′]. This is clearly a valid source, as Z[p] is a suffix of Z[1] . . . Z[p]. So this is not an LZ-End parse.

Lemma 4.11 ([LZ76]). Any parsing of T[1, n] into n′ distinct phrases satisfies n′ = O(n / log_σ n), where σ is the alphabet size of T.

Lemma 4.12 ([KM99]). For any text T[1, n] parsed into n′ different phrases, it holds that n′ log n′ ≤ nH_k(T) + n′ log(n/n′) + Θ(n′(1 + k log σ)), for any k.

Lemma 4.13.
For any text T[1, n] parsed into n′ different phrases, using LZ77 or LZ-End, it holds that n′ log n ≤ nH_k(T) + o(n log σ) for any k = o(log_σ n).

Proof. Arroyuelo and Navarro [AN] prove that the property holds for any LZ parsing for which Lemmas 4.11 and 4.12 hold. In particular, it holds for LZ77 and for our proposal, LZ-End.
Theorem 4.14.
The LZ-End compression is coarsely optimal.

Proof. The proof is based on the one by Kosaraju and Manzini [KM99] for LZ77. Here we consider in addition our particular encoding (the result holds for triplets (q, ℓ, c) as well). The size of the parsing in bits is

LZ-End(T) = n′⌈log σ⌉ + n′⌈log n′⌉ + n′ log(n/n′) + O(n′ + n log log n / log n)
          = n′ log n + O(n′ log σ + n log log n / log n).

Thus, from Lemmas 4.10 and 4.12, we have

LZ-End(T) ≤ nH_k(T) + 2 n′ log(n/n′) + O(n′ (k + 1) log σ + n log log n / log n).

Now, by means of Lemma 4.11, and since n′ log(n/n′) is increasing in n′, we get

LZ-End(T) ≤ nH_k(T) + O(n log σ log log n / log n) + O(n (k + 1) log σ / log n + n log log n / log n)
          = nH_k(T) + O(n log σ (log log n + (k + 1) log σ) / log n).

Thus, dividing by n and taking k and σ as constants, we get that the compression ratio is

ρ(T) ≤ H_k(T) + O(log log n / log n).

We have not found a worst-case bound for the competitiveness of LZ-End compared to LZ77. However, we show, on the negative side, a family of sequences that produces almost twice the number of phrases when parsed with LZ-End, so LZ-End is at best 2-competitive with LZ77. On the positive side, we show that LZ-End satisfies some of the properties of Lemma 4.1.
Example 4.15.
Let T = … (σ−1)(σ−1) σ … be a text over an alphabet of size σ whose length is n = 3(σ−1) + …. The size of the LZ77 parsing is n′ = σ + …, while the size of the LZ-End parsing is n′ = 2(σ−1) + …, i.e., almost twice as many phrases.

Lemma 4.16.
Given a text T, the following statements hold:

H_LZ-End(TT)  ≤ H_LZ-End(T) + 2     (4.6)
H_LZ-End(TT$) ≤ H_LZ-End(T$) + 1    (4.7)

where H_LZ-End(T) is the number of phrases of the LZ-End parsing.

Proof. Assume H_LZ-End(T$) = n′ and that the last phrase of the LZ-End parsing of T$ is $. That means the first n′ − 1 phrases parse T. Now, if we have the text TT$, the first n′ − 1 phrases parse the first copy of T and the last phrase is T$ (since T ends at the end of the (n′ − 1)-th phrase, this is a valid LZ-End source), so H_LZ-End(TT$) = H_LZ-End(T$). Now, assume the last phrase of the parsing is A$ for some A ≠ ε, and that the prefix of T covered by the first n′ − 1 phrases ends with xX, where x is a character. Then the n′-th phrase of the parsing of TT$ is at least Ax, and the (n′ + 1)-th phrase is XA$, thus equality holds for Equation (4.7). The situation is analogous if the n′-th phrase extends beyond Ax. For Equation (4.6), consider that T is parsed so that the covered prefix ends with xX and the last phrase is aAb (where x, a and b are characters and A ≠ ε is a string). Then the (n′ + 1)-th phrase of TT is at least xXa, and thus the (n′ + 2)-th phrase is at least Ab, because there must exist a phrase ending in A for the phrase aAb to exist. If, instead, the n′-th phrase is just a, then the (n′ + 1)-th phrase is at least xXa. In both cases at most two new phrases are created.
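Equation (4.6) can be checked empirically with the naive parser sketched earlier (repeated here so the block is self-contained; small strings only, as the parser is cubic):

```python
def lz_end_parse(text):
    """Naive LZ-End parser (Definition 4.2).  Illustration only."""
    n = len(text)
    ends, phrases, i = [], [], 0
    while i < n:
        best = 0
        for l in range(1, n - i):
            if any(e >= l and text[e - l:e] == text[i:i + l] for e in ends):
                best = max(best, l)
        phrases.append(text[i:i + best + 1])
        i += best + 1
        ends.append(i)
    return phrases

# Equation (4.6): duplicating a text adds at most two phrases.
for T in ['abracadabra', 'mississippi', 'alabar a la alabarda']:
    assert len(lz_end_parse(T + T)) <= len(lz_end_parse(T)) + 2
```

On these three strings the bound is in fact tight: the parse of TT has exactly two phrases more than the parse of T.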
F stores pairs (cid:104) phrase identifier, suffixarray position (cid:105) and answers successor queries on the text position. BW S ( sp, ep, c )was defined in Section 2.12. We present an algorithm to compute the parsing LZ-End, inspired by the algo-rithm CSP2 by Chen et al. [CPS08]. We compute the range of all text prefixes endingwith a pattern P , rather than suffixes starting with P [FG89].We first build the suffix array (Section 2.11) A [1 , n ] of the reverse text, T rev = T [ n − . . . T [2] T [1] $ , so that T rev [ A [ i ] , n ] is the lexicographically i -th smallest suffixof T rev . We also build its inverse permutation: A − [ j ] is the lexicographic rank of T rev [ j, n ]. Finally, we build the Burrows-Wheeler Transform (BWT) (Section 2.12)of T rev , T bwt [ i ] = T rev [ A [ i ] −
1] (or T rev [ n ] if A [ i ] = 1).On top of the BWT we will apply backward search (Section 2.12) to find outwhether there are occurrences of a T [ i, i (cid:48) −
1] (Definitions 2.30 and 4.2).Since, for LZ-End, the phrases must in addition finish at a previous phrase end,we maintain a dynamic set F where we add the ending positions of the successivephrases we create, mapped to A . That is, once we create phrase Z [ p ] = T [ i, i (cid:48) ], we57 .4 Construction Algorithm Chapter 4 LZ-End: A New Lempel-Ziv Parsing
10 16 21 9 15 13 19 6 3 11 17 2 7 4 20 8 14 12 18 5 121 12 9 14 20 8 13 16 4 1 10 18 6 17 5 2 11 19 7 15 3 {5,18,14,20,4,2}alabar_a_la_alabarda$
Figure 4.3: Example of LZ-End construction algorithminsert A − [ n + 1 − i (cid:48) ] into F .Backward search over T rev adapts very well to our purpose. By considering thepatterns P = ( T [ i, i (cid:48) − rev for consecutive values of i (cid:48) , we are searching backwardsfor P in T rev , and thus finding the ending positions of T [ i, i (cid:48) −
1] in T , by carryingout one further BWS step for each new i (cid:48) value. Thus we can use F naturally.As we advance i (cid:48) in T [ i, i (cid:48) − A [ sp, ep ] contains some occurrencefinishing before i in T , that is, starting after n + 1 − i in T rev . If it does not, thenwe stop looking for larger i (cid:48) values as there are no matches preceding T [ i ]. For this,we precompute a Range Maximum Query (RMQ) data structure [FH07] on A , whichanswers queries mpos = arg max sp ≤ k ≤ ep A [ k ]. Then if A [ mpos ] is not large enough,we stop.In addition, we must know if i (cid:48) finishes at some phrase end, i.e., if F containssome value in [ sp, ep ]. A successor query on F finds the smallest value f pos ≥ sp in F . If f pos ≤ ep , then it represents a suitable LZ-End source for T [ i, i (cid:48) ]. Otherwise,as the condition could hold again for a later [ sp, ep ] range, we do not stop but recallthe last j = i (cid:48) where it was valid. Once we stop because no matches ending before T [ i ] exist, we insert phrase Z [ p ] = T [ i, j ] and continue from i = j + 1. This mayretraverse some text since we had processed up to i (cid:48) ≥ j . We call N ≥ n the totalnumber of text symbols processed.The algorithm is depicted in Figure 4.2. Example 4.17.
Figure 4.3 shows the structures used during the parsing of the string ‘alabar a la alabarda$’. The array A corresponds to the suffix array of the reversed text and A⁻¹ to its inverse permutation. The figure shows the parsing up to the 6th phrase and the values inserted into the dictionary F. The values inserted into F are A⁻¹[len − i], where i is the ending position of a phrase and len is the length of the text. For example, the second phrase ends at position 2, thus the value inserted corresponds to A⁻¹[21 − 2] = 18.
Now we continue the process to generate the next phrase. First, using BWS we find the interval of A that represents the suffixes (of the reverse text) starting with ‘l’, obtaining the range [17, 18] (right gray zone). Then we look in F for the successor of 17, obtaining the value 18, which is still in the range. Hence, we have found a valid source. Afterward, we continue with the next character. Again, with BWS we find the interval of A representing the suffixes (of the reverse text) starting with ‘al’ (left gray zone), which are the prefixes of the text ending with ‘la’. This gives us a range starting at 11. Then we look in F for the successor of 11, which is 14. Since this value is outside the interval, there are no valid sources. We continue this process until there are no more possible sources. Finally, we get that the only valid source is ‘l’, generating the new phrase ‘la’.

In theory, the construction algorithm can work within bit space (1) nH_k(T_rev) + o(n log σ) = nH_k(T) + o(n log σ) (since nH_k(T) = nH_k(T_rev) + O(log n) [FM05, Theorem A.3]) for building the BWT incrementally [GN08]; plus (2) 2n + o(n) bits for the RMQ structure [FH07]; plus (3) O(n′ log n) bits for a successor data structure. After building the BWT incrementally in time O(n log n ⌈log σ / log log n⌉) [GN08], we can make it static, so that it supports access to the successive characters of T in time O(⌈log σ / log log n⌉), as well as to A and A⁻¹ in time O(log^ε n) for any constant ε > 0. The RMQ structure is built in O(n) time and within the same final space, and answers queries in constant time. The successor data structure could be a simple balanced search tree, with leaves holding Θ(log n) elements, so that the access time is O(log n) and the space is n′ log n (1 + o(1)) [Mun86]. Thus, using Lemmas 4.11 and 4.12, the overall construction space is 2n(H_k(T) + 1) + o(n log σ) bits, for any k = o(log_σ n). The time is dominated by the BWT construction, O(n log n ⌈log σ / log log n⌉), plus the N accesses to A, O(N log^ε n).
If, instead, we use O(n log n) bits of space, we can build and store explicitly A and A⁻¹ in O(n) time [KS03]. The overall time becomes O(N ⌈log σ / log log n⌉).

Note that a simplification of our construction algorithm, disregarding F (and thus with N = n), builds the LZ77 parsing using just n(H_k(T) + 2) + o(n log σ) bits and O(n log n (log^ε n + o(log σ))) time, which is less than the best existing solutions [OS08, CPS08].

In practice, our implementation of the algorithm works within byte space (1) n, as we maintain T explicitly; plus (2) between 2n and 3n for our implementation of the BWT (following Navarro's “large” FM-index implementation [Nav09], where L is maintained explicitly); plus (3) 4n for A, which is maintained explicitly; plus (4) under 1n for Fischer's implementation of RMQ [FH07]; plus (5) n for A⁻¹, using a sampling-based implementation of inverse permutations [MRRR03] (Section 2.7); plus (6) 12n′ for a balanced binary tree implementing the successor structure. This adds up to less than 10n bytes in practice. A is built in O(n log n) time in practice; the other construction times are O(n). After this, the time of the algorithm is O(N log n′) = O(N log n). As we see soon, N is usually (but not always) only slightly larger than n; we now prove it is limited by the phrase lengths.

Lemma 4.18.
The amount of text retraversed at any step is < |Z[p]| for some p.

Proof. Say the last valid match T[i, j − 1] was with a suffix of Z[1] . . . Z[p − 1] for some p; thereafter we worked until T[i, i′ − 1] without finding any other valid match, and then formed the phrase Z[p] = T[i, j] (with source ending at phrase p − 1). The retraversed text is T[j + 1, i′ − 1], which must be shorter than Z[p], since otherwise Z[1] . . . Z[p] would have been a valid match.

Remark 4.19.
On ergodic sources with entropy h, the expected length of the longest phrase is O(log n / h) (Remark 4.8), so by Lemma 4.18 the retraversed text is expected to be small and N = O(n); but, as explained, this is not a realistic model for repetitive texts.

4.5 Experimental Results

We implemented two different LZ-End encoding schemes. The first is as explained in
LZ-End2) we store the starting position of the source, select(B, source[p]), rather than the identifier of the source, source[p]. This in theory raises the nH_k(T) term in the space to 2nH_k(T) (and noticeably in practice, as seen soon), yet we save one select operation at extraction time (line 13 in Figure 4.1), which has a significant impact on performance. In both implementations, bitmap B is represented by δ-encoding the consecutive phrase lengths (Section 2.5.2). Recall that, in a δ-encoded bitmap, select(B, p) and select(B, p + 1) cost O(1) after solving p ← rank(B, end), thus LZ-End2 does no select operations for extracting.

We compare our compressors with
LZ77 and
LZ78 implemented by ourselves.LZ77 triples are encoded in the same way as LZ-End2. We include the best performingcompressors of Chapter 3, p7zip and Re-Pair. Compared to p7zip, LZ77 differs inthe final encoding of the triples, which p7zip does better. This is orthogonal to theparsing issue we focus on in this thesis. We also implemented
LZB [Ban09], whichlimits the distance dist at which the phrases can be from their original (not transitive)sources, so one can decompress any window by starting from that distance behind;and
LZ-Cost, a novel proposal where we limit the number of times any text character can be copied (i.e., its C[·] value in Definition 4.5), thus directly limiting the maximum cost per character extraction. We have found no efficient parsing algorithm for LZB and LZ-Cost, thus we test them on small texts only. We also implemented
LZ-Begin, the “symmetric” variant of LZ-End, which also allows random phrase extraction in constant time per extracted symbol. LZ-Begin forces the source of a phrase to start where some previous phrase starts, just like Fiala and Green [FG89], yet phrases have a leading rather than a trailing character. Although the parsing is much simpler, the compression ratio is noticeably worse than that of LZ-End, as we will see in Section 4.5.1.

We used the texts of the Canterbury corpus (http://corpus.canterbury.ac.nz), the 50 MB texts from the Pizza&Chili corpus (http://pizzachili.dcc.uchile.cl), and highly repetitive texts from the previous chapter. We use a 3.0 GHz Core 2 Duo processor with 4 GB of main memory, running Linux 2.6.24, and the g++ compiler (gcc version 4.2.4) with -O3 optimization.
Table 4.1 gives compression ratios for the different collections and parsers. Figure 4.4 shows the same results graphically for one representative text of each collection. For LZ-End we omit the sampling for bitmap B, as it can be reconstructed on the fly at loading time. LZ-End is usually 5% worse than LZ77, and at most 10% over it on general texts and 20% on the highly repetitive collections, where the compression ratios are nevertheless excellent. LZ78 is from 20% better to 25% worse than LZ-End on typical texts, but it is orders of magnitude worse on highly repetitive collections. With parameter log(n)/2, LZ-Cost is usually close to LZ77, yet sometimes it is much worse, and it is never better than LZ-End except by negligible margins. LZB is not competitive at all. Finally, LZ-Begin is about 30% worse than LZ77 on typical texts, and up to 40 times worse for repetitive texts. This is because not all phrases of the parsing are unique (Lemma 4.20); this property was the key to proving the coarse optimality of the LZ parsings.
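The duplicate-phrase behaviour behind this penalty is easy to reproduce. Below is a minimal sketch (our own, not the thesis' implementation) of the LZ-Begin rule as described above: each phrase consists of a leading explicit character followed by the longest copy of the upcoming text whose source starts at the beginning of some previous phrase.

```python
def lz_begin(t):
    """Toy LZ-Begin parser: phrase = explicit leading character plus the
    longest copy whose source starts at a previous phrase boundary."""
    phrases, starts = [], []
    i = 0
    while i < len(t):
        best = 0
        for s in starts:  # sources must start where a previous phrase starts
            l = 0
            while i + 1 + l < len(t) and t[s + l] == t[i + 1 + l]:
                l += 1
            best = max(best, l)
        starts.append(i)
        phrases.append(t[i:i + 1 + best])
        i += 1 + best
    return phrases

# Counterexample of Lemma 4.20 with A = "ab", x = "x", y = "y", z = "z":
print(lz_begin("abxyabyabz"))  # the phrase "yab" appears twice
```

Running it on the counterexample of Lemma 4.20 yields the parse a | b | x | yab | yab | z, where the phrase "yab" is generated twice.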
Lemma 4.20. Not all the phrases generated by an LZ-Begin parsing are different.

Proof. We prove this lemma by showing a counterexample. Let T = A x y A y A z, where x, y, z are distinct characters and A is a string. Suppose we have parsed up to A x; then the next phrase will be y A, and the following phrase will also be y A.

The above results show that LZ-End achieves very competitive compression ratios, even in the challenging case of highly repetitive sequences, where LZ77 excels.

Canterbury        Size (KiB)   LZ77      LZ78     LZ-End    LZ-Cost   LZB       LZ-Begin   Re-Pair
alice29.txt       148.52       47.17%    49.91%   49.32%    48.51%    61.75%    59.02%     72.29%
asyoulik.txt      122.25       51.71%    52.95%   53.51%    52.41%    66.42%    62.34%     81.52%
cp.html           24.03        43.61%    53.60%   45.53%    46.27%    66.26%    56.93%     78.65%
fields.c          10.89        39.21%    54.73%   41.69%    44.44%    61.32%    60.61%     65.19%
grammar.lsp       3.63         48.48%    57.85%   50.41%    56.30%    67.02%    67.14%     85.60%
lcet10.txt        416.75       42.62%    46.83%   44.65%    43.44%    56.72%    54.21%     57.47%
plrabn12.txt      470.57       50.21%    49.34%   52.06%    50.83%    63.55%    59.15%     74.32%
xargs.1           4.13         57.87%    65.38%   59.56%    59.45%    86.37%    73.14%     107.33%
aaa.txt           97.66        0.055%    0.51%    0.045%    1.56%     0.95%     0.040%     0.045%
alphabet.txt      97.66        0.110%    4.31%    0.105%    0.23%     1.15%     0.100%     0.081%
random.txt        97.66        107.39%   90.10%   105.43%   107.40%   121.11%   106.9%     219.24%
E.coli            4529.97      34.13%    27.70%   34.72%    -         -         35.99%     57.63%
bible.txt         3952.53      34.18%    36.27%   36.44%    -         -         43.98%     41.81%
world192.txt      2415.43      29.04%    38.52%   30.99%    -         -         41.52%     38.29%
pi.txt            976.56       55.73%    47.13%   55.99%    -         -         57.36%     108.08%
Pizza&Chili   Size (MiB)   LZ77     LZ78     LZ-End   LZ-Begin   Re-Pair
Sources       50           28.50%   41.14%   31.00%   41.95%     31.07%
Pitches       50           44.50%   59.30%   45.78%   57.22%     59.90%
Proteins      50           47.80%   53.20%   47.84%   54.95%     71.29%
DNA           50           31.88%   28.12%   32.76%   34.28%     45.90%
English       50           31.12%   41.80%   31.12%   38.54%     30.50%
XML           50           17.00%   21.24%   17.64%   25.49%     18.50%
Repetitive           Size (MiB)   LZ77           LZ78     LZ-End         LZ-Begin       Re-Pair
Wikipedia Einstein   357.40       9.97 × 10^-…%  9.29%    1.01 × 10^-…%  4.27%          1.04 × 10^-…%
World Leaders        40.65        1.73%          15.89%   1.93%          7.97%          1.89%
Rich String 11       48.80        3.20 × 10^-…%  0.82%    4.18 × 10^-…%  0.01%          3.75 × 10^-…%
Fibonacci 42         255.50       7.32 × 10^-…%  0.40%    5.32 × 10^-…%  6.07 × 10^-…%  2.13 × 10^-…%
Para                 409.38       2.09%          25.49%   2.48%          7.29%          2.74%
Cere                 439.92       1.48%          25.33%   1.74%          6.15%          1.86%
Coreutils            195.77       3.18%          27.57%   3.35%          7.33%          2.54%
Kernel               246.01       1.35%          30.02%   1.43%          3.43%          1.10%

Table 4.1: Compression ratio of different parsings, in percentage of compressed over original size. We use parameter cost = (log n)/2 for LZ-Cost and dist = n/… for LZB.
Figure 4.4: Compression ratio for different compressors (LZ77, LZ78, LZ-End, LZ-Begin, Re-Pair, LZ-Cost, LZB), as a ratio of the original text.

Consistently with Chapter 3, the Re-Pair results show that grammar-based compression is a relevant alternative. Yet, we note that it is only competitive on highly repetitive sequences, where most of the compressed data is in the dictionary. This implementation applies sophisticated compression to the dictionary, which we do not apply on our compressors. Those sophisticated dictionary compression techniques prevent direct access to the grammar rules, which is essential for extracting substrings.
Figure 4.5 shows parsing times on two files for LZ77 (implemented following CSP2 [CPS08]), LZ-End with the algorithm of Section 4.4, and p7zip. We show separately the time of the suffix array construction algorithm we use, libdivsufsort (http://code.google.com/p/libdivsufsort), which is common to LZ77 and LZ-End. Our LZ77 construction time is competitive with the state of the art (p7zip), thus the excess of LZ-End is due to the more complex parsing. Least-squares fitting for the nanoseconds/char yields 10.… n + O(1/n) (LZ77) and 82.… n + O(1/n) (LZ-End) for the Einstein text, and 32.… n + O(1/n) (LZ77) and 127.… n + O(1/n) (LZ-End) for XML. The correlation coefficient is always over 0.999, which suggests that N = O(n) and that our parsing takes O(n log n) time in practice. Indeed, across all of our collections, the ratio N/n stays between 1.05 and 1.37, except on aaa.txt and alphabet.txt, where it is 10–14 (which suggests that N = ω(n) in the worst case). Figure 4.6 shows the total text traversed by the LZ-End parsing algorithm for two different texts.

Figure 4.5: Parsing times for XML and Wikipedia Einstein, in microseconds per character (time(µs)/(n log n) vs. log n, for LZ77, LZ-End, P7Zip and the suffix array construction SA; PizzaChili XML File, size 282MiB, and Wikipedia Einstein, size 357MiB).

Figure 4.6: Total text traversed during the LZ-End construction algorithm (total work/length vs. log(size in MB), for XML and Wikipedia).

The LZ-End parsing time breaks down as follows. For XML: BWS 36%, RMQ 19%, tree operations 33%, SA construction 6%, and inverse SA lookups 6%. For Einstein: BWS 56%, RMQ 19%, tree operations 17%, and SA construction 8% (the inverse SA lookups take negligible time).
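The fits above are ordinary least squares on a two-parameter model. A minimal sketch of the same procedure, with synthetic data of our own rather than the measured times: fit per-character time = a + b/n by linear regression on the regressor 1/n.

```python
def fit_affine(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Synthetic per-character times following t(n) = 10.5 + 3000/n nanoseconds
ns = [2 ** k for k in range(19, 29)]
ts = [10.5 + 3000.0 / n for n in ns]
a, b = fit_affine([1.0 / n for n in ns], ts)
print(round(a, 3), round(b, 1))  # recovers 10.5 and 3000.0
```

The constant term a is the asymptotic nanoseconds-per-character rate, and b captures the O(1/n) lower-order effect; the values 10.5 and 3000 here are made-up illustrations, not the thesis' measurements.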
Figure 4.7 shows the extraction speed for arbitrary substrings of increasing length. The three parsings (LZ77, LZ-End and LZ-End2) are parameterized to use approximately the same space, 550KiB for Wikipedia Einstein and 64MiB for XML. This is achieved by adjusting the sample step s of the δ-encoded bitmaps (Section 2.5.2). It can be seen that (1) the time per character stabilizes after some extracted length, as expected from Lemma 4.7; (2) the LZ-End variants extract faster than LZ77, especially on very repetitive collections; and (3) LZ-End2 is faster than LZ-End, even if the latter invests its better compression in a denser sampling. Least-squares fittings for the extraction time of a substring of length m are given in Table 4.2.

Pizza&Chili XML
Scheme    Model
LZ77      4.44 + 0.… m
LZ-End    7.40 + 0.… m
LZ-End2   6.41 + 0.… m

Wikipedia Einstein
Scheme    Model
LZ77      19.09 + 0.… m
LZ-End    5.75 + 0.… m
LZ-End2   5.64 + 0.… m

Table 4.2: Least-squares fittings for extraction time. All correlation coefficients are over 0.999.
Figure 4.7: Extraction speed vs. extracted length, for XML (PizzaChili XML File, size 282MiB) and Wikipedia Einstein (size 357MiB); curves for LZ77, LZ-End and LZ-End2.

We now set the extraction length to 1,000 and measure the extraction speed per character, as a function of the space used by the data and the sampling. Here we use bitmap B and its sampling for the other formats as well. LZB and LZ-Cost also have their own space/time trade-off parameter; we tried several combinations and chose the points dominating the others. Figure 4.8 shows the results for small and large files. It can be seen that LZB is not competitive, whereas LZ-Cost follows LZ77 closely (while offering a worst-case guarantee). The LZ-End variants dominate all the trade-off except when LZ77/LZ-Cost are able to use less space. On repetitive collections, LZ-End2 is more than 2.5 times faster than LZ77 at extraction.
Figure 4.8: Extraction speed vs. parsing and sampling size (compression ratio), on different texts: Canterbury plrabn12.txt (size 471KiB; LZ77, LZEnd, LZEnd2, LZCost, LZB, LZ78), Fibonacci Sequence (size 502KiB; LZ77, LZEnd, LZEnd2, LZCost, LZB), PizzaChili XML File (size 282MiB; LZ77, LZEnd, LZEnd2, LZ78), and Wikipedia Einstein (size 357MiB; LZ77, LZEnd, LZEnd2).

Chapter 5
An LZ77-Based Self-Index
In this chapter we describe a self-index based on the LZ77 parsing. It builds on the ideas of the original LZ-based index proposed by Kärkkäinen and Ukkonen [KU96a, Kär99] and the ideas presented by Navarro for reducing its space usage [Nav08]. Our index will be mostly independent of the type of Lempel-Ziv parsing used, and we will combine it with LZ77 and LZ-End. We use compact data structures to achieve the minimum possible space. These structures also allow one to convert the original index into a self-index, so that we do not need the text anymore.

As we will show, the index includes all the structures needed to randomly extract any substring from the text, introduced in the previous chapter. The worst-case time to extract a substring of length ℓ is O(ℓH) for LZ77 and O(ℓ + H) for LZ-End (see Section 4.2.2). Additionally, the proposed index only supports count queries by performing a full locate, and exists queries by essentially locating one occurrence. For these reasons, in the following we focus only on locate queries.

Assume we have a text T of length n, which is partitioned into n′ phrases using an LZ77-like compressor (see Chapter 4). Let P[1, m] be a search pattern. We will call primary occurrences of P those covering more than one phrase; special primary occurrences those ending at the end of a phrase and being completely covered by the phrase; and secondary occurrences those occurrences completely covered by a phrase and not ending at an end of phrase.

Example 5.1. Consider the text ‘alabar a la alabarda$’ and its phrase partition a l ab ar a la alabard a$ (phrase boundaries as in the original figure).
In this example the occurrence of ‘lab’ starting at position 2 (red in the original figure) is primary, as it spans two phrases. The second occurrence, starting at position 14 (blue), is secondary. The occurrence of ‘rd’ starting at position 18 (green) is special primary.

We need to distinguish between these three types of occurrences, as we will first find the primary occurrences (including the special ones), which will then be used to recursively find the secondary ones (which, in turn, will be used to find further secondary occurrences).
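The classification can be made concrete in a few lines of code. This is our own sketch; the phrase end positions of ‘alabar a la alabarda$’ used below are the ones we reconstructed from the chapter's examples, and positions are 1-based as in the text.

```python
from bisect import bisect_left

TEXT = "alabar a la alabarda$"
ENDS = [1, 2, 4, 6, 7, 9, 12, 19, 21]  # 1-based end position of each phrase

def classify(start, length):
    """Classify the occurrence TEXT[start..start+length-1] (1-based)."""
    end = start + length - 1
    first = bisect_left(ENDS, start)   # phrase containing the first character
    last = bisect_left(ENDS, end)      # phrase containing the last character
    if first != last:
        return "primary"               # spans more than one phrase
    if end == ENDS[last]:
        return "special primary"       # inside a phrase, ends at its end
    return "secondary"

# The three occurrences of Example 5.1:
print(classify(2, 3))   # 'lab' at position 2  -> primary
print(classify(14, 3))  # 'lab' at position 14 -> secondary
print(classify(18, 2))  # 'rd' at position 18  -> special primary
```

The bisect over the phrase-end array plays the role that rank on the bitmap B of phrase endings plays in the compressed index.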
By definition, a primary occurrence covers at least two phrases. Thus, each primary occurrence can be seen as P = LR, where the left side L is a suffix of a phrase and the right side R is the concatenation of zero or more consecutive phrases plus a prefix of the next phrase. For this reason, to find this type of occurrence we partition the pattern in two (in every possible way). Then we search for the occurrences of the left part of the pattern among the suffixes of the phrases, and for the right part among the prefixes of the suffixes of the text starting at phrase beginnings. Finally, we need to find which pairs of left and right occurrences actually represent an occurrence of pattern P:

1. Partition the pattern P[1, m] into P[1, i] and P[i + 1, m], for each 1 ≤ i < m.
2. Search for the right part P[i + 1, m] in the prefixes of the suffixes of the text starting at phrases.
3. Search for the left part P[1, i] in the suffixes of phrases.
4. Connect both results, generating all primary occurrences.

To find the right side P[i + 1, m] of the pattern we use a suffix trie (recall Sections 2.9 and 2.14.2) that indexes all suffixes of T starting at the beginning of a phrase. In the leaves of the trie we store the identifier (id) of the phrases. Conceptually, these form an array id that stores the phrase ids in lexicographic order (i.e., the leaves of the suffix trie). As we see later, we do not need to store id explicitly.

Figure 5.1: The suffix trie for the string ‘alabar a la alabarda$’ (shown with its DFUDS encoding, skips, characters and id array). The dark node is the node at which we stop searching for the pattern ‘la’, and the gray leaves represent the phrases that start with that pattern.

We will represent the suffix trie as a labeled tree using DFUDS (Section 2.8).
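The four steps above can be prototyped without any trie at all, replacing both trie searches by brute-force string comparisons (our own sketch; the tries and the range structure of the index only make these steps fast). The phrase partition is again the one we reconstructed from the chapter's examples, with 1-based positions.

```python
TEXT = "alabar a la alabarda$"
ENDS = [1, 2, 4, 6, 7, 9, 12, 19, 21]        # phrase end positions (1-based)
STARTS = [1] + [e + 1 for e in ENDS[:-1]]     # phrase start positions

def primary_occurrences(p):
    """All primary occurrences of p, trying every split p = L + R:
    L must be a non-empty suffix of some phrase and R must follow it."""
    occs = set()
    for i in range(1, len(p)):
        left, right = p[:i], p[i:]
        for k, e in enumerate(ENDS[:-1]):     # candidate phrase k ending L
            s = e - len(left) + 1             # where L would start
            if s < STARTS[k]:
                continue                      # L not contained in phrase k
            if TEXT[s - 1:e] == left and TEXT[e:e + len(right)] == right:
                occs.add(s)
    return sorted(occs)

print(primary_occurrences("lab"))  # -> [2]
print(primary_occurrences("ala"))  # -> [1]   (cf. Example 5.5)
```

Step 2 corresponds to the check that R follows the boundary, step 3 to the suffix check on phrase k, and step 4 to requiring both at the same boundary.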
To search for a pattern we descend through the tree using labeled child (recall Section 2.8), and then discard as many characters of the pattern as the skip of the branch indicates. We continue this process until we reach a leaf, the pattern is completely consumed, or we cannot descend anymore. Our answer is an interval of the array of ids, representing all phrases starting with the pattern P[i + 1, m]. In case we consume the pattern at an internal node, we need to go to the leftmost and rightmost leaves in order to obtain the interval, which is computed using leaf rank and represents the start and end positions in the array of ids.

Example 5.2.
Suppose we are looking for the right pattern ‘la’. Figure 5.1 shows in dark the node at which we stop searching for the pattern, and in gray the phrases that start with that pattern. The answer is the range [8, 9] (i.e., the lexicographical order of the phrases).
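Example 5.2 can be checked directly by materializing the conceptual id array: sort the suffixes that start at phrase beginnings and take the ranks of those starting with the right pattern. This is our own brute-force sketch of what the suffix trie computes, over the phrase partition reconstructed from the chapter's examples.

```python
TEXT = "alabar a la alabarda$"
STARTS = [1, 2, 3, 5, 7, 8, 10, 13, 20]   # 1-based phrase start positions

# Phrase ids (1-based) sorted by the suffix of TEXT starting at each phrase
order = sorted(range(len(STARTS)), key=lambda k: TEXT[STARTS[k] - 1:])
id_array = [k + 1 for k in order]

def right_range(pattern):
    """1-based rank interval of the phrases whose suffix starts with pattern."""
    ranks = [r + 1 for r, k in enumerate(order)
             if TEXT[STARTS[k] - 1:].startswith(pattern)]
    return (min(ranks), max(ranks)) if ranks else None

print(right_range("la"))  # -> (8, 9), as in Example 5.2
print(id_array[5])        # id[6] = 8, as in Example 5.7
```

Note that the byte order used by Python (space before ‘$’ before letters) reproduces the leaf order of Figure 5.1.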
Remark 5.3.
Recall from Section 2.9 that in a PATRICIA tree, after searching for the positions we need to check whether they are actually a match, as some characters are not checked because of the skips. In the example presented above, the answer would have been the same had we searched for any right pattern of the form ‘lx’, where x is a character distinct from a. We use a different method here, which is explained in Section 5.2.3.

We do not explicitly store the skips in our theoretical proposal, as they can be computed from the tree and the text. Given a node in the trie, if we go to the leftmost and rightmost leaves, we can extract the corresponding suffixes until computing how many characters they share. This value is the sum of all the skips from the root to the given node. However, we already know that they share S characters, where S is the sum of all skips from the root to the previous node (i.e., the parent node). Therefore, to compute the skip, we extract the suffixes of both leaves skipping the first S characters. The number of symbols shared by both extracted strings is the skip. Extracting a skip of length s takes at most O(sH) time both for LZ77 and for LZ-End, since the extraction is from left to right and we have to extract one character at a time until they differ. Thus, the total time for extracting the skips as we descend is O(mH).

To find the left part P[1, i] of the pattern we have a trie (actually a PATRICIA trie, Section 2.9) that indexes all the reversed phrases, stored as a compact labeled tree (Section 2.8). Thus, to find the left part of the pattern in the text we need to search for (P[1, i])^rev in this trie. The array that stores the leaves of the trie is called rev_id and is stored explicitly.

The search process and the considerations for this tree are exactly the same as those of Section 5.2.1. The only difference with the suffix trie is that the computation of the skips is faster now.
Our text extraction algorithm works from right to left, and since the text is reversed our algorithm outputs the characters in the correct order. Thus, extracting a skip of length s takes O(sH) time for LZ77 and O(s + H) time for LZ-End. However, in the worst case the total time would still be O(mH), as all skips may be of length 1.

Example 5.4.
Suppose we are looking for the left pattern ‘a’. Figure 5.2 shows in gray the node at which we stop searching for the pattern. In this case we end up in a leaf, so that is the only phrase that ends with the given pattern. The answer is the range [4, 4].

Figure 5.2: The reverse trie for the string ‘alabar a la alabarda$’ (shown with its DFUDS encoding, characters and rev_id array). The gray leaf is the node at which we stop searching for the pattern ‘a’.

In the previous steps we found two intervals, one in the id array and the other in the rev_id array. These intervals represent the sets of phrases where the matches of the right side of the pattern start (id array interval) and the phrases ending with the left side of the pattern (rev_id array interval). Actual occurrences of the pattern are composed of consecutive phrases. Hence, to find the occurrences of the pattern, we need to find which ids in the right interval are consecutive to the rev_ids in the left interval. For doing so we use a range structure (see Section 2.6.1) that connects the consecutive phrases in both trees. Figure 5.3 shows the range data structure connecting both trees for our example string, and below it the sequence that is represented with the wavelet tree.

This structure is built from a permutation π on [1, . . . , n′]. This permutation is just an array containing, for each id (column), the corresponding rev_id (row). In other words, the permutation satisfies id[i] = 1 + rev_id[π(i)]. For our example the permutation array would be {…} (note that we count from left to right and from bottom to top, and that we assume rev_id[0] = 0).

Example 5.5.
Suppose we are looking for the pattern ‘ala’. The possible partitions are (a, la) and (al, a). Figure 5.3 shows in gray the ranges obtained when searching for the left and right parts of partition (a, la). Then we look for all points inside those ranges, obtaining the only primary occurrence, which starts at phrase 1. The same procedure is carried out for the other partition.

Figure 5.3: The range structure for the string ‘alabar a la alabarda$’. The gray circle marks the only primary occurrence of the pattern ‘ala’, and the gray nodes show the ranges defined by the left and right parts of the pattern.
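The connection between the two tries can likewise be prototyped with plain arrays (our own sketch, again over the phrase partition reconstructed from the chapter's examples): build id, rev_id and the permutation π with id[i] = 1 + rev_id[π(i)] (taking rev_id[0] = 0), and report a primary occurrence for every point of π falling inside both search ranges.

```python
TEXT = "alabar a la alabarda$"
STARTS = [1, 2, 3, 5, 7, 8, 10, 13, 20]
ENDS = [1, 2, 4, 6, 7, 9, 12, 19, 21]
PHRASES = [TEXT[s - 1:e] for s, e in zip(STARTS, ENDS)]

suf = sorted(range(9), key=lambda k: TEXT[STARTS[k] - 1:])
rev = sorted(range(9), key=lambda k: PHRASES[k][::-1])
id_arr = [k + 1 for k in suf]     # id[i], 1-based i
rev_id = [k + 1 for k in rev]     # rev_id[j], 1-based j

def pi(i):
    """Row of the point in column i: id[i] = 1 + rev_id[pi(i)], rev_id[0] = 0."""
    target = id_arr[i - 1] - 1    # phrase preceding phrase id[i] in the text
    return 0 if target == 0 else rev_id.index(target) + 1

def join(left, right):
    """Columns of points inside left x right (1-based, inclusive ranges)."""
    return [i for i in range(right[0], right[1] + 1)
            if left[0] <= pi(i) <= left[1]]

# Partition ('a', 'la') of 'ala': left range [5,5] in rev_id, right [8,9] in id
cols = join((5, 5), (8, 9))
print([id_arr[c - 1] for c in cols])  # -> [2]: the right side starts at
                                      # phrase 2, so the occurrence starts at phrase 1
```

In the index, the points of π are stored in a wavelet tree so that this rectangle query costs O(log n′) plus O(log n′) per reported point.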
Remark 5.6.
The range structure allows us to compute id[i] while storing only the rev_id array. Say we want to compute id[i]. We extract the value S[i] from the wavelet tree, giving us the row p where the corresponding reverse id is. Then we compute id[i] = 1 + rev_id[p].

Example 5.7.
Say we want to compute id[6] (i.e., the phrase id of the 6th lexicographically smallest phrase suffix). We extract from the wavelet tree the 6th symbol, getting the value 3. This value is the lexicographical order of the reversed preceding phrase. Computing rev_id[3] = 7, we know that the preceding phrase is phrase number 7. Hence, the 6th phrase is phrase number 8 (i.e., id[6] = 8).

At this stage we also have to validate that the answers returned by the search query are actual occurrences, as the PATRICIA tries by themselves do not guarantee that the pattern found is actually a match (see Remark 5.3). For the first occurrence reported by the range data structure we extract the substring of length m starting at the reported position and check whether it matches the pattern. If so, we can ensure that all the other reported occurrences match the pattern as well; otherwise no occurrence is a match. This process works because all occurrences reported by both tries share all characters, thus all occurrences reported by the range query share all characters. We check the validity of the occurrences at this point because the range check is cheaper than extracting text, and we want to extract text only when a candidate complete occurrence is found.

This structure adds O(log n′) time to the search phase, and O(log n′) time per primary occurrence found.

Note that we are able to answer exists queries with the structures explained so far. If the number of occurrences reported by the range search is greater than zero, then we check whether one of them is an actual match. If there is a match, then the pattern is present in the text.

The special primary occurrences could be found using the same steps explained above for primary occurrences, taking the left part of the pattern as the pattern itself and the right side of the pattern as the empty string ε.
However, we know that looking for ε in the suffix trie will return the complete tree, thus making the search in the range structure unnecessary. For this reason we call this type of occurrence special primary, as we search for them slightly differently from the primary ones. For these occurrences we just need to search for P^rev in the reverse trie.

Since the search for P^rev in the reverse trie gives us a range in the rev_id array, we decided to store that array explicitly instead of the id array. Furthermore, the result of the range search gives us positions in the rev_id array.

From the range structure we obtain the phrase id where an occurrence lies. Then we need to convert it to a real text position. For doing so, we use a bitmap that marks the ends of phrases. This bitmap is the same B used in Chapter 4 for extracting text. Figure 5.4 shows the bitmap for the example string, written below the parsing.

Figure 5.4: The bitmap B of phrases for the string ‘alabar a la alabarda$’.

The conversion between phrase ids and text positions takes constant time, as follows:

• phrase(pos) = 1 + rank(B, pos − 1): phrase to which text position pos belongs.
• first_pos(id) = select(B, id − 1) + 1: position of the first character of phrase id.
• last_pos(id) = select(B, id): position of the trailing character of phrase id.

Recall from Section 4.2.1 that this bitmap also allows us to compute the length of a phrase as length(id) = select(B, id) − select(B, id − 1).

Here we explain some considerations we made when implementing our index.

• Skips: as the average value of the skips is usually very low and computing them from the text phrases is slow in practice, we considered storing the skips, for one or for both tries, using the
Directly Addressable Codes (Section 2.4.1). Note that in this case we never access array id nor rev_id during the trie traversal; they are only accessed when checking and reporting the occurrences.

• Binary Search: instead of storing the trie we can do a binary search over the ids (rev_ids) of the suffix trie (reverse trie). For the suffix trie we do not have the array of ids explicitly, but as shown in Remark 5.6 we can retrieve them using the range structure and the rev_ids array. This alternative modifies the complexity of searching for a prefix/suffix of P to O(mH log n′) for LZ77 or O((m + H) log n′) for LZ-End (actually, since we extract the phrases right-to-left, binary search on the reverse trie costs O(m log n′) for LZ-End). Additionally, we could store the array of ids explicitly, instead of accessing them through the rev_ids. Although this alternative increases the space usage of the index and does not improve the complexity, it gives an interesting trade-off in practice.

Secondary occurrences are found from the primary occurrences and, recursively, from other previously discovered secondary occurrences.
The idea for finding the secondary occurrences is to locate all sources (of the LZ parsing) covering the occurrence, and then map their corresponding phrases to real text positions. To do this we use another bitmap, called the bitmap of sources B_S. The bitmap is built by first writing in unary the number of empty sources (ε) and then, for each position of the text, writing in unary how many sources start at that position. In this way each 1 corresponds to a source, and a 0 represents the position where the sources (1s) immediately preceding it start. Figure 5.5 shows the sources and the corresponding phrases they generate (except the empty sources), and below them the resulting bitmap. Since there are 3 empty sources, the bitmap starts with 111; then, as there are 5 sources starting at position 1, 111110 follows; then just one source starts at position 2, adding 10; and finally there is one 0 for each remaining position.

Figure 5.5: Marking sources on bitmap B_S.

Additionally, we need a permutation P_S connecting the 1s in the bitmap B of phrases (recall Section 5.2.5) to the 1s in the bitmap B_S of sources. The sources starting at a given position are sorted by increasing length, thus the last 1 before a 0 marks the longest source starting at that position. An example is given in Figure 5.6. This permutation replaces the array source of Section 4.2.1.

Figure 5.6: Permutation connecting the bitmap of phrases B (bottom) and the bitmap of sources B_S (top), together with the sources, phrases and depths.

For each occurrence found, we find the position pos of the 0 corresponding to its starting position in the bitmap of sources. Then we consider all the 1s to the left of pos. We convert each source to its target phrase, compute its length, and check whether the source covers the occurrence. If so, we report it as a secondary occurrence and recursively generate all secondary occurrences from this new occurrence.
In case the source does not cover the occurrence, we stop the process and continue processing the remaining occurrences. The algorithm is depicted in Figure 5.7.

secondaryOcc(start, len)
    pos ← select(B_S, start + 1)
    source_id ← pos − start − 1
    while source_id > 0 do
        phrase_id ← P_S⁻¹(source_id)
        source_start ← select(B_S, source_id) − source_id
        if source_start + length(phrase_id) ≥ start + len then
            occ_pos ← first_pos(phrase_id) + start − source_start
            report occ_pos
            secondaryOcc(occ_pos, len)
        else
            return
        source_id ← source_id − 1

Figure 5.7: Searching for secondary occurrences from T[start, start + len] (preliminary version).

Example 5.8.
Consider the only primary occurrence of the pattern ‘la’, starting at position 2. We find the third 0 in the bitmap of sources, at position 12. Then we consider all the 1s from position 11 to the left. The first 1, at position 11, maps to a source of length 2 that covers the occurrence, hence we report an occurrence at position 10. The second maps to a source of length 6 that also covers the occurrence, thus we report another occurrence at position 14. The third maps to a source of length 1, hence it does not cover the occurrence and we stop. We proceed recursively for the secondary occurrences found at positions 10 and 14.

Remark 5.9.
The method explained above is just introductory, as it does not work for general LZ77-like parsings. It only works for parsings in which no source strictly contains another source. It is easy to see that if a source S₁ is strictly covered by another source S₂, some secondary occurrences are lost. Let M be a match of the pattern sought, and let M lie between the rightmost positions of S₁ and S₂. Then, as S₁ is the first source to the left of M, we test whether it covers M, and stop the process. However, S₂ does cover M and produces a secondary occurrence, which is not detected by the algorithm presented above.

Example 5.10.
Let us start with the primary occurrence of the pattern ‘ba’ starting at position 4. The first source to the left is ‘la’, at position 2 and of length 2, which does not cover the pattern. Hence, the algorithm explained above would stop, reporting no secondary occurrences. However, to the left of this source is the source ‘alabar’, which does cover the pattern and generates the secondary occurrence starting at position 16.
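Both Example 5.8 and the failure of Example 5.10 can be replayed with a toy implementation of the preliminary algorithm (our own sketch: plain Python lists stand in for the compressed bitmaps B and B_S, and the parse and source list of ‘alabar a la alabarda$’ are the ones we reconstructed from the chapter's examples; positions are 1-based).

```python
# Phrases (start, length) of the example parse, and the source of each phrase
# ((start, length) of the copied text, or None for an empty source).
PHRASES = [(1, 1), (2, 1), (3, 2), (5, 2), (7, 1), (8, 2), (10, 3), (13, 7), (20, 2)]
SOURCES = [None, None, (1, 1), (1, 1), None, (1, 1), (2, 2), (1, 6), (1, 1)]

# B_S order: empty sources first, then by (start position, increasing length);
# phrase_of_source plays the role of the inverse permutation P_S^-1.
order = sorted(range(len(PHRASES)),
               key=lambda p: (SOURCES[p] is not None,) + (SOURCES[p] or (0, 0)))
phrase_of_source = {i + 1: p + 1 for i, p in enumerate(order)}

def first_pos(pid): return PHRASES[pid - 1][0]
def length(pid):    return PHRASES[pid - 1][1]

def sources_before(pos):
    """Number of 1s of B_S preceding the 0 of text position pos
    (empty sources plus sources starting at positions < pos)."""
    return sum(1 for s in SOURCES if s is None or s[0] < pos)

def secondary_occ(start, occ_len, report):
    """Preliminary algorithm of Figure 5.7."""
    source_id = sources_before(start + 1)
    while source_id > 0:
        pid = phrase_of_source[source_id]
        src = SOURCES[pid - 1]
        # covering test, phrased as in the pseudocode (phrase length = source + 1)
        if src is None or src[0] + length(pid) < start + occ_len:
            return                      # source does not cover: stop (too early!)
        occ = first_pos(pid) + start - src[0]
        report.append(occ)
        secondary_occ(occ, occ_len, report)
        source_id -= 1

occs = []
secondary_occ(2, 2, occs)    # primary occurrence of 'la' (Example 5.8)
print(sorted(occs))          # -> [10, 14]

missed = []
secondary_occ(4, 2, missed)  # primary occurrence of 'ba' (Example 5.10)
print(missed)                # -> []: the secondary occurrence at 16 is missed
```

The second run shows exactly the defect of Remark 5.9: the scan stops at the short source ‘la’ and never reaches ‘alabar’, which motivates the depth-based refinement below.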
Kärkkäinen in his thesis [Kär99] proposes a method for converting the LZ77 parsing into one in which no source contains another. However, we decided not to use it, as it increases the number of phrases excessively. Recall that our index uses space proportional to the number of phrases of the parsing, thus any increase in the number of phrases directly affects the final size of the index.

Another proposal of Kärkkäinen is to separate the sources into levels, so that within a level no source strictly contains another, and then apply the method explained in Section 5.3.1 within each level.
Definition 5.11.
The depth of a source s is defined as

    depth(s) = 0                                    if cover(s) = ∅,
    depth(s) = 1 + max_{s′ ∈ cover(s)} depth(s′)    otherwise,

where cover(s) is the set of all sources containing the source s. Let S₁, S₂ be two sources starting at positions p₁, p₂ and of lengths l₁, l₂. S₁ is said to cover S₂ if p₁ < p₂ and p₁ + l₁ ≥ p₂ + l₂. Note that, by definition, sources starting at the same position are not covered by each other. However, sources ending at the same position may cover each other. For s = ε we define depth(ε) = 0.

Figure 5.8 shows the additional array storing the depths of each source. The four sources ‘a’ and the source ‘alabar’ have depth 0, as all of them start at the same position. Source ‘la’ has depth 1, as it is contained in the source ‘alabar’.

Figure 5.8: The depths of the sources for the string ‘alabar a la alabarda$’ (shown with the sources, phrases and permutation).
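Definition 5.11 translates directly into code. This is our own sketch; the non-empty source list is the one of the running example.

```python
def covers(a, b):
    """Source a = (p1, l1) covers b = (p2, l2) iff p1 < p2 and p1+l1 >= p2+l2."""
    return a[0] < b[0] and a[0] + a[1] >= b[0] + b[1]

def depth(s, sources):
    cover = [t for t in sources if covers(t, s)]
    return 0 if not cover else 1 + max(depth(t, sources) for t in cover)

# Non-empty sources of the example: four copies of 'a', 'alabar', and 'la'
SOURCES = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 6), (2, 2)]
print([depth(s, SOURCES) for s in SOURCES])  # -> [0, 0, 0, 0, 0, 1]
```

As stated in the text, the sources starting at position 1 all get depth 0 (p₁ < p₂ fails for equal starts), and ‘la’ gets depth 1 because ‘alabar’ covers it.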
The process is now similar to the idea presented earlier; however, when we find a source not covering the occurrence we look up its depth d and then consider, to the left, only sources with depth d′ < d, as those at depth ≥ d are guaranteed not to contain the occurrence. This works because at each level the sources to the left end earlier than the current source, by the definition of depth. Moreover, sources at higher depths to the left also end earlier, as they are contained in a source of the current depth to the left.

Now the total running time to find all occ secondary occurrences given a seed occurrence is Ω(ε · occ · L) and O(ε · occ · L + D), where ε is the parameter for computing the inverse permutation P_S⁻¹ (Section 2.7), L is the time to find the next element to the left with depth less than a given value (an operation we consider next), and D is the maximum depth. The additional O(D) cost arises because, in the worst case, after finding the last occurrence we will be in a source of depth D, then move to a source of depth D − 1 that does not yield an occurrence, and so on, down to a source of depth 1.
As explained above, we need to be able to, given a position pos in an array U and a value v, find the rightmost position p preceding pos for which U[p] ≤ v. We will call this query prevLess(U, pos, v).

To solve this query we encode U (i.e., the array of depths of Section 5.3.2) using a wavelet tree (Section 2.6) supporting this additional operation. The algorithm descends according to the bits of the value v. If the value v gets mapped to a 0, we recursively search in the left subtree. If the value v gets mapped to a 1, we recursively search in the right subtree; in this case, as the answer could be on the left side, we also look for the rightmost 0 preceding pos in the bitmap of the wavelet tree node. Finding this takes constant time using rank and select. Finally, we return the maximum of the value returned by the right subtree and the position of the rightmost zero.

The pseudocode of the algorithm is presented in Figure 5.9. The algorithm receives as parameters a wavelet tree tree, a position pos, and a value v, and returns prevLess(array(tree), pos, v), where array(tree) are the values represented by the wavelet tree. The bitmap of the wavelet tree is denoted tree.B. Function toBit returns the side to which the value goes; its output depends on the level.

prevLess(tree, pos, v)
    // toBit depends on the level
    if toBit(v, tree) = 0 then
        lpos ← prevLess(tree.left, rank₀(tree.B, pos), v)
        return select₀(tree.B, lpos)
    else
        rm ← select₀(tree.B, rank₀(tree.B, pos))    // rightmost zero
        lpos ← prevLess(tree.right, rank₁(tree.B, pos), v)
        return max{ rm, select₁(tree.B, lpos) }

Figure 5.9: The prevLess algorithm.

As the algorithm performs only constant-time operations at each level, its total running time is L = O(log D).

If we label each source with its depth, and label the changes to the next text position (the 0s of B_S) with the value D + 1, we can get rid of the original bitmap of sources.
Since the wavelet tree also supports rank and select queries, we have the same functionality as the bitmap of sources, yet with the ability to answer prevLess queries. However, as in practice the bitmap of sources is very sparse, we preferred to use a δ-encoded bitmap to represent it, and the wavelet tree only for the depths.

Using this operation we can now modify algorithm secondaryOcc of Figure 5.7. We keep track of the maximum depth d for which there may still be sources covering the occurrence. When a source does not cover the occurrence, we update the value of d. Using the operation prevLess, we move to the next candidate source. The final algorithm is presented in Figure 5.10.

    secondaryOcc(start, len)
      pos ← select(B_S, start + 1)
      source_id ← pos − start − 1
      d ← D                                   // D is the maximum depth
      while source_id > 0 do
        phrase_id ← sourceToPhrase(source_id) // source-phrase permutation
        source_start ← select(B_S, source_id) − source_id
        if source_start + len(phrase_id) ≥ start + len then
          occ_pos ← first_pos(source_id) + start − source_start
          report occ_pos
          secondaryOcc(occ_pos, len)
        else
          d ← depth[source_id] − 1
          if d < 1 then return
        source_id ← prevLess(depth_tree, source_id, d)

    Figure 5.10: Searching for secondary occurrences from T[start, start + len].

Combining all the steps gives us the total running time to find the occurrences.

• Primary Occurrences: the total time is O(m(Find_m^sst + Find_m^rev + log n′ + Extract_m) + occ log n′), where Find_m^sst is the time to search for a subpattern of length m in the suffix trie, Find_m^rev is the time to search for a subpattern of length m in the reverse trie, and occ is the number of primary occurrences. The time to count the occurrences in the range structure is O(log n′), the total time to locate the primary occurrences in the range structure is O(occ log n′), and Extract_m is the time to extract m characters to verify the PATRICIA searches.
Extract_m, as said in Section 4.2.2, depends on the parsing and is O(mH) for LZ77 and O(m + H) for LZ-End in the worst case. As our experiments show later (Section 6.1), in practice the difference is not as drastic: LZ77 is about 3 times slower (for most texts) for long substrings and not much slower for short substrings. The Find times depend on the structures used:

  – Tries: O(Find_m^sst) = O(Find_m^rev) = O(m + Skips) = O(m + mH) = O(mH) (as the time to compute all skips is O(Extract_m) in the worst case).

  – Tries+Skips: O(Find_m^sst) = O(Find_m^rev) = O(m + Skips) = O(m) (as the skips are stored).

  – Binary Search: O(Find_m^sst) = O(log n′ · Extract_m) if we store the array of ids explicitly; otherwise the time is O(Find_m^sst) = O(log n′ (log n′ + Extract_m)). O(Find_m^rev) = O(log n′ · Extract_m) on LZ77, and O(m log n′) on LZ-End (as the extraction takes constant time per extracted symbol in this case). Here we save the verification of the PATRICIA trees, but this has no effect on the total complexity.

With this, the total time using tries is O(m²H + m log n′ + occ log n′), independent of the parsing. When adding skips the time drops to O(m + mH + m log n′ + occ log n′) on LZ-End. When using, instead, binary searching, the time is O(m(m + H) log n′ + occ log n′) for LZ-End and O(m²H log n′ + occ log n′) for LZ77 if we store the id array explicitly; otherwise the time increases to O(m(log n′ + m + H) log n′ + occ log n′) for LZ-End and O(m(log n′ + mH) log n′ + occ log n′) for LZ77.

• Secondary Occurrences: the total time is O((occ/ε)(log D + D)), where D is the maximum depth and ε is the parameter for the representation of the permutation (Section 2.7).
Recall from Sections 5.3.1 and 5.3.2 that the time to find the secondary occurrences from a seed is O((log D + D)/ε) per occurrence. However, in this case we are recursively locating the secondary occurrences from all the occurrences found, and in the worst case we could pay O(D) for each occurrence without finding new ones. Taking ε = 1/log n′ gives us a total time similar to the one given for the primary occurrences, except that occ log n′ changes to occ · D log n′.

To solve exists queries, we basically search for the first primary occurrence. Hence the total time is as given for the primary occurrences with occ = 0 (the details can be seen in Table 5.1).

5.5 Construction

In this section we explain the construction algorithm of the proposed index. We propose a practical construction algorithm, with bounded space usage and decent times. (See Table 5.2 for a reminder of the definitions of the variables.)
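Throughout the construction, the parsing produces phrases of the form (source position, copy length, trailing character). For intuition, here is a toy greedy LZ77 parser and the decompressor that inverts it; this naive version is quadratic and only illustrates the phrase format, while the actual construction uses CPS2 or suffix trees.

```python
def lz77_parse(text):
    """Greedy LZ77: each phrase is the longest prefix of the remaining text
    that also starts at an earlier position (overlaps allowed), plus one
    trailing character. Naive O(n^2) matching."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        for j in range(i):                       # try every earlier start
            l = 0
            while i + l < n - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decompress(phrases):
    out = []
    for src, length, ch in phrases:
        for k in range(length):                  # char by char: handles overlap
            out.append(out[src + k])
        out.append(ch)
    return ''.join(out)
```

The number of phrases n′ = len(lz77_parse(T)) shrinks as the text becomes more repetitive; for instance, a run of 50 equal characters parses into just 2 phrases.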
1. Alphabet mapping: as we work with standard texts that represent each symbol using 1 byte, we map the byte values to effective alphabet positions, and vice versa.
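A minimal sketch of this mapping (the function name is illustrative, not the thesis code):

```python
def alphabet_maps(text: bytes):
    """Map raw byte values to a contiguous effective alphabet [0, sigma)."""
    symbols = sorted(set(text))                  # byte values actually used
    to_eff = {b: i for i, b in enumerate(symbols)}
    mapped = bytes(to_eff[b] for b in text)
    return mapped, symbols                       # symbols[i] inverts the map
```

Here sigma = len(symbols) is the alphabet size σ used by the index, and the original text is recovered as bytes(symbols[c] for c in mapped).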
2. LZ parsing: for LZ77 we use the algorithm CPS2 of Chen et al. [CPS08], and for LZ-End we use the algorithm described in Section 4.4. At this stage we generate three different files, containing the trailing characters, the lengths of the sources, and the starting positions of the sources. Using suffix trees (Section 2.10), the LZ77 parsing can be computed in O(n) time using O(n) words of space, and the LZ-End parsing in O(N) time using the same space. Additionally, the LZ77 parsing can be computed theoretically in O(n log n (log^ε n + o(log σ))) time using n(H_k(T) + 2) + o(n log σ) bits, and the LZ-End parsing in time O(log n (N log^ε n + n·o(log σ))) using n(H_k(T) + 2) + n′ log n + o(n log σ) bits (Section 4.4). The practical algorithms take total time O(n log n) for LZ77 and O(N log n) for LZ-End (recall Section 4.4). The space usage is around 5.n bytes for LZ77 and 9n bytes for LZ-End, and this is the peak space usage of the self-index construction. In the index we only store explicitly the trailing characters, using n′ log σ bits.

3. Bitmap of phrases: this bitmap is easily computed from the array containing the lengths of the sources, in time O(n′). It uses n′ log(n/n′) + O(n′ + n log log n / log n) bits (Section 2.5). In practice we use δ-encoded bitmaps, using n′ log(n/n′) + O(n′ log log n + (n′ log n)/s) bits, where s is the sampling step. All query times are then multiplied by O(s + log(n′/s)). For the analysis we will assume s = log n′.

4. Suffix trie and reverse trie: for constructing these trees, we decided to insert all indexed substrings into a PATRICIA trie. This is O(n) time for the reverse trie, but it could be quadratic for the suffix trie (there are complex O(n)-time algorithms for building suffix tries [IT06]).
In practice this does not happen and the running time is good, as the number of phrases n′ generated by the parsing is relatively small. Of course, we insert and store pointers to the text in the trie, rather than the whole strings. From the PATRICIA trie, we can extract the sorted ids, the skips, and the DFUDS representation of the tree in time O(n′). Each tree will have at most 2n′ nodes, hence they require 4n′ + o(n′) bits (Section 2.8) for the topology of the tree, plus 2n′ label characters encoded using 2n′ log σ bits. Additionally, the rev_ids are stored using n′ log n′ bits.

5. Range structure: to build the range structure we need a permutation from the ids of the suffix trie to the ids of the reverse trie. This is done in O(n′) time, inverting the permutation of the ids and then traversing the rev_ids and assigning each to the corresponding id. Then, the range structure is built starting from the permutation, in time O(n′ log n′). It uses n′ log n′ + O(n′ log log n′) bits (Section 2.6.1).

6. Sources depths: to compute the structures related to secondary occurrences, we first need to compute the depth of each source. First we sort all sources by increasing starting position, breaking ties by decreasing length. This way we know that all parents of a source are to its left. We keep track of the rightmost source of depth d, for each possible depth. Then, for each source, we binary search these rightmost sources and find the deepest depth d at which some source covers the current source. Afterward, we set the current source as the rightmost source of depth d + 1. The running time of the algorithm is O(n′ log n′).
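The sort-and-binary-search of step 6 can be sketched as follows. It relies on the invariant that the ends of the rightmost sources, indexed by depth, are non-increasing, so a binary search over them finds the deepest covering depth; this is a hypothetical standalone version, tested against brute-force nesting depths, assuming distinct source intervals.

```python
def source_depths(sources):
    """Depths of sources given as (start, end) pairs: depth = 1 + depth of
    the deepest source covering this one. Sort by start asc, length desc;
    keep, per depth, the end of the rightmost source seen so far."""
    order = sorted(range(len(sources)),
                   key=lambda i: (sources[i][0], -sources[i][1]))
    ends = [float('inf')]        # ends[d]: end of rightmost source of depth d
    depth = [0] * len(sources)
    for i in order:
        e = sources[i][1]
        lo, hi = 0, len(ends) - 1
        while lo < hi:           # deepest d with ends[d] >= e (non-increasing)
            mid = (lo + hi + 1) // 2
            if ends[mid] >= e:
                lo = mid
            else:
                hi = mid - 1
        depth[i] = lo + 1
        if depth[i] == len(ends):
            ends.append(e)
        else:
            ends[depth[i]] = e
    return depth
```

This is O(n′ log n′) overall: one sort plus one binary search per source.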
7. Prev-less depth structure: this structure is constructed in O(n′ log D) time, as it is just a wavelet tree. It uses n′ log D + O(n′ log log D) bits (Section 5.3.3).

8. Source-phrase permutation: it takes O(n′) time to build, starting from the ids of the sorted sources. It is stored using (1 + ε) n′ log n′ + O(n′) bits (Section 2.7), and as we set ε = 1/log n′, the total space is n′ log n′ + O(n′) bits.

9. Bitmap of sources: it takes O(n′) time to build from the starting positions of the sorted sources. It uses n′ log(n/n′) + O(n′ + n log log n / log n) bits (Section 2.5). In practice we use δ-encoded bitmaps, so the same considerations as for the bitmap of phrases apply.

Adding up the space of all the structures, we get that the index requires 2n′ log n + n′ log n′ + n′ log D + O(n′ log σ + n log log n / log n) bits of space, which in our practical implementation is 2n′ log n + n′ log n′ + n′ log D + O(n′ log σ + n′ log log n) bits, plus the skips we store. Note that in the case of binary searching we do not use tries, yet the asymptotic space complexity is not reduced. Note also that our practical index space is fully proportional to n′, depending on n only logarithmically.

For the construction time and space of the index we have given practical figures. We now give two trade-offs for the theoretical upper bounds. The first, Theory 1, uses the least possible construction time, and the second, Theory 2, the least possible construction space.

• Theory 1: the space is dominated by the O(n log n) bits needed to build the parsing. All construction times are O(n′ log n′), except the parsing and creating the PATRICIA trees.
Hence, the index is built in time O(n + n′ log n′) (O(N + n′ log n′) for LZ-End) using O(n log n) bits.

• Theory 2: in Section 4.4 we showed that the LZ77 parsing can be computed in O(n log n (log^ε n + o(log σ))) time using O(n H_k(T)) + o(n log σ) bits, and the LZ-End parsing in time O(log n (N log^ε n + n·o(log σ))) using the same space. All data structures (except the PATRICIA trees) are constructed as explained above, each structure requiring at most O(n′ log n) = O(n H_k(T)) + o(n log σ) bits (recall Lemma 4.13) and O(n′ log n′) time. In the following we show that we can build the PATRICIA trees in time O(n′ log^ε n) using O(n H_k(T)) + o(n log σ) bits. The idea is similar to that presented by Claude and Navarro [CN10], enhanced with some ideas from Russo et al. [RNO08].

1) First we build the FM-index [FM05] of T, in O(n log n log σ / log log n) time, within n H_k(T) + o(n log σ) bits of space. 2) Then, we build the Fully-Compressed Suffix Tree (FCST) [RNO08], which supports all tree operations in O(log^ε n) time. For building the FCST, we simulate a traversal of the suffix tree starting from the root using Weiner links [Wei73], which are in turn simulated using the LF mapping (see Section 2.12) over the FM-index. During the traversal we mark all nodes whose depth is a multiple of δ in the implicit tree defined by the Weiner links, where δ is the space/time trade-off parameter of the FCST (which takes o(n log σ) bits of space for δ = log^ε n). These nodes are stored in a simple array, using o(n log σ) bits, with all the information required by the FCST construction algorithm. The running time of this process is O(n log σ / log log n) and the space is o(n log σ) bits. 3) We use a dynamic balanced tree to mark some of the nodes of the FCST; these will be the nodes of our PATRICIA tree.
For each phrase starting position, we convert it using A⁻¹ into an FM-index position, and then selectLeaf (which gives the i-th leaf) converts it into a position in the FCST. We mark in the balanced tree the preorder of the FCST node, as well as the phrase id. Then, we traverse the balanced tree from left to right, computing LCA(x_i, x_{i+1}) (where x_i is the current node, x_{i+1} is the node to its right, and LCA is their lowest common ancestor in the FCST) and inserting the value in the tree. To build the PATRICIA tree, we traverse the balanced tree again from left to right, creating the PATRICIA nodes and generating the parentheses representation and the labels of the edges. Using the operation letter of the FCST (which gives any letter of the path leading to a node) we retrieve the labels. Using the FCST we determine the topology of the tree: we keep the current PATRICIA path in a stack; we add closing parentheses and pop the stack until the top of the stack is an ancestor of the new node; and then we add the opening parenthesis for the current node and push it onto the stack. This step runs in O(n′ log^ε n) time and within O(n′ log n) = O(n H_k(T)) + o(n log σ) bits of space (recall Lemma 4.13).

Hence, the total time is O(n log n log σ / log log n + n′ log^ε n) and the total space is O(n H_k(T)) + o(n log σ) bits. The process for constructing the reverse trie is almost the same, but now we do not consider whole suffixes, because they are limited by the phrase length.
Given A⁻¹(pos), we use the operation LAQ_S(d) of the FCST (which retrieves the ancestor with string depth d) to find which node we need to mark.

As the time for constructing the PATRICIA trees is dominated by the parsing algorithm, the total space required is O(n H_k(T)) + o(n log σ) bits and the total running time is O(n log n (log^ε n + o(log σ))) for LZ77 and O(log n (N log^ε n + n·o(log σ))) for LZ-End.

5.6 Summary

We have presented a self-index that, given a text of length n, parsed into n′ phrases by a Lempel-Ziv-like parsing, uses space proportional to that of the compressed text, i.e., O(n′ log n) + o(n) bits. Table 5.1 summarizes the space and time of the operations over the index, and Table 5.2 summarizes all the parameters of the index. In practice, due to our sparse bitmap representation, the o(n) bits disappear from the space, but the times of extract, exists and locate are multiplied by O(log n′).

  Construction time:
    Theory 1: O(n + n′ log n′) for LZ77 and O(N + n′ log n′) for LZ-End.
    Theory 2: O(n log n (log^ε n + o(log σ))) for LZ77 and O(log n (N log^ε n + n·o(log σ))) for LZ-End.
    Practice: O(n log n) for LZ77, O(N log n) for LZ-End.
  Construction space:
    Theory 1: O(n log n) bits.
    Theory 2: O(n H_k(T)) + o(n log σ) bits.
    Practice: ≈ 5.n bytes for LZ77, ≈ 9n bytes for LZ-End.
  Index space:
    Theory: 2n′ log n + n′ log n′ + n′ log D + O(n′ log σ + n log log n / log n) bits.
    Practice: 2n′ log n + n′ log n′ + n′ log D + O(n′ log σ + n′ log log n) bits.
  Extract time: LZ77: O(mH); LZ-End: O(m + H).
  Exists time:
    Tries: O(m²H + m log n′); with skips and LZ-End: O(m + mH + m log n′).
    Binary search, using n′ log n′ additional bits: LZ77: O(m²H log n′); LZ-End: O(m(m + H) log n′)
    Otherwise: LZ77: O(m(log n′ + mH) log n′); LZ-End: O(m(log n′ + m + H) log n′).
  Locate time:
    Tries: O(m²H + m log n′ + occ · D log n′); with skips and LZ-End: O(m + mH + m log n′ + occ · D log n′).
    Binary search, using n′ log n′ additional bits: LZ77: O(m²H log n′ + occ · D log n′); LZ-End: O(m(m + H) log n′ + occ · D log n′).
    Otherwise: LZ77: O(m(log n′ + mH) log n′ + occ · D log n′); LZ-End: O(m(log n′ + m + H) log n′ + occ · D log n′).

Table 5.1: Summary of the LZ77-index. Adding skips requires at most 4n′ log n more bits, but far less in practice. In practice, times are multiplied by O(log n′).

  σ   size of the alphabet                            Section 2.1
  n   length of the text                              Section 2.1
  n′  length of the LZ parsing                        Definitions 2.30 and 4.2
  m   length of the pattern                           Section 2.2
  s   sampling step of the δ-encoded bitmaps          Section 2.5.2
  ε   parameter of the permutation                    Section 2.7
  D   maximum depth of the sources                    Section 5.3.2
  H   height of the LZ parsing                        Definition 4.5
  N   total text retraversed in the LZ-End parsing    Section 4.4

Table 5.2: Summary of the parameters of the LZ77-index.

Chapter 6: Experimental Evaluation
6.1 Experimental Setup

In our tests we compared the proposed index against RLCSA [NM07]. We did not test the SLP-index [CN09] because we could not make it run consistently on our collections, yet some comparison results can be inferred from their experimental evaluation [CFMPN10], as we do in Section 6.2.

We used in our experiments the LZ77 and the LZ-End parsings. For the LZ indexes we used the following variants (defined in Section 5.2.6), ordered by decreasing space requirement. In all variants we stored the skips of the trees using DACs (Section 2.4.1), because not using them led to results worse than using binary search (our slowest variant).

1. Suffix trie and reverse trie (original proposal).
2. Binary search on ids with explicit ids, and reverse trie.
3. Binary search on reverse ids, and suffix trie.
4. Binary search on ids with explicit ids, and binary search on reverse ids.
5. Binary search on ids with implicit ids, and binary search on reverse ids.

Recall from Section 5.2.6 that the array of ids is not stored in the index, only the array of reverse ids. Thus, if we want to binary search over the ids we have two alternatives: (1) spend n′ log n′ additional bits to store the array of ids explicitly, or (2) use Remark 5.6 to access the array implicitly, paying O(log n′) access time. The index variants with explicit ids refer to alternative (1), and the ones with implicit ids refer to alternative (2). The alternatives using the suffix trie do not need to access the id array, but rather the reverse id array, which is always maintained in explicit form.
Remark 6.1. The reader may note that the results concerning the alternative

• Binary search on ids with implicit ids, and reverse trie

are not present. We omit the empirical results of this alternative because its compression ratio is about the same as that obtained with alternative 3, while its locate performance is noticeably worse. Remember that accessing the implicit array of ids takes O(log n′) time.

The parameters used for the data structures are as follows: s = 16 for the δ-codes bitmaps (Section 2.5.2), ε = 1/32 for the permutation (Section 2.7), and sampling step b = 20 for the bitmaps of González et al. (Section 2.5.1). We used these parameter values as they are the typical ones used in experimentation; additionally, with these values our indexes achieve a good space/time trade-off.

For RLCSA we used sampling steps 512, 256, 128 and 64. The index was built using a buffer of 100 MiB.

All our experiments were conducted on a machine with two Intel Xeon CPUs running at 2.00 GHz, with 16 GiB of main memory. The operating system is Ubuntu 8.04.4 LTS with kernel 2.6.24-27-server. The compiler used was g++ (gcc version 4.2.4), executed with the -O3 optimization flag.

We present the results obtained for the following texts (the results of only one text from each collection are presented in this section; the remaining results are presented in Appendix A):

• Artificial: F, R, T.
• Pseudo-Real: DNA 0.1% (Scheme 1), Proteins 0.1% (Scheme 1), English 0.1% (Scheme 2), Sources 0.1% (Scheme 2).
• Real: Para, Cere, Influenza, Escherichia Coli, Coreutils, Kernel, Einstein (en), Einstein (de), World Leaders.
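To illustrate the role of the parameter s above, here is a toy version of a δ-encoded sparse bitmap: the gaps between consecutive 1s are written in Elias-δ, and every s-th 1 is sampled (bit offset plus absolute position), so that select decodes at most s gap codes. This is a sketch of the idea, not the implementation evaluated here.

```python
def delta_code(x):                      # Elias-delta code of x >= 1, bit list
    n = x.bit_length()
    gamma = [0] * (n.bit_length() - 1) + [int(b) for b in bin(n)[2:]]
    return gamma + [int(b) for b in bin(x)[3:]]   # low n-1 bits of x

def delta_decode(bits, pos):            # returns (value, next bit position)
    z = 0
    while bits[pos + z] == 0:
        z += 1
    n = 0
    for b in bits[pos + z: pos + 2 * z + 1]:
        n = 2 * n + b
    pos += 2 * z + 1
    x = 1
    for b in bits[pos: pos + n - 1]:
        x = 2 * x + b
    return x, pos + n - 1

class DeltaBitmap:
    def __init__(self, ones, s):        # ones: sorted positions of the 1s
        self.s, self.bits, self.samples = s, [], []
        prev = -1
        for k, p in enumerate(ones):
            if k % s == 0:              # sample every s-th 1
                self.samples.append((len(self.bits), prev))
            self.bits.extend(delta_code(p - prev))
            prev = p

    def select1(self, k):               # position of the (k+1)-th 1
        off, pos = self.samples[k // self.s]
        for _ in range(k % self.s + 1):
            g, off = delta_decode(self.bits, off)
            pos += g
        return pos
```

A larger s shrinks the sample overhead but makes each select decode up to s gap codes, which is the multiplicative slowdown on query times mentioned for the δ-encoded bitmaps of the index.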
We restricted the experiments to the texts listed above since many of the texts produced similar results during preliminary experiments. For the DNA and Wiki domains, we chose the largest texts. For the pseudo-real texts we kept the DNA, Proteins, English and Sources texts, as these kinds of texts naturally form repetitive collections.

Two types of experiments were carried out. One, labeled "results (1)" in Figures 6.2-6.7 and A.1-A.26, considers the time of the operations as a function of the pattern length |P|. The second, labeled "results (2)" in Figures 6.2-6.7 and A.1-A.26, shows the space/time trade-off of the operations. Although we do not show a space/time tuning for the LZ77/LZ-End index, the plots of the figures labeled "results (2)" show a line formed by 5 points; these refer to variants 1-5 described above. The results are presented and discussed in Section 6.1. The experiments conducted are the following:

• Construction time and space: we present the build time for each index as well as the peak memory usage. Results are given in Figure 6.1.

• Compression ratio: we present the compression ratio of different self-indexes. We show alternatives 1 and 5 of our indexes, which are respectively the largest and the smallest variants. For RLCSA we show the space achieved with a sampling step of 512, and without the samples, which is the lowest space reachable by that index. For the LZ78-index [ANS06] (Section 2.14.3) we used ε as the sampling step of the permutation. Additionally, we show the compression ratio of the ILZI [RO08] (Section 2.14.3). The results are shown in Table 6.1, where we also include p7zip as a baseline.

• Structures space: we present the space usage of the different data structures used in our indexes. The results are given in Tables 6.4 and 6.5, as percentages of the size of the index.

• Parsing statistics: we present the values of D and H, which affect the performance of our indexes. The results are displayed in Tables 6.2 and 6.3.
• Extraction speed: we extracted 10,000 substrings of length 2^i, for a range of values of i. We show only one line for the LZ77 and the LZ-End indexes, as all the variants have the same extraction speed. See the top-left plots of Figures 6.2-6.7 labeled "results (1)", which are representative of all the results (the rest are in Appendix A, in Figures A.1-A.26). We also show the space/time trade-off of the indexes for extracting a snippet of the length at which extraction times per character stabilize. See the top-left plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)".

• Search time: we located 1,000 patterns of length 10, 15, and 20. We limited the number of occurrences reported to 30,000. See plots 2-4, in reading order, of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)". We also located patterns of increasing length, from 5 to 40. In this case we only show the results for alternatives 1 (original proposal) and 5 (minimum space) of both LZ77 and LZ-End. See the top-right plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)".

• Locate time: we located 1,000 patterns of length 2 and 4. This test highlights the time needed to find the occurrences in our indexes, as it dominates the time for traversing the tries. We limited the number of occurrences reported to 100,000. See plots 5-6 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)".

• Exists time: we generated 20,000 patterns of lengths 5, 10, 20, 40 and 80; half of them were present in the text and the other half were random concatenations of symbols of the text. For RLCSA we check existence using a count query; for this reason, we only show one line for RLCSA, as count time is independent of the sampling step.
The exists query of the LZ77 index is basically the search for a primary occurrence, and thus it illustrates the time for traversing the tries. We only show the results for alternative 1 (original proposal) of both LZ77 and LZ-End, since the other alternatives are orders of magnitude slower for this query. See the bottom-left plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)" for existing patterns, and the bottom-right plot for non-existing patterns. We also show the space/time trade-off of these two queries for patterns of length 20; see plots 7-8 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)".
[Figure 6.1: Construction time and space for the indexes. Times are in seconds per MiB and spaces are the ratio between construction space and text space. Left panel: index construction time (s/MiB); right panel: index construction space (ratio of original text); series RLCSA, LZ77 and LZ-End, over all the test collections.]
[Table 6.1: Compression ratio (given as a percentage of the original file size) of different self-indexes: LZ78, ILZI, RLCSA (two variants), LZ77 (two variants), LZ-End (two variants), and p7zip as a baseline. In bold are highlighted the LZ-based indexes outperforming the best compression achievable by RLCSA.]

[Table 6.2: D value (i.e., maximum depth) and mean depth for the LZ indexes.]

[Table 6.3: H value (i.e., maximum extraction cost) and mean extraction cost for the LZ indexes.]

[Table 6.4: Detailed space of the LZ77 index structures (trailing characters, bitmap B, suffix trie, suffix trie skips, reverse trie, reverse trie skips, rev ids, range structure, depths, source-phrase permutation, bitmap of sources). Values are percentages of the total size.]

[Table 6.5: Detailed space of the LZ-End index structures, with the same columns as Table 6.4. Values are percentages of the total size.]
[Figure 6.2: T results (1). Panels: extraction speed (Mchars/s vs. log snippet length), locate time vs. pattern length, and exist time for patterns found and not found; series RLCSA (several sampling steps), LZ77 and LZ-End. Note the logscales.]
[Figure 6.3: T results (2). Panels: extract time, locate time for |P| = 10, 15, 20, 2 and 4, and exist time for patterns found and not found with |P| = 20, all plotted against compression ratio; series RLCSA, LZ77 and LZ-End. Note the logscales.]
[Figure 6.4: DNA 0.1% results (1). Panels: extraction speed, locate time, and exist time for patterns found and not found; series RLCSA, LZ77 and LZ-End. Note the logscales.]
[Figure 6.5: DNA 0.1% results (2). Panels: extract time, locate time for |P| = 10, 15, 20, 2 and 4, and exist time for patterns found and not found with |P| = 20, against compression ratio; series RLCSA, LZ77 and LZ-End. Note the logscales.]
[Figure 6.6: Kernel results (1). Panels: extraction speed, locate time, and exist time for patterns found and not found; series RLCSA, LZ77 and LZ-End. Note the logscales.]
[Figure 6.7: Kernel results (2). Panels: extract time, locate time for |P| = 10, 15, 20, 2 and 4, and exist time for patterns found and not found with |P| = 20, against compression ratio; series RLCSA, LZ77 and LZ-End. Note the logscales.]

6.2 Analysis of the Results
It can be seen from the results presented above that our indexes are competitive with RLCSA and in most cases show better space/time trade-offs.

Figure 6.1 shows that our indexes are built more efficiently than RLCSA. The space needed to build the LZ77 index is about 60% of that of RLCSA, and for the LZ-End index it is about 85%. The construction time for LZ77 is about 35% of that of RLCSA, yet for LZ-End it is about 185% (i.e., slower). Our space occupancy during construction is a great advantage over RLCSA, as it allows us to build the index for larger texts using the same resources. We could reduce the construction space further by sacrificing construction time; recall Section 4.4.

The compression ratio of our indexes is usually superior to that of RLCSA (see Table 6.1). When considering the lower bound of RLCSA, which only supports count and exists queries, our smallest index compresses better than RLCSA on all texts except Influenza, Coreutils and World Leaders. When considering RLCSA with a sampling step of 512, our compression is better on all texts except Coreutils. From now on we compare our indexes against the RLCSA with sampling step 512. The compression difference is most noticeable on the artificial texts, where our compression is 100-1000 times better than RLCSA's. For the DNA collections Para and Cere our best compression (always achieved with LZ77) takes about 45% of RLCSA's space, yet it rises to 70-80% for Influenza and Escherichia Coli. The space is also 80% on World Leaders. On Kernel our space is 60% of RLCSA's, but the two are almost the same on Coreutils. On the Wikipedia articles our space is 20-30% of that of RLCSA. LZ-End needs more space than LZ77, losing to RLCSA on Influenza, Escherichia Coli, Coreutils and World Leaders. For the pseudo-real collections the compression ratio of LZ77 is about 60% of that of RLCSA, and even the alternative using more space compresses better than RLCSA.
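The magnitude of these ratios reflects how the LZ77 parsing works: each phrase is the longest prefix of the remaining text that already occurred anywhere before, plus one fresh symbol, so a collection made of near-identical copies collapses into very few phrases no matter how far apart the copies are. The following is a minimal illustrative sketch (a naive quadratic-time parser, not the implementation used in this thesis):

```python
# Illustrative only: a naive O(n^2) LZ77 parser and its decoder.
# Copies may overlap the current position (self-referential phrases),
# as in the standard LZ77 definition.

def lz77_parse(text):
    """Return LZ77 phrases as (source position, copy length, symbol)."""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # naive scan for the longest earlier match
            l = 0
            while i + l + 1 < n and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

def lz77_decode(phrases):
    out = []
    for src, length, sym in phrases:
        for k in range(length):      # symbol-by-symbol copy handles
            out.append(out[src + k])  # self-referential sources
        out.append(sym)
    return "".join(out)

doc = "abracadabra" * 100   # a highly repetitive "collection"
print(len(lz77_parse(doc)))  # prints 7: 1100 symbols, only 7 phrases
```

The last phrase alone covers almost the whole collection, because its source is allowed to be the entire first copy, arbitrarily far back. A fixed-order statistical compressor, in contrast, cannot exploit repeats longer than its context.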
It can also be seen, in the plots of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)", that in most cases the space/time trade-off is in favor of LZ77/LZ-End. Our LZ77 indexes use 2.63-7.52 times the space of p7zip and our LZ-End indexes use 4.57-23.03 times that space, depending on the type of text and on the number of structures we use (e.g., tries). Finally, the compression ratios of LZ78-based indexes are not competitive at all (note that ILZI's maximal parsing performs better than LZ78).

We expected to improve on the compression of RLCSA for highly repetitive texts, since LZ77 is more powerful at detecting repetitions (see Lemma 4.1 and Section 2.14.1). For the artificial texts, the most repetitive ones, this situation is even clearer. For such texts, most of the space of the RLCSA corresponds to the samplings, which are difficult to compress [MNSV10].

Tables 6.2 and 6.3 show that D is generally moderate (below 42), and that the greatest extraction cost is also moderate (at most 257 steps in LZ-End and at most 98 steps in LZ77, except for the Wikipedia texts). When taking into account the mean values of the depth and the extraction cost, the values decrease noticeably: the average extraction cost is below 25 steps for LZ-End and below 32 steps for LZ77 (again except for the Wikipedia texts).

Tables 6.4 and 6.5 show that the suffix trie and the reverse trie use more space than the rest of the data structures. Then the range structure, the reverse ids and the permutation use roughly the same space (they require n′ log n′ + o(n′ log n′) bits). For LZ-End, the space needed to represent both bitmaps increases noticeably, even exceeding that of the range structure or the permutation. Finally, the skips use only 11-13% of the total space, and the wavelet tree of depths only 2-4%.
Additionally, for the artificial texts, more than 90% of the space is used to represent the suffix trie and the reverse trie. This is because our implementation of DFUDS stores a boosting table of constant size 1 KiB, and for these texts the number of phrases n′ is less than 100, so the space of the table is considerably larger than that of the remaining data structures. We also note that our implementation stores the labels of the tree using 1 byte per symbol.

The extraction time of the LZ-End index is better than that of RLCSA on all texts, being at least twice as fast, and up to 10 times faster for short passages. The LZ77 index extracts substrings of length up to 50 faster than RLCSA. When taking into account the space/time trade-offs (top-left plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)"), our indexes improve upon RLCSA both in extraction time and in space, dominating the curve defined by RLCSA, except on the texts where LZ-End loses in space, in which case neither dominates. The extraction space/time trade-off is better than that of RLCSA because, since RLCSA cannot compress the sampling, it has to use a sparse sampling to be competitive in space.

The performance of locate queries is related to the pattern length, because in our indexes the locating cost is quadratic in the pattern length (this is the worst case, though; in practice many searches are abandoned earlier). Plots 2-6 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)" show that for patterns of length 2 or 4 all of our indexes are significantly faster than RLCSA. This is because our indexes are much faster at locating each occurrence, and this cost dominates for short patterns. However, as the pattern length increases, the increase in cost is noticeable for the alternatives using binary search, which are those using least space.
In contrast, alternative 1, which uses tries, shows a time basically independent of |P| (except somewhat on DNA 0.1% and Escherichia Coli). This is also seen in the top-right plot of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)". The RLCSA time is almost insensitive to |P|, thus in several cases it becomes faster for longer patterns (which also have fewer occurrences, at reporting which RLCSA is slower).

By analyzing the performance results of SLPs [CFMPN10] it is clear that the compression ratio of SLPs (at least when using Re-Pair to create the grammar) is worse than that of RLCSA. For the DNA collections (Para, Cere and Influenza) their compression ratio is more than twice that of the LZ77 indexes. Furthermore, the locate time of SLPs is only comparable to that of RLCSA for short patterns.

The exists queries are solved consistently faster by RLCSA than by our indexes. Looking at plots 3 and 4 of Figures 6.2-6.7 and A.1-A.26 labeled "results (1)", we notice that our larger variants are comparable to RLCSA, although always slower, for patterns present in the text. The difference widens for non-existent patterns, where RLCSA improves more sharply. Moreover, in our indexes the time increases with the pattern length, as opposed to RLCSA, where the time is practically constant when the pattern does not exist. Plots 7 and 8 of Figures 6.2-6.7 and A.1-A.26 labeled "results (2)" show the trade-off of exists queries. They show that binary search is not an alternative if we are interested in this type of query: for patterns present in the text, binary searching takes about 10 times longer, and for patterns not present about 10-1000 times longer, than using the tries.

Chapter 7: Conclusions

We have presented a new full-text self-index based on the Lempel-Ziv parsing.
This index is especially well suited for applications in which the text is highly repetitive and the user is interested in finding patterns in the text (locate) and accessing arbitrary substrings of it (extract). Our indexes provide a much better space/time trade-off than the previous ones for these operations.

The compression ratio of our indexes is more than 10 times better than that of previous indexes based on LZ78, which are shown to be inappropriate for very repetitive texts. Additionally, the compression ratio of our smallest index is, for almost all texts (13 out of 16), better than the lower bound achievable using RLCSA [SVMN08], the best previous self-index for these texts. We could not find any characteristic that explains why the compression ratio of RLCSA is superior to that of our indexes on the remaining texts. When compared to the smallest practical RLCSA, the compression of our index is better for all texts except one, usually by a factor of at least 2. Compared to pure LZ77 compression, our index usually takes 3-6 times the space achieved by p7zip.

We also introduced a new LZ-parsing called LZ-End, which is close to LZ77 in compression ratio and gives faster access to text substrings. The extraction speed when using LZ-End is always better than that of RLCSA, and the extraction speed of our LZ77-based index is also superior for small substrings. Our indexes are always better for locating the occurrences of short patterns (of length up to 10), and the results are mixed for longer ones. This is because our locating time is quadratic and also depends on the extraction speed, which behaves differently depending on the text.
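The defining restriction of LZ-End can be stated in a few lines of code: the copied part of each phrase must be a suffix of the text ending exactly at the end of some previous phrase, which is what allows a phrase suffix to be extracted without recursing into the middle of other phrases. The sketch below is a naive illustrative parser under that definition, not the construction used in this thesis:

```python
# Illustrative only: a naive LZ-End parser and decoder. Each phrase is
# (source phrase id, copy length, explicit symbol), where the copy is a
# suffix of the text ending at the boundary of the source phrase.

def lzend_parse(text):
    phrases, ends = [], []        # ends[p] = exclusive end of phrase p
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for p, e in enumerate(ends):
            max_l = min(e, n - 1 - i)   # leave room for the new symbol
            for l in range(max_l, best_len, -1):
                if text[e - l:e] == text[i:i + l]:
                    best_len, best_src = l, p
                    break
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
        ends.append(i)
    return phrases

def lzend_decode(phrases):
    out, ends = [], []
    for src, length, sym in phrases:
        if length:
            e = ends[src]             # the copy ends at a phrase boundary
            out.extend(out[e - length:e])
        out.append(sym)
        ends.append(len(out))
    return "".join(out)

text = "a" * 50                       # a short repetitive example
assert lzend_decode(lzend_parse(text)) == text
```

Because every source ends at a phrase boundary, decoding the tail of a phrase needs only the identifier of its source phrase, not an arbitrary text offset; the price is that the constrained parse may produce somewhat more phrases than LZ77.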
The only operation for which RLCSA is consistently better than our indexes is answering whether a pattern is present in the text (exists), the difference being even more pronounced for non-existent patterns. Similarly, our indexes cannot count the number of occurrences without locating them all, whereas the RLCSA can do this very fast. Nevertheless, it has been argued [AN10] that these two queries are used in much more specific applications, serving as a basis for complex tasks such as approximate pattern matching or text categorization, while extracting and locating are the most important queries for general applications.

An interesting goal for future research would be to reduce the quadratic dependence of the locate query time on the pattern length m to just linear. This improvement would make our index even more attractive. It has been achieved for other LZ-based indexes [AN07, RO08], yet these are built on LZ parsings that are too weak for very repetitive texts.

Another line of research would be to design new LZ-like compression schemes allowing fast decompression of random substrings of the LZ parsing. Note that our only trade-off related to extraction speed is the use of LZ-End instead of LZ77, and even LZ-End takes constant time per extracted symbol only in certain cases. In a recent work, Kuruppu et al. [KPZ10] use a single document as the dictionary for the LZ77 algorithm, storing that source document in plain form. This method is a heuristic and works well only when the documents of the collection are not successive versions, as in collections of DNA. However, even the compression they achieve for the DNA collections (Para and Cere) is almost the same as that of our best LZ77 variant; yet we have a self-index while they are only able to extract text, although their extraction times are more than 100 times faster than ours.
Nevertheless, this method is orthogonal to our index proposal, and we could build our self-index on top of their compression scheme.

Another important line of research is to devise an LZ parsing algorithm that uses space proportional to that of the final compressed text. Currently, building the LZ77 parsing requires about six times the space of the original text. Although this is less than what RLCSA needs, it is still too much to handle very large text collections. Alternatively, a parsing algorithm working in secondary storage would also be useful for very large collections. We are aware that this is a highly challenging task, as the parsing process is strongly related to dynamic self-indexes for repetitive texts. That is, if we had a dynamic self-index (or at least an index able to insert strings at the end), we could easily produce the LZ parsing of the text. Hence, studying how to build a dynamic LZ-based index is a natural research direction.

It would also be interesting to study whether counting could be answered more efficiently, whether more meaningful operations like approximate pattern matching could be implemented, and whether some operations of the suffix tree could be simulated on the index.

Another interesting research goal is to decrease the space factor, both in theory and in practice. Compared to a pure LZ77 compressor, the factor is 4 in theory and 3-6 in practice, as mentioned. Such a reduction has been achieved for Arroyuelo and Navarro's LZ-index, whose factor went from 4 [Nav04] down to (2 + ε) [ANS06] (see Section 2.14.3). This was possible because there was some redundancy between the components of the index. We could also reduce the factor by coding the bitmaps of the wavelet tree of depths in compressed form [RRR02] (see Section 2.5), since most depths are very low in practice and only some are high.
However, the space improvement would not be too impressive, since the wavelet tree of depths takes just 2-4% of the index size.

We have also presented a text corpus oriented to repetitive texts. The main goal of this corpus is to become a reference set for experimentation with this kind of text. The corpus is publicly available at http://pizzachili.dcc.uchile.cl/repcorpus.html.

Finally, our implementation is publicly available at http://pizzachili.dcc.uchile.cl/indexes/LZ77-index, to promote its use in real-world and research applications and to serve as a baseline for future developments in repetitive text indexing.

Bibliography

[ACNS10] Diego Arroyuelo, Rodrigo Cánovas, Gonzalo Navarro, and Kunihiko Sadakane. Succinct trees in practice. In Proc. 11th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 84-97. SIAM Press, 2010.
[AN] Diego Arroyuelo and Gonzalo Navarro. Space-efficient construction of Lempel-Ziv compressed text indexes. Manuscript.
[AN07] Diego Arroyuelo and Gonzalo Navarro. Smaller and faster Lempel-Ziv indices. In Proc. 18th International Workshop on Combinatorial Algorithms (IWOCA), pages 11-20. College Publications, UK, 2007.
[AN10] Diego Arroyuelo and Gonzalo Navarro. Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices. ACM Journal of Experimental Algorithmics (JEA), 2010. To appear.
[ANS06] Diego Arroyuelo, Gonzalo Navarro, and Kunihiko Sadakane. Reducing the space requirement of LZ-index. In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 4009, pages 319-330, 2006.
[ANS10] Diego Arroyuelo, Gonzalo Navarro, and Kunihiko Sadakane. Stronger Lempel-Ziv based compressed text indexing. Algorithmica, 2010. To appear.
[AS99] Jean-Paul Allouche and Jeffrey Shallit. The ubiquitous Prouhet-Thue-Morse sequence. In Proc. 1st International Conference on Sequences and their Applications (SETA), pages 1-16. Springer-Verlag, 1999.
[B+08] David R. Bentley et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53-59, 2008.
[Ban09] Mohammad Banikazemi. LZB: Data compression with bounded references. In Proc. 19th Data Compression Conference (DCC), page 436. IEEE Computer Society, 2009. Poster.
[BDM+05] David Benoit, Erik D. Demaine, J. Ian Munro, Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Representing trees of higher degree. Algorithmica, 43(4):275-292, 2005.
[BLN09] Nieves Brisaboa, Susana Ladra, and Gonzalo Navarro. Directly addressable variable-length codes. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5721, pages 122-130. Springer, 2009.
[BM77] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762-772, 1977.
[BW94] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
[CFMPN10] Francisco Claude, Antonio Fariña, Miguel Martínez-Prieto, and Gonzalo Navarro. Compressed q-gram indexing for highly repetitive biological sequences. In Proc. 10th IEEE Conference on Bioinformatics and Bioengineering (BIBE), pages 86-91. IEEE Press, 2010.
[Cha88] Bernard Chazelle. A functional approach to data structures and its use in multidimensional searching. SIAM Journal on Computing, 17(3):427-462, 1988.
[CIT08] Maxime Crochemore, Lucian Ilie, and Liviu Tinta. Towards a solution to the "runs" conjecture. In Proc. 19th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 290-302. Springer-Verlag, 2008.
[Cla96] David Clark. Compact Pat Trees. PhD thesis, University of Waterloo, 1996.
[CLL+05] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554-2576, 2005.
[CN09] Francisco Claude and Gonzalo Navarro. Self-indexed text compression using straight-line programs. In Proc. 34th International Symposium on Mathematical Foundations of Computer Science (MFCS), LNCS 5734, pages 235-246. Springer, 2009.
[CN10] Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based compression. Fundamenta Informaticae, 2010. To appear.
[CPS08] Gang Chen, Simon J. Puglisi, and William F. Smyth. Lempel-Ziv factorization using less time & space. Mathematics in Computer Science, 1(4):605-623, June 2008.
[FG89] Edward R. Fiala and Daniel H. Greene. Data compression with finite windows. Communications of the ACM, 32(4):490-505, 1989.
[FH07] Johannes Fischer and Volker Heun. A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In Proc. 1st International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies (ESCAPE), LNCS 4614, pages 459-470. Springer-Verlag, 2007.
[FM05] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
[FMMN07] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG), 3(2):article 20, 2007.
[FSS03] Frantisek Franek, R. J. Simpson, and William F. Smyth. The maximum number of runs in a string. In Proc. Australian Workshop on Combinatorial Algorithms (AWOCA), pages 26-35, 2003.
[GBYS92] Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures & Algorithms, pages 66-82. Prentice Hall, 1992.
[GGMN05] Rodrigo González, Szymon Grabowski, Veli Mäkinen, and Gonzalo Navarro. Practical implementation of rank and select queries. In Poster Proc. Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA), pages 27-38. CTI Press and Ellinika Grammata, 2005.
[GGV03] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841-850. SIAM Press, 2003.
[GN08] Rodrigo González and Gonzalo Navarro. Rank/select on dynamic compressed sequences and applications. Theoretical Computer Science, 410:4414-4422, 2008.
[GV05] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2005.
[Ham86] Richard Wesley Hamming. Coding and Information Theory. Prentice-Hall, 1986.
[IT06] Shunsuke Inenaga and Masayuki Takeda. On-line linear-time construction of word suffix trees. In Proc. 17th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 60-71. Springer-Verlag, 2006.
[Jac89] Guy Jacobson. Space-efficient static trees and graphs. In Proc. Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 549-554. IEEE Computer Society, 1989.
[Kär99] Juha Kärkkäinen. Repetition-Based Text Indexes. PhD thesis, Department of Computer Science, University of Helsinki, Finland, November 1999.
[KK99] Roman Kolpakov and Gregory Kucherov. On maximal repetitions in words. Journal of Discrete Algorithms, 1:159-186, 1999.
[KM99] S. Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM Journal on Computing, 29(3):893-911, 1999.
[KMP77] Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323-350, 1977.
[KN10] Sebastian Kreft and Gonzalo Navarro. LZ77-like compression with fast random access. In Proc. 20th Data Compression Conference (DCC), pages 239-248, 2010.
[KPZ10] Shanika Kuruppu, Simon J. Puglisi, and Justin Zobel. Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In Proc. 17th International Symposium on String Processing and Information Retrieval (SPIRE), pages 201-206, 2010.
[KS03] Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP), LNCS 2719, pages 943-955, 2003.
[KU96a] Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd South American Workshop on String Processing (WSP), pages 141-155. Carleton University Press, 1996.
[KU96b] Juha Kärkkäinen and Esko Ukkonen. Sparse suffix trees. In Proc. 2nd Annual International Conference on Computing and Combinatorics (COCOON), pages 219-230. Springer-Verlag, 1996.
[LM00] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722-1732, 2000.
[Lot02] M. Lothaire. Algebraic Combinatorics on Words. Cambridge University Press, 2002.
[LZ76] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976.
[Mai89] Michael G. Main. Detecting leftmost maximal periodicities. Discrete Applied Mathematics, 25(1-2):145-153, 1989.
[Man01] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407-430, 2001.
[McC76] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262-272, 1976.
[MKI+08] Wataru Matsubara, Kazuhiko Kusano, Akira Ishino, Hideo Bannai, and Ayumi Shinohara. New lower bounds for the maximum number of runs in a string. In Proc. Prague Stringology Conference (PSC), pages 140-145, 2008.
[MM93] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935-948, 1993.
[MN07] Veli Mäkinen and Gonzalo Navarro. Rank and select revisited and extended. Theoretical Computer Science, 387(3):332-347, 2007. Special issue on "The Burrows-Wheeler Transform and its Applications".
[MNSV10] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281-308, 2010.
[Mor68] Donald R. Morrison. PATRICIA: practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4):514-534, 1968.
[MR01] J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing, 31(3):762-776, 2001.
[MRRR03] J. Ian Munro, Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct representations of permutations. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP), LNCS 2719, pages 345-356. Springer, 2003.
[Mun86] J. Ian Munro. An implicit data structure supporting insertion, deletion, and search in O(log n) time. Journal of Computer and System Sciences, 33(1):66-74, 1986.
[Nav04] Gonzalo Navarro. Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms, 2(1):87-114, 2004.
[Nav08] Gonzalo Navarro. Indexing LZ77: The next step in self-indexing. Keynote talk at the Third Workshop on Compression, Text, and Algorithms, 2008.
[Nav09] Gonzalo Navarro. Implementing the LZ-index: Theory versus practice. ACM Journal of Experimental Algorithmics (JEA), 13:article 2, 2009.
[NM07] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.
[OS07] Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM Press, 2007.
[OS08] Daisuke Okanohara and Kunihiko Sadakane. An online algorithm for finding the longest previous factors. In Proc. 16th Annual European Symposium on Algorithms (ESA), pages 696-707. Springer-Verlag, 2008.
[PST07] Simon J. Puglisi, William F. Smyth, and Andrew H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):article 4, 2007.
[PWZ92] Eli Plotnik, Marcelo Weinberger, and Jacob Ziv. Upper bounds on the probability of sequences emitted by finite-state sources and on the redundancy of the Lempel-Ziv algorithm. IEEE Transactions on Information Theory, 38(1):66-72, 1992.
[RNO08] Luís M. S. Russo, Gonzalo Navarro, and Arlindo L. Oliveira. Fully-compressed suffix trees. In Proc. 8th Latin American Symposium on Theoretical Informatics (LATIN), LNCS 4957, pages 362-373, 2008.
[RO08] Luís M. S. Russo and Arlindo L. Oliveira. A compressed self-index using a Ziv-Lempel dictionary. Journal of Information Retrieval, 5(3):501-513, 2008. Special issue SPIRE 2006.
[RRR02] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 233-242. SIAM Press, 2002.
[Ryt03] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science, 302(1-3):211-222, 2003.
[Sad03] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294-313, 2003.
[SS82] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. Journal of the ACM, 29(4):928-951, 1982.
[SVMN08] Jouni Sirén, Niko Välimäki, Veli Mäkinen, and Gonzalo Navarro. Run-length compressed indexes are superior for highly repetitive sequence collections. In Proc. 15th International Symposium on String Processing and Information Retrieval (SPIRE), LNCS 5280, pages 164-175. Springer, 2008.
[Ukk95] Esko Ukkonen. Constructing suffix trees on-line in linear time. Algorithmica, 14(3):249-260, 1995.
[Wei73] Peter Weiner. Linear pattern matching algorithms. In Proc. 14th Annual Symposium on Switching and Automata Theory, pages 1-11, 1973.
[Wel84] Terry A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, 1984.
[Wil91] Ross N. Williams. An extremely fast Ziv-Lempel data compression algorithm. In Proc. Data Compression Conference (DCC), pages 362-371, 1991.
[WZ99] Hugh E. Williams and Justin Zobel. Compressing integers for fast file access. The Computer Journal, 42(3):193-201, 1999.
[ZdMNBY00] Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates. Compression: A key for next-generation text retrieval systems. IEEE Computer, 33(11):37-44, 2000.
[ZL77] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.
[ZL78] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, 1978.

Appendix A: Experimental Results
In this appendix we present the results of the experiments described in Section 6.1 for the remaining texts.

[Figure A.1: F results (1): extraction speed versus snippet length, locate time versus pattern length, and exist time for patterns found/not found. Note the logscales.]
[Figure A.2: F results (2): extract time, locate time for |P| = 2, 4, 10, 15, 20, and exist time for patterns found/not found with |P| = 20, as a function of the compression ratio. Note the logscales.]

[Figure A.3: R results (1): extraction speed versus snippet length, locate time versus pattern length, and exist time for patterns found/not found. Note the logscales.]

[Figure A.4: R results (2): extract time, locate times, and exist times as a function of the compression ratio. Note the logscales.]

[Figure A.5: Proteins 0.1% results (1): extraction speed versus snippet length, locate time versus pattern length, and exist time for patterns found/not found. Note the logscales.]

[Figure A.6: Proteins 0.1% results (2): extract time, locate times, and exist times as a function of the compression ratio. Note the logscales.]
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeEnglish 0.1% RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundEnglish 0.1% RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundEnglish 0.1% RLCSALZ77 LZ-End Figure A.7: English 0.1% results (1). Note the logscales.121 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )English 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)English 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)English 0.1% RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)English 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)English 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)English 0.1% RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)English 0.1% RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)English 0.1% RLCSALZ77LZ-End
Figure A.8: English 0.1% results (2). Note the logscales.122 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedSources 0.1% RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeSources 0.1% RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundSources 0.1% RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundSources 0.1% RLCSALZ77 LZ-End Figure A.9: Sources 0.1% results (1). Note the logscales.123 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Sources 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Sources 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Sources 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Sources 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Sources 0.1% RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Sources 0.1% RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Sources 0.1% RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Sources 0.1% RLCSALZ77LZ-End
Figure A.10: Sources 0.1% results (2). Note the logscales.124 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedPara
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimePara
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundPara
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundPara
RLCSALZ77 LZ-End Figure A.11: Para results (1). Note the logscales.125 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Para RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Para
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Para
RLCSALZ77LZ-End 0 0.05 0.1 0.15 0.2 0.25 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Para
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Para
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Para
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Para
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Para
RLCSALZ77LZ-End
Figure A.12: Para results (2). Note the logscales.126 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedCere
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeCere
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundCere
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundCere
RLCSALZ77 LZ-End Figure A.13: Cere results (1). Note the logscales.127 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Cere RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Cere
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Cere
RLCSALZ77LZ-End 0 0.05 0.1 0.15 0.2 0.25 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Cere
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Cere
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 5 10 15 20 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Cere
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Cere
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Cere
RLCSALZ77LZ-End
Figure A.14: Cere results (2). Note the logscales.128 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedInfluenza
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeInfluenza
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundInfluenza
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundInfluenza
RLCSALZ77 LZ-End Figure A.15: Influenza results (1). Note the logscales.129 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Influenza RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0 2 4 6 8 10 12 14 16 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Influenza
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0 2 4 6 8 10 12 14 16 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Influenza
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0 2 4 6 8 10 12 14 16 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Influenza
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 0 2 4 6 8 10 12 14 16 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Influenza
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0 2 4 6 8 10 12 14 16 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Influenza
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 16 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Influenza
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 16 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Influenza
RLCSALZ77LZ-End
Figure A.16: Influenza results (2). Note the logscales.130 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedEscherichia Coli
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 10 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeEscherichia Coli
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundEscherichia Coli
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundEscherichia Coli
RLCSALZ77 LZ-End Figure A.17: Escherichia Coli results (1). Note the logscales.131 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Escherichia Coli RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Escherichia Coli
RLCSALZ77LZ-End 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Escherichia Coli
RLCSALZ77LZ-End 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Escherichia Coli
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Escherichia Coli
RLCSALZ77LZ-End 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Escherichia Coli
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Escherichia Coli
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 5 10 15 20 25 30 35 40 45 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Escherichia Coli
RLCSALZ77LZ-End
Figure A.18: Escherichia Coli results (2). Note the logscales.132 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedCoreutils
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeCoreutils
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundCoreutils
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundCoreutils
RLCSALZ77 LZ-End Figure A.19: Coreutils results (1). Note the logscales.133 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )Coreutils RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Coreutils
RLCSALZ77LZ-End 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Coreutils
RLCSALZ77LZ-End 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Coreutils
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Coreutils
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Coreutils
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Coreutils
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 16 18 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Coreutils
RLCSALZ77LZ-End
Figure A.20: Coreutils results (2). Note the logscales.134 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedEinstein (en)
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeEinstein (en)
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundEinstein (en)
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundEinstein (en)
RLCSALZ77 LZ-End Figure A.21: Einstein (en) results (1). Note the logscales.135 ppendix A Experimental Results T i m e ( µ s / c ha r) Compression Ratio
Extract Time (|P|=2 )Einstein (en) RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Einstein (en)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Einstein (en)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Einstein (en)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Einstein (en)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Einstein (en)
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0.1 1 10 100 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Einstein (en)
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 100 0.1 1 10 100 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Einstein (en)
RLCSALZ77LZ-End
Figure A.22: Einstein (en) results (2). Note the logscales.136 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedEinstein (de)
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeEinstein (de)
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundEinstein (de)
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundEinstein (de)
RLCSALZ77 LZ-End Figure A.23: Einstein (de) results (1). Note the logscales.137 ppendix A Experimental Results T i m e ( µ s / c ha r) Compression Ratio
Extract Time (|P|=2 )Einstein (de) RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)Einstein (de)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)Einstein (de)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)Einstein (de)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)Einstein (de)
RLCSALZ77LZ-End 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.1 1 10 100 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)Einstein (de)
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0.1 1 10 100 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)Einstein (de)
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0.1 1 10 100 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)Einstein (de)
RLCSALZ77LZ-End
Figure A.24: Einstein (de) results (2). Note the logscales.138 ppendix A Experimental Results E x t r a c t i on s peed ( M c ha r s / s ) log(Snippet Length) Extraction SpeedWorld Leaders
RLCSA
RLCSA
RLCSA
RLCSA LZ77LZ-End 0.001 0.01 0.1 1 5 10 15 20 25 30 35 40 T i m e ( m s / o ccs ) Pattern Length
Locate TimeWorld Leaders
RLCSA
RLCSA
RLCSA
RLCSA LZ77 LZ77 LZ-End LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns FoundWorld Leaders
RLCSALZ77 LZ-End T i m e ( m s / pa tt e r n ) log(Pattern Length/5) Exist Time for Patterns not FoundWorld Leaders
RLCSALZ77 LZ-End Figure A.25: World Leaders results (1). Note the logscales.139 ppendix A Experimental Results T i m e ( m s / c ha r) Compression Ratio
Extract Time (|P|=2 )World Leaders RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=10)World Leaders
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=15)World Leaders
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=20)World Leaders
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=2)World Leaders
RLCSALZ77LZ-End 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0 2 4 6 8 10 12 14 T i m e ( m s / o cc ) Compression Ratio
Locate Time (|P|=4)World Leaders
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns Found (|P|=20)World Leaders
RLCSALZ77LZ-End 0.0001 0.001 0.01 0.1 1 10 0 2 4 6 8 10 12 14 T i m e ( m s / pa tt e r n ) Compression Ratio
Exist Time for Patterns not Found (|P|=20)World Leaders