Diego Arroyuelo | Researchain

Publication

Featured research published by Diego Arroyuelo.

Algorithmica | 2012

Stronger Lempel-Ziv Based Compressed Text Indexing

Diego Arroyuelo; Gonzalo Navarro; Kunihiko Sadakane

Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space.
The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH_k(T) + o(u log σ) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k = o(log_σ u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m^3 log σ + (m + occ) log u) worst-case time. Although this index has proven very competitive in practice, the O(m^3 log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives.
In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+ε)uH_k(T) + o(u log σ) bits of space, for any constant ε > 0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m^2 + (m + occ) log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length ℓ in optimal O(ℓ / log_σ u) time. In addition, we show how the space can be squeezed to (1+ε)uH_k(T) + o(u log σ) to obtain a structure with O(m^2) average search time for m ≥ 2 log_σ u. Alternatively, the search time of LZ-indices can be improved to O((m + occ) log u) with (3+ε)uH_k(T) + o(u log σ) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.
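
The space bounds above are stated in terms of the k-th order empirical entropy of T. As a small illustration, the sketch below computes it from the standard definition, H_k(T) = (1/u) Σ_w |T_w| H_0(T_w), where T_w collects the symbols that follow each length-k context w in T. The function names are hypothetical and the code is not taken from the paper; it only helps read a bound such as 4u times the entropy as a concrete number of bits.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts, total):
    """Zero-order empirical entropy (bits per symbol) from symbol counts."""
    return -sum((c / total) * log2(c / total) for c in counts.values())


def empirical_entropy(text, k):
    """k-th order empirical entropy of text, in bits per symbol."""
    u = len(text)
    if u == 0:
        return 0.0
    if k == 0:
        return h0(Counter(text), u)
    # Group every symbol by the k symbols that precede it (its context).
    followers = defaultdict(Counter)
    for i in range(k, u):
        followers[text[i - k:i]][text[i]] += 1
    # Sum |T_w| * H_0(T_w) over all contexts w, then normalise by u.
    total_bits = 0.0
    for counts in followers.values():
        n_w = sum(counts.values())
        total_bits += n_w * h0(counts, n_w)
    return total_bits / u
```

Under this sketch, `empirical_entropy("abababab", 0)` is 1 bit per symbol, while `empirical_entropy("abababab", 1)` is 0, since each symbol is fully determined by the one before it.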

Combinatorial Pattern Matching | 2007

A Lempel-Ziv text index on secondary storage

Diego Arroyuelo; Gonzalo Navarro

Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which means 39%-65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.
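
The index above is built on Lempel-Ziv compression. As a rough illustration of the underlying idea, the sketch below performs an LZ78-style parsing of a text into phrases, each being a previous phrase extended by one symbol. The choice of the LZ78 variant is an assumption here (the abstract does not name the variant), the function names are hypothetical, and nothing in this sketch reflects the paper's index layout or its disk organisation.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases, returned as (parent_id, symbol) pairs.

    Phrase ids start at 1; id 0 denotes the empty phrase.
    """
    trie = {}      # (parent_id, symbol) -> phrase_id
    phrases = []
    node = 0       # current position in the trie of phrases
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending a known phrase
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append((node, ch))        # new phrase = parent + ch
            node = 0
    if node != 0:
        # The remaining suffix equals an existing phrase; emit it with no symbol.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original text from the (parent_id, symbol) pairs."""
    out = [""]
    for parent, ch in phrases:
        out.append(out[parent] + ch)
    return "".join(out[1:])
```

For instance, `lz78_parse("abababab")` yields the phrases a, b, ab, aba, b, and `lz78_decode` restores the input from them.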

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2012

To index or not to index: time-space trade-offs in search engines with positional ranking functions

Diego Arroyuelo; Senén González; Mauricio Marin; Mauricio Oyarzún; Torsten Suel

Positional ranking functions, widely used in Web search engines, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time-space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index-based alternatives for positional data. We aim to answer the question of whether one should index positional data or not. We show that there is a wide range of practical time-space trade-offs. Moreover, we show that both position and textual data can be stored using about 71% of the space used by traditional positional indexes, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from the literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact on query processing time.
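
To make the distinction between positional and nonpositional indexes concrete, here is a minimal sketch that builds both for a toy collection and uses the positional one for a simple proximity measure. The tokenisation (lowercasing and whitespace splitting), the function names, and the proximity measure are illustrative assumptions, not the structures or ranking functions evaluated in the paper.

```python
from collections import defaultdict


def build_indexes(docs):
    """Return (nonpositional, positional) inverted indexes for a list of texts."""
    nonpositional = defaultdict(list)   # term -> [doc_id, ...]
    positional = defaultdict(dict)      # term -> {doc_id: [position, ...]}
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            if doc_id not in positional[term]:
                positional[term][doc_id] = []
                nonpositional[term].append(doc_id)
            positional[term][doc_id].append(pos)
    return nonpositional, positional


def min_distance(positional, doc_id, term_a, term_b):
    """Smallest gap between two terms in a document, or None if one is absent."""
    pa = positional.get(term_a, {}).get(doc_id)
    pb = positional.get(term_b, {}).get(doc_id)
    if not pa or not pb:
        return None
    return min(abs(a - b) for a in pa for b in pb)
```

The positional index stores one entry per term occurrence rather than one per document, which is where the extra space comes from. A non-index-based alternative would instead recover positions from the stored text at query time.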

Combinatorial Pattern Matching | 2008

An Improved Succinct Representation for Dynamic k-ary Trees

Diego Arroyuelo

k-ary trees are a fundamental data structure in many text-processing algorithms (e.g., text searching). The traditional pointer-based representation of trees is space consuming, and hence only relatively small trees can be kept in main memory. Nowadays, however, many applications need to store a huge amount of information. In this paper we present a succinct representation for dynamic k-ary trees of n nodes, requiring 2n + n log k + o(n log k) bits of space, which is close to the information-theoretic lower bound. Unlike alternative representations where the operations on the tree can usually be computed in O(log n) time, our data structure is able to take advantage of asymptotically smaller values of k, supporting the basic operations parent and child in O(log k + log log n) time, which is o(log n) time whenever log k = o(log n). Insertions and deletions of leaves in the tree are supported in O((log k + log log n)(1 + log k / log(log k + log log n))) amortized time. Our representation also supports more specialized operations (like subtree size, depth, etc.), and provides a new trade-off when k = O(1) allowing faster updates (in O(log log n) amortized time, versus the amortized time of O((log log n)^(1+ε)), for ε > 0, from Raman and Rao [21]), at the cost of slower basic operations (in O(log log n) time, versus O(1) time of [21]).

International Symposium on Algorithms and Computation | 2005

Space-efficient construction of LZ-index

Diego Arroyuelo; Gonzalo Navarro

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires 4uH_k(1+o(1)) bits of space, where u is the text length in characters and H_k is its k-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct the LZ-index, requiring (4+ε)uH_k + o(u) bits of space, for any constant 0 < ε < 1, and O(σu) time, where σ is the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Document identifier reassignment and run-length-compressed inverted indexes for improved search performance

Diego Arroyuelo; Senén González; Mauricio Oyarzún; Victor Sepulveda

Text search engines are a fundamental tool nowadays. Their efficiency relies on a popular and simple data structure: the inverted index. Currently, inverted indexes can be represented very efficiently using index compression schemes. Recent investigations also study how an optimized document ordering can be used to assign document identifiers (docIDs) to the document database. This yields important improvements in index compression and query processing time. In this paper we follow this line of research, yet from a different perspective. We propose a docID reassignment method that allows one to focus on a given subset of inverted lists to improve their performance. We then use run-length encoding to compress these lists (as many consecutive 1s are generated). We show that by using this approach, not only the performance of the particular subset of inverted lists is improved, but also that of the whole inverted index. Our experimental results indicate a reduction of about 10% in the space usage of the whole index [...] docID reassignment was focused. Also, decompression speed is up to 1.22 times faster if the runs must be explicitly decompressed and up to 4.58 times faster if implicit decompression of runs is allowed. Finally, we also improve the Document-at-a-Time query processing time of AND queries (by up to 12%), WAND queries (by up to 23%) and full (non-ranked) OR queries (by up to 86%).

Combinatorial Pattern Matching | 2006

Reducing the space requirement of LZ-Index

Diego Arroyuelo; Gonzalo Navarro; Kunihiko Sadakane

The LZ-index is a compressed full-text self-index able to represent a text P1...m, over an alphabet of size σ = O(polylog(u)) ...

String Processing and Information Retrieval | 2003

Memory-Adaptative Dynamic Spatial Approximation Trees

Diego Arroyuelo; Francisca Muñoz; Gonzalo Navarro; Nora Reyes

ACM Journal of Experimental Algorithms | 2010

Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices

Diego Arroyuelo; Gonzalo Navarro

International Symposium on Algorithms and Computation | 2009

Untangled Monotonic Chains and Adaptive Range Search

Diego Arroyuelo; Francisco Claude; Reza Dorrigiv; Stephane Durocher; Meng He; Alejandro López-Ortiz; J. Ian Munro; Patrick K. Nicholson; Alejandro Salinger; Matthew Skala
