Wing-Kai Hon | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Wing-Kai Hon is active.

Explore More

Publication

Featured researches published by Wing-Kai Hon.

foundations of computer science | 2003

Breaking a time-and-space barrier in constructing full-text indices

Wing-Kai Hon; Kunihiko Sadakane; Wing-Kin Sung

Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. It has been open for a long time whether these indices can be constructed in both O(n log n) time and O(n log n)-bit working space, where n denotes the length of the text. In the literature, the fastest algorithm runs in O(n) time, while it requires O(n log n)-bit working space. On the other hand, the most space-efficient algorithm requires O(n)-bit working space while it runs in O(n log n) time. This paper breaks the long-standing time-and-space barrier under the unit-cost word RAM. We give an algorithm for constructing the suffix array which takes O(n) time and O(n)-bit working space, for texts with constant-size alphabets. Note that both the time and the space bounds are optimal. For constructing the suffix tree, our algorithm requires O(n log/sup /spl epsi//n) time and O(n)-bit working space for any 0 < /spl epsi/ < 1. Apart from that, our algorithm can also be adopted to build other existing full-text indices, such as Compressed Suffix Tree, Compressed Suffix Arrays and FM-index. We also study the general case where the size of the alphabet A is not constant. Our algorithm can construct a suffix array and a suffix tree using optimal O(n log |A|)-bit working space while running in O(n log log |A|) time and O(n log/sup /spl epsi//n) time, respectively. These are the first algorithms that achieve 0(n log n) time with optimal working space, under a reasonable assumption that log |A| = o(log n).

computing and combinatorics conference | 2002

A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays

Wing-Kai Hon; Tak Wah Lam; Kunihiko Sadakane; Wing-Kin Sung; Siu-Ming Yiu

With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research has been centered on analyzing this sequence. Theoretically speaking, it is now feasible to accommodate an index for human DNA in the main memory so that any pattern can be located efficiently. This is due to the recent breakthrough on compressed suffix arrays, which reduces the space requirement from O(n log n) bits to O(n) bits. However, constructing compressed suffix arrays is still not an easy task because we still have to compute suffix arrays first and need a working memory of O(n log n) bits (i.e., more than 13 gigabytes for human DNA). This paper initiates the study of constructing compressed suffix arrays directly from the text. The main contribution is a construction algorithm that uses only O(n) bits of working memory, and the time complexity is O(n log n). Our construction algorithm is also time and space efficient for texts with large alphabets such as Chinese or Japanese. Precisely, when the alphabet size is |Σ|, the working space is O(n log |Σ|) bits, and the time complexity remains O(n log n), which is independent of |Σ|.

ACM Transactions on Algorithms | 2007

Compressed indexes for dynamic text collections

Ho-Leung Chan; Wing-Kai Hon; Tak Wah Lam; Kunihiko Sadakane

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically. This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O(|P| log n + occ log2 n) time, where occ is the number of occurrences. Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log2 d) time when a pattern P is inserted or deleted from D; also, the index supports searching the occurrences of all patterns of D in any text T in O((|T| + occ)log2 d) time. When compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only. The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log2 n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.

data compression conference | 2008

Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing

Yu-Feng Chien; Wing-Kai Hon; Rahul Shah; J. Scott Vitter

We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called geometric Burrows-Wheeler transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching [16]? We also introduce a reverse transform, called Points2Text; which converts a set of points into text. This transform allows us to derive the first known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.

international acm sigir conference on research and development in information retrieval | 2011

Inverted indexes for phrases and strings

Manish Patil; Sharma V. Thankachan; Rahul Shah; Wing-Kai Hon; Jeffrey Scott Vitter; Sabrina Chandrasekaran

Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming, in that only predefined pattern queries can be supported efficiently. In terms of string documents where word boundaries are undefined, if we have to index all the substrings of a given document, then the storage quickly becomes quadratic in the data size. Also, if we want to apply the same type of indexes for querying phrases or sequence of words, then the inverted index will end up storing redundant information. In this paper, we show the first set of inverted indexes which work naturally for strings as well as phrase searching. The central idea is to exclude document d in the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to the optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access in terms of frequency and document id based retrieval. Compression and speed trade-offs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k based retrieval under relevance metrics like frequency and tf-idf.

combinatorial pattern matching | 2010

Compression, indexing, and retrieval for massive string data

Wing-Kai Hon; Rahul Shah; Jeffrey Scott Vitter

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

combinatorial pattern matching | 2004

Approximate String Matching Using Compressed Suffix Arrays

Trinh N. D. Huynh; Wing-Kai Hon; Tak Wah Lam; Wing-Kin Sung

Let T be a text of length n and P be a pattern of length m, both strings over a fixed finite alphabet A. The k-difference (k-mismatch, respectively) problem is to find all occurrences of P in T that have edit distance (Hamming distance, respectively) at most k from P. In this paper we investigate a well-studied case in which k=1 and T is fixed and preprocessed into an indexing data structure so that any pattern query can be answered faster [16-19]. This paper gives a solution using O(n) bits indexing data structure with O(mlog2 n) query time. To the best of our knowledge, this is the first result which requires linear indexing space. The results can be extended for the k-difference problem with k≥1.

international symposium on algorithms and computation | 2003

Succinct Data Structures for Searchable Partial Sums

Wing-Kai Hon; Kunihiko Sadakane; Wing-Kin Sung

The Searchable Partial Sums is a data structure problem that maintains a sequence of n non-negative k-bit integers; in addition, it allows us to modify the entries by the update operation, while supporting two types of queries: sum and search. Recently, researchers focus on the succinct representation of the data structure in kn+o(kn) bits. They study the tradeoff in time between the query and the update operations, under the word RAM with word size O(lg U) bits. For the special case where k=1 (which is known as Dynamic Bit Vector problem), Raman et al. showed that both queries can be supported in O(logb n) time, while update requires O(b) amortized time for any b with lg n/lg lg n ≤ b ≤ n. This paper generalizes the study and shows that even for k = O(lg lg n), both query and update operations can be maintained using the same time complexities. Also, the time for update becomes worst-case time.

SIAM Journal on Computing | 2009

Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

Wing-Kai Hon; Kunihiko Sadakane; Wing-Kin Sung

Suffix trees and suffix arrays are the most prominent full-text indices, and their construction algorithms are well studied. In the literature, the fastest algorithm runs in

string processing and information retrieval | 2010